
Author Topic: AVX and SSE support question  (Read 89908 times)

dzjorrit

  • Newbie
  • Posts: 2
AVX and SSE support question
« on: May 25, 2016, 10:22:03 am »
Hi,
I have the following code:

const
  vectorsize = 4;
type
  tVector=array[0..vectorsize-1] of single;

function vectoradd(a,b:tVector):tVector;
begin
  result:=a+b;
end;
   
This compiles fine when SSE and vector processing are enabled.

But when I increase vectorsize to 8 and enable the AVX compiler options I get this error:
Compile Project, Target: project1.exe: Exit code 1, Errors: 1
unit1.pas(60,12) Error: Internal error 200610072

Is AVX not properly supported yet by fpc or is this a bug?

I'm using lazarus 64bit version 1.6 with fpc 3.0.0 on Windows 10 x64 with AMD A10 AVX enabled processor.

Thanks!
Jorrit





Thaddy

  • Hero Member
  • *****
  • Posts: 14204
  • Probably until I exterminate Putin.
Re: AVX and SSE support question
« Reply #1 on: May 25, 2016, 10:51:56 am »
Which FPC version are you using? That's really important, because AVX is only properly supported from 3.0 and higher.

Ah, I see, 3.0. In that case: how did you compile? An internal error should never happen and should be reported on bugs.freepascal.org. If you ever see an internal error it is a bug by definition.

When you file your bug report give as much information as possible and preferably a complete code example that reproduces the bug.
« Last Edit: May 25, 2016, 10:59:27 am by Thaddy »
Specialize a type, not a var.

dzjorrit

  • Newbie
  • Posts: 2
Re: AVX and SSE support question
« Reply #2 on: May 25, 2016, 11:49:38 am »
Ok, thanks, I will file a bug report soon. I used a new Lazarus project, adding only the code from my post, with these compiler options:
-O4
-CfAVX
-CpCOREAVX
-OpCOREAVX
-Sv
-XX
-CX
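
For the bug report, a stand-alone version of the test could look roughly like the sketch below. The program name, the test values and the output loop are made up for illustration; only the type, the constant and vectoradd come from the first post. It assumes the build uses the options listed above (at minimum -Sv, which is what allows "+" on the array type) on the x86-64 setup described in the first post.

```pascal
program VectorAddRepro;
{$mode objfpc}

// Sketch of a stand-alone reproducer (hypothetical program name).
// Build with the compiler options listed in this post; without -Sv the
// array "+" in vectoradd is not accepted by the compiler.

const
  // 4 is the SSE case reported as working; 8 together with the AVX
  // options is the case reported to give Internal error 200610072.
  vectorsize = 4;

type
  tVector = array[0..vectorsize-1] of single;

function vectoradd(a, b: tVector): tVector;
begin
  result := a + b;
end;

var
  x, y, z: tVector;
  i: integer;
begin
  // Fill both inputs with simple, easy to check values.
  for i := 0 to vectorsize - 1 do
  begin
    x[i] := i;
    y[i] := 10 * i;
  end;

  z := vectoradd(x, y);

  // Print each element of the sum on its own line.
  for i := 0 to vectorsize - 1 do
    WriteLn(z[i]:0:1);
end.
```

Changing vectorsize to 8 and keeping the AVX options should then show whether the internal error still occurs with a given compiler version.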

Pascal Fan

  • Newbie
  • Posts: 2
Re: AVX and SSE support question
« Reply #3 on: June 11, 2016, 05:46:05 am »
I noticed something else related to vector processing. While experimenting with the code from this thread this evening, I used a vector size of 4 with SSE and vector processing enabled on FPC 3.0. The "a + b" operation works as shown, but replacing it with "a xor b" triggers the same internal error the original poster got with AVX and a size of 8. I suspect this isn't correct: an xor operation seems like it should be possible, and in any event an internal error is not the right response. I thought I should mention it, because it looks like there might be some issues with SSE vector processing as well.
« Last Edit: June 11, 2016, 05:51:36 am by Pascal Fan »

shobits1

  • Sr. Member
  • ****
  • Posts: 271
  • .
Re: AVX and SSE support question
« Reply #4 on: June 11, 2016, 06:13:26 am »
Maybe you should refrain from using -O4, since the compiler help screen contains the following:
Code:
  -O<x>  Optimizations:
      -O-        Disable optimizations
      -O1        Level 1 optimizations (quick and debugger friendly)
      -O2        Level 2 optimizations (-O1 + quick optimizations)
      -O3        Level 3 optimizations (-O2 + slow optimizations)
      -O4        Level 4 optimizations (-O3 + optimizations which might have unexpected side effects)

Maybe I'm wrong.

Pascal Fan

  • Newbie
  • Posts: 2
Re: AVX and SSE support question
« Reply #5 on: June 11, 2016, 06:54:07 am »
That's a totally valid point, and you're absolutely correct, but what I forgot to mention in my post is that my compiles were done with -O3, not the -O4 the original poster used. So I do think there might be a legitimate bug here. But you are absolutely correct that -O4 is probably not a good idea!

schuler

  • Full Member
  • ***
  • Posts: 223
Re: AVX and SSE support question
« Reply #6 on: September 19, 2017, 05:02:14 pm »
 :) Hello :)

I have exactly the same problem on FPC 3.0.2, 32-bit Windows. I tried -Cp and -Op with COREAVX, COREAVX2 and PENTIUMM.

  :( I can't figure out my own login info at bugs.freepascal  :(


marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: AVX and SSE support question
« Reply #7 on: September 19, 2017, 05:07:29 pm »
Did you check if something already existed ? :-)

https://bugs.freepascal.org/view.php?id=31612

Thaddy

  • Hero Member
  • *****
  • Posts: 14204
  • Probably until I exterminate Putin.
Re: AVX and SSE support question
« Reply #8 on: September 19, 2017, 05:08:23 pm »
You have to specify -Sv, otherwise the compiler does not do anything interesting. Your vector also needs to be a two-dimensional vector at the moment, and you have to specify the alignment.

schuler

  • Full Member
  • ***
  • Posts: 223
Re: AVX and SSE support question
« Reply #9 on: September 19, 2017, 08:58:52 pm »
 :) Hello Thaddy,  :)
Thank you for paying attention. Yes, I do have -Sv.

This is a known bug:
https://bugs.freepascal.org/view.php?id=31612

https://bugs.freepascal.org/view.php?id=30186

In case anyone is interested, I've just tried this with success:

Code: Pascal
{$ASMMODE intel}

type
  // Eight singles = 256 bits, the width of one ymm register
  Single8 = record a,b,c,d,x,y,z,w:Single end;

procedure testAsm2();
var
  A: Single8;
  AA: array[0..2] of Single8;
const
  ElSize = SizeOf(Single8);
begin
  A.x := 1;
  A.y := 10;
  A.z := 100;
  A.w := 1000;
  A.a := 1000;
  A.b := 1000;
  A.c := 1000;
  A.d := 1000;

  asm
    // Load A twice (unaligned), add it to itself, store the sum
    vmovups ymm0,A
    vmovups ymm1,A
    vaddps ymm0, ymm0, ymm1
    vmovups A,ymm0
    // Copy the sum into AA[1] and AA[2]; the index is a byte offset
    vmovups AA[1*ElSize],ymm0
    vmovups AA[2*ElSize],ymm0
  end;
  WriteLn(A.x:6:4,' ',A.y:6:4,' ', A.z:6:4,' ', A.w:6:4);

  AA[1].y := 12;
  AA[2].z := 14;

  WriteLn(AA[1].x:6:4,' ',AA[1].y:6:4,' ', AA[1].z:6:4,' ', AA[1].w:6:4);
  WriteLn(AA[2].x:6:4,' ',AA[2].y:6:4,' ', AA[2].z:6:4,' ', AA[2].w:6:4);

  asm
    // AA[0] := AA[1] + AA[2]
    vmovups ymm0,AA[1*ElSize]
    vmovups ymm1,AA[2*ElSize]
    vaddps ymm0, ymm0, ymm1
    vmovups AA[0*ElSize],ymm0
  end;
  WriteLn(AA[0].x:6:4,' ',AA[0].y:6:4,' ', AA[0].z:6:4,' ', AA[0].w:6:4);

end;

SonnyBoyXXl

  • Jr. Member
  • **
  • Posts: 57
Re: AVX and SSE support question
« Reply #10 on: November 02, 2017, 01:30:28 pm »
This is an interesting topic.
I'm currently working on the translation of the DirectXMath units since
Quote
"The math functions of the D3DX utility library are deprecated for Windows 8. We recommend that you use DirectXMath instead."

I have now tried the following code with these compiler settings:
-al
-CfAVX2
-CpCOREAVX2
-O3
-Sv
-OpCOREAVX2
-OoFASTMATH


Code: Pascal
class operator TXMFLOAT4.Add(a: TXMFLOAT4; b: TXMFLOAT4): TXMFLOAT4;
var
  r: TXMFLOAT4; // not used in this version
begin
  result.x:=a.x+b.x;
  result.y:=a.y+b.y;
  result.z:=a.z+b.z;
  result.w:=a.w+b.w;
end;

This produces the following assembler code, which looks quite inefficient:
Code:
# [292] begin
        pushl   %ebx
        pushl   %esi
        pushl   %edi
        leal    -56(%esp),%esp
.Lc41:
# Var a located at esp+0, size=OS_32
# Var b located at esp+4, size=OS_32
# Var r located at esp+8, size=OS_NO
        movl    %eax,(%esp)
        movl    %edx,4(%esp)
        movl    %ecx,%ebx
# Var $result located in register ebx
        movl    (%esp),%esi
        leal    24(%esp),%edi
        movl    $4,%ecx
        rep
        movsl
        movl    4(%esp),%esi
        leal    40(%esp),%edi
        movl    $4,%ecx
        rep
        movsl
.Ll61:
        movl    %ebx,%eax
        movb    $85,%cl
        movl    $16,%edx
        call    fpc_fillmem
        leal    8(%esp),%eax
        movb    $85,%cl
        movl    $16,%edx
        call    fpc_fillmem
.Ll62:
# [301] result.x:=a.x+b.x;
        vmovss  24(%esp),%xmm0
        vaddss  40(%esp),%xmm0,%xmm0
        vmovss  %xmm0,(%ebx)
.Ll63:
# [302] result.y:=a.y+b.y;
        vmovss  28(%esp),%xmm0
        vaddss  44(%esp),%xmm0,%xmm0
        vmovss  %xmm0,4(%ebx)
.Ll64:
# [303] result.z:=a.z+b.z;
        vmovss  32(%esp),%xmm0
        vaddss  48(%esp),%xmm0,%xmm0
        vmovss  %xmm0,8(%ebx)
.Ll65:
# [304] result.w:=a.w+b.w;
        vmovss  36(%esp),%xmm0
        vaddss  52(%esp),%xmm0,%xmm0
        vmovss  %xmm0,12(%ebx)
.Ll66:
# [305] end;
        leal    56(%esp),%esp
        popl    %edi
        popl    %esi
        popl    %ebx
        ret


I then tried to adapt it as follows:

Code: Pascal
class operator TXMFLOAT4.Add(a: TXMFLOAT4; b: TXMFLOAT4): TXMFLOAT4;
var
  r: TXMFLOAT4;
begin
  // Intel operand order; the unit needs {$ASMMODE intel} for this block
  asm
    vmovups xmm0,a
    vmovups xmm1,b
    vaddps xmm1, xmm0, xmm1
    vmovups r,xmm1
  end;
  result:=r;
end;

which gives this assembler code:

Code:
# [292] begin
        pushl   %ebp
.Lc41:
.Lc42:
        movl    %esp,%ebp
.Lc43:
        leal    -60(%esp),%esp
        pushl   %esi
        pushl   %edi
# Var a located at ebp-4, size=OS_32
# Var b located at ebp-8, size=OS_32
# Var $result located at ebp-12, size=OS_32
# Var r located at ebp-28, size=OS_NO
        movl    %eax,-4(%ebp)
        movl    %edx,-8(%ebp)
        movl    %ecx,-12(%ebp)
        movl    -4(%ebp),%eax
        leal    -44(%ebp),%edi
        movl    %eax,%esi
        movl    $4,%ecx
        rep
        movsl
        movl    -8(%ebp),%esi
        leal    -60(%ebp),%edi
        movl    $4,%ecx
        rep
        movsl
.Ll61:
        movl    -12(%ebp),%eax
        movb    $85,%cl
        movl    $16,%edx
        call    fpc_fillmem
        leal    -28(%ebp),%eax
        movb    $85,%cl
        movl    $16,%edx
        call    fpc_fillmem
#  CPU COREAVX2
.Ll62:
# [294] vmovups xmm0,a
        vmovups -44(%ebp),%xmm0
.Ll63:
# [295] vmovups xmm1,b
        vmovups -60(%ebp),%xmm1
.Ll64:
# [296] vaddps xmm1, xmm0, xmm1
        vaddps  %xmm1,%xmm0,%xmm1
.Ll65:
# [297] vmovups r,xmm1
        vmovups %xmm1,-28(%ebp)
#  CPU COREAVX2
.Ll66:
# [299] result:=r;
        movl    -12(%ebp),%edi
        leal    -28(%ebp),%esi
        movl    $4,%ecx
        rep
        movsl
.Ll67:
# [304] end;
        popl    %edi
        popl    %esi
        movl    %ebp,%esp
        popl    %ebp
        ret

This seems considerably more efficient, because the addition itself is done with a single instruction.
Is any further optimization possible?

And most of all: is writing the assembler manually really necessary, or is there a way to get this from FPC itself?

Thanks!

schuler

  • Full Member
  • ***
  • Posts: 223
Re: AVX and SSE support question
« Reply #11 on: November 09, 2017, 10:30:22 pm »
Hello Sonny.

You may find (or not - not sure) some inspiration with AVX + FPC here:
https://sourceforge.net/p/cai/svncode/HEAD/tree/trunk/lazarus/libs/uvolume.pas

Plus some details here:
https://www.youtube.com/watch?v=qGnfwpKUTIQ

Akira1364

  • Hero Member
  • *****
  • Posts: 561
Re: AVX and SSE support question
« Reply #12 on: November 18, 2017, 10:44:35 am »
Code: Pascal
class operator TXMFLOAT4.Add(a: TXMFLOAT4; b: TXMFLOAT4): TXMFLOAT4;
var
  r: TXMFLOAT4;
begin
  result.x:=a.x+b.x;
  result.y:=a.y+b.y;
  result.z:=a.z+b.z;
  result.w:=a.w+b.w;
end;

I'm assuming your translation of XMFLOAT4 is a record type? Try declaring the function like this instead:

Code: Pascal
class operator TXMFLOAT4.Add(constref A, B: TXMFLOAT4): TXMFLOAT4;

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #13 on: November 18, 2017, 02:22:49 pm »

Hi to all,
Very interesting subject. As I suspected, using asm on arrays of values increases speed.
But according to my tests, if the code consists of just one operation (e.g. C = A+B), the FPC compiler optimizes it better on its own.
With arrays that does not seem to be the case  8-)

I really need to do more research on SSE and AVX along these lines to improve GLScene's VectorMaths units.

Akira1364

  • Hero Member
  • *****
  • Posts: 561
Re: AVX and SSE support question
« Reply #14 on: November 18, 2017, 04:10:45 pm »
Ok, I just tested out my suggestion for Sonny with code that looks like this:

Code: Pascal
program DXMathTest;
{$mode delphi} // assumed: the post does not show which mode enabled record operators
type
  TXMFloat4 = record
    X, Y, Z, W: Single;
    class operator Add(constref A, B: TXMFloat4): TXMFloat4; inline;
  end;

  class operator TXMFloat4.Add(constref A, B: TXMFloat4): TXMFloat4;
  begin
    with Result do
    begin
      X := A.X + B.X;
      Y := A.Y + B.Y;
      Z := A.Z + B.Z;
      W := A.W + B.W;
    end;
  end;

begin
end.

After building it with the same compiler flags he said he was using, the added "constref" makes the assembler output much more reasonable, as I expected:

Code:
# [10] begin
        movq    %rcx,%rax
# Var $result located in register rax
# Var A located in register rdx
# Var B located in register r8
# [13] X := A.X + B.X;
        vmovss  (%rdx),%xmm0
        vaddss  (%r8),%xmm0,%xmm0
        vmovss  %xmm0,(%rax)
# [14] Y := A.Y + B.Y;
        vmovss  4(%rdx),%xmm0
        vaddss  4(%r8),%xmm0,%xmm0
        vmovss  %xmm0,4(%rax)
# [15] Z := A.Z + B.Z;
        vmovss  8(%rdx),%xmm0
        vaddss  8(%r8),%xmm0,%xmm0
        vmovss  %xmm0,8(%rax)
# [16] W := A.W + B.W;
        vmovss  12(%rdx),%xmm0
        vaddss  12(%r8),%xmm0,%xmm0
        vmovss  %xmm0,12(%rax)
# [18] end;
        ret

Moral of the story: always pass record types as "constref", or at the very least "const", wherever it's possible to do so, in order to avoid making copies of them every time the method is called.
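
To show the same rule applied outside the operator, here is a small sketch that actually calls the "+" operator and adds an ordinary function taking its records as "const". The program name, the Dot function and the test values are hypothetical, and {$mode delphi} is only one assumed way of enabling operators in records.

```pascal
program ConstRefDemo;
{$mode delphi} // assumption: one way to enable class operators in records

type
  TXMFloat4 = record
    X, Y, Z, W: Single;
    // constref: both operands are passed by reference, no copies are made
    class operator Add(constref A, B: TXMFloat4): TXMFloat4;
  end;

class operator TXMFloat4.Add(constref A, B: TXMFloat4): TXMFloat4;
begin
  Result.X := A.X + B.X;
  Result.Y := A.Y + B.Y;
  Result.Z := A.Z + B.Z;
  Result.W := A.W + B.W;
end;

// Hypothetical helper, not from the thread: "const" leaves it to the
// compiler whether the records are passed by reference or by value.
function Dot(const A, B: TXMFloat4): Single;
begin
  Result := A.X * B.X + A.Y * B.Y + A.Z * B.Z + A.W * B.W;
end;

var
  P, Q, S: TXMFloat4;
begin
  P.X := 1;  P.Y := 2;  P.Z := 3;  P.W := 4;
  Q.X := 10; Q.Y := 20; Q.Z := 30; Q.W := 40;

  S := P + Q; // resolves to TXMFloat4.Add
  WriteLn(S.X:0:1, ' ', S.Y:0:1, ' ', S.Z:0:1, ' ', S.W:0:1); // prints 11.0 22.0 33.0 44.0
  WriteLn(Dot(P, Q):0:1);                                     // prints 300.0
end.
```

Whether "const" alone avoids the copy depends on the target and the record size, which is why "constref" is the safer choice when the goal is specifically to pass a reference.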

 
