Recent

Author Topic: ? Fastest Way to Calculate Dot Product?  (Read 6906 times)

schuler

  • Full Member
  • ***
  • Posts: 223
? Fastest Way to Calculate Dot Product?
« on: August 16, 2017, 11:42:12 pm »
:) Hello Pascal Lovers  :)

I'm coding a convolution neural network at:
https://sourceforge.net/p/cai/svncode/HEAD/tree/trunk/lazarus/libs/uconvolutionneuralnetwork.pas

I have a piece of code that will be run billions of times:

Code: Pascal  [Select][+][-]
  1. function TVolume.DotProduct(Original: TVolume): T;
  2. var
  3.   I: integer;
  4.   vHigh: integer;
  5. begin
  6.   Result := 0;
  7.   vHigh := High(FData);
  8.   for I := 0 to vHigh do Result += FData[I] * Original.FData[I];
  9. end;
  10.  

Assuming type T is Single and FData is an array of Single, what is the fastest way to code it? Do we have any SS2, SS4 implementation ready for Lazarus/FPC ?

In the case we don't have anything ready to use, should I code one with asm/SSE?

 :)

Phil

  • Hero Member
  • *****
  • Posts: 2737
Re: ? Fastest Way to Calculate Dot Product?
« Reply #1 on: August 16, 2017, 11:55:55 pm »
Assuming type T is Single and FData is an array of Single, what is the fastest way to code it? Do we have any SS2, SS4 implementation ready for Lazarus/FPC ?

Maybe step 1 would be to make sure Single is faster the Double.


Thaddy

  • Hero Member
  • *****
  • Posts: 14157
  • Probably until I exterminate Putin.
Re: ? Fastest Way to Calculate Dot Product?
« Reply #2 on: August 17, 2017, 09:02:42 am »
Assuming type T is Single and FData is an array of Single, what is the fastest way to code it? Do we have any SS2, SS4 implementation ready for Lazarus/FPC ?
In this particular case that should not be necessary. Just set the proper compiler options, like e.g.: -Sv -CfAVX2 (or likewise -CfSSE3 etc)
You can examine the assembler output with -s, so -Sv -CfAVX2 -a
You will notice the compiler will do a pretty good job on optimizing vectors. Example is in types.pp TpointF record.
« Last Edit: August 17, 2017, 06:40:12 pm by Thaddy »
Specialize a type, not a var.

Eugene Loza

  • Hero Member
  • *****
  • Posts: 656
    • My games in Pascal
Re: ? Fastest Way to Calculate Dot Product?
« Reply #3 on: August 17, 2017, 12:11:50 pm »
a piece of code that will be run billions of times
Maybe, consider some caching if it's possible?
Or use threads to separate calculation stages?
My FOSS games in FreePascal&CastleGameEngine: https://decoherence.itch.io/ (Sources: https://gitlab.com/EugeneLoza)

Leledumbo

  • Hero Member
  • *****
  • Posts: 8744
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: ? Fastest Way to Calculate Dot Product?
« Reply #4 on: August 17, 2017, 09:13:34 pm »
a piece of code that will be run billions of times
Maybe, consider some caching if it's possible?
Or use threads to separate calculation stages?
Or combine both. Especially for the threads, it's easily divided since the calculation is flat against the parameters. You can divide the as many threads as your CPU allows, that would speed up the calculation as much times as well.

schuler

  • Full Member
  • ***
  • Posts: 223
Re: ? Fastest Way to Calculate Dot Product?
« Reply #5 on: August 18, 2017, 10:06:41 am »
 :) thank you so much for so many qualified replies :)

Had a quick look at https://decoherence.itch.io/ and at http://pascalgeek.blogspot.com . Congrats to both. Wow! 40km walking. Opppsss, I'm diverging from the topic.

Back to topic:

I'll certainly give a shot to  -Sv -CfAVX2 . Thank you Thaddy.

BTW, the question is so interesting that might become a sub project inside of a sub project. Given that I'll have to benchmark code changes and inspect how the optimizer is doing it's job, I'll share findings so everyone can see.

In the case that you need help integrating neural networks into your own projects, please let me know.

 :) May the force be with Pascal :)

schuler

  • Full Member
  • ***
  • Posts: 223
Re: ? Fastest Way to Calculate Dot Product?
« Reply #6 on: September 16, 2017, 12:28:18 pm »
 :) Hello :)

To introduce the problem I'm having, this is an example of a parallel add:

Code: Pascal  [Select][+][-]
  1. type
  2.   Single4 = record x,y,z,w:Single end;
  3. var
  4.   A: Single4;
  5. begin
  6.   A.x := 1;
  7.   A.y := 10;
  8.   A.z := 100;
  9.   A.w := 1000;
  10.  
  11.   asm
  12.     movups xmm0,A
  13.     movups xmm1,A
  14.     addps xmm0,xmm1
  15.     movups A,xmm0
  16.   end;
  17.  
  18.   WriteLn(A.x:6:4,' ',A.y:6:4,' ', A.z:6:4,' ', A.w:6:4);
  19.  
  20. end;
  21.  
  22.  

Above code works as expected. It sums all 4 Single values. When I try to do something similar
via AVX2, I'm not having luck:

This code:
Code: Pascal  [Select][+][-]
  1.   for I := 0 to 1000 do
  2.   begin
  3.     A[I].x += B[I].x;
  4.     A[I].y += B[I].y;
  5.     A[I].z += B[I].z;
  6.     A[I].w += B[I].w;
  7.   end;
  8.  

Translates into:
Code: Text  [Select][+][-]
  1. uformcifar10.pas:1232             A[I].x += B[I].x;
  2. 00429BB3 89d0                     mov    %edx,%eax
  3. 00429BB5 c1e004                   shl    $0x4,%eax
  4. 00429BB8 89d1                     mov    %edx,%ecx
  5. 00429BBA c1e104                   shl    $0x4,%ecx
  6. 00429BBD c5fa10840570c1ffff       vmovss -0x3e90(%ebp,%eax,1),%xmm0
  7. 00429BC6 c5fa58840de082ffff       vaddss -0x7d20(%ebp,%ecx,1),%xmm0,%xmm0
  8. 00429BCF 89d0                     mov    %edx,%eax
  9. 00429BD1 c1e004                   shl    $0x4,%eax
  10. 00429BD4 c5fa11840570c1ffff       vmovss %xmm0,-0x3e90(%ebp,%eax,1)
  11. uformcifar10.pas:1233             A[I].y += B[I].y;
  12. 00429BDD 89d0                     mov    %edx,%eax
  13. 00429BDF c1e004                   shl    $0x4,%eax
  14. 00429BE2 89d1                     mov    %edx,%ecx
  15. 00429BE4 c1e104                   shl    $0x4,%ecx
  16. 00429BE7 c5fa10840574c1ffff       vmovss -0x3e8c(%ebp,%eax,1),%xmm0
  17. 00429BF0 c5fa58840de482ffff       vaddss -0x7d1c(%ebp,%ecx,1),%xmm0,%xmm0
  18. 00429BF9 89d0                     mov    %edx,%eax
  19. 00429BFB c1e004                   shl    $0x4,%eax
  20. 00429BFE c5fa11840574c1ffff       vmovss %xmm0,-0x3e8c(%ebp,%eax,1)
  21.  

As you can see above, the compiled code is using vmovss/vaddss instead of movups / addps. There is no actual parallel processing.

The same thing happens with:
Code: [Select]
var
  C,D: array[0..1000] of Single;
...
  C[0] := C[0] + D[0];
  C[1] := C[1] + D[1];
  C[2] := C[2] + D[2];
  C[3] := C[3] + D[3];

translates into:
Code: [Select]
uformcifar10.pas:1238             C[0] := C[0] + D[0];
00429C67 c5fa10853c73ffff         vmovss -0x8cc4(%ebp),%xmm0
00429C6F c5fa58859863ffff         vaddss -0x9c68(%ebp),%xmm0,%xmm0
00429C77 c5fa11853c73ffff         vmovss %xmm0,-0x8cc4(%ebp)
uformcifar10.pas:1239             C[1] := C[1] + D[1];
00429C7F c5fa10854073ffff         vmovss -0x8cc0(%ebp),%xmm0
00429C87 c5fa58859c63ffff         vaddss -0x9c64(%ebp),%xmm0,%xmm0
00429C8F c5fa11854073ffff         vmovss %xmm0,-0x8cc0(%ebp)
uformcifar10.pas:1240             C[2] := C[2] + D[2];
00429C97 c5fa10854473ffff         vmovss -0x8cbc(%ebp),%xmm0
00429C9F c5fa5885a063ffff         vaddss -0x9c60(%ebp),%xmm0,%xmm0
00429CA7 c5fa11854473ffff         vmovss %xmm0,-0x8cbc(%ebp)
uformcifar10.pas:1241             C[3] := C[3] + D[3];
00429CAF c5fa10854873ffff         vmovss -0x8cb8(%ebp),%xmm0
00429CB7 c5fa5885a463ffff         vaddss -0x9c5c(%ebp),%xmm0,%xmm0
00429CBF c5fa11854873ffff         vmovss %xmm0,-0x8cb8(%ebp)

Therefore, comes the question for the pascal lovers community: am I doing something wrong? If not, I think that I'll build some assembler based API to take full advantage of xmm.

 :) wishing everyone happy pascal coding :)



marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11351
  • FPC developer.
Re: ? Fastest Way to Calculate Dot Product?
« Reply #7 on: September 16, 2017, 01:12:09 pm »
Try in win64 (I mostly use 64-bit)

Code: [Select]
{$mode delphi}
type
  Single4 = record
                 case boolean of
                  true :(asvec: array[0..3] of single);
                  false:(x,y,z,w : single);
                 end;

procedure tt;
// wrapper procedure to not use global variables.
var
  A,B: array[0..10] of  Single4;
  i : integer;
begin
  for i:=low(a) to high(a) do
    a[i].asvec+=b[i].asvec;
end;
 
begin
 tt;
end.

compile with -Sv -O4 -Opcoreavx -O

Then the 64-bit code looks like:

Code: [Select]
.Lj5:
addl $1,%edx
# [16] a[i].asvec+=b[i].asvec;
movl %edx,%eax
shlq $4,%rax
movl %edx,%ecx
shlq $4,%rcx
movdqa (%rsp,%rax),%xmm0
addps 176(%rsp,%rcx),%xmm0
movl %edx,%eax
shlq $4,%rax
cmpl $10,%edx
jnge .Lj5
# [17] end;

64-bits archs usually have 16-byte alignment built in. The loop looks horrible and shows PFC lacks strength reduction.

ChrisR

  • Full Member
  • ***
  • Posts: 247
Re: ? Fastest Way to Calculate Dot Product?
« Reply #8 on: September 16, 2017, 01:30:32 pm »
You may want to compare the compiler optimization to hand-coded SSE/AVX instructions, for example the DotProduct function in
http://www.tommesani.com/ExentiaFeatures.html

schuler

  • Full Member
  • ***
  • Posts: 223
Re: ? Fastest Way to Calculate Dot Product?
« Reply #9 on: September 16, 2017, 01:47:30 pm »
 :) Hello ChrisR :)

Super thanks for the quick reply.

I'm one of the guys working on Exentia back in 2001 (I'm getting old). You'll find schuler on the author list - it's me!!! I coded the first version of UPExentia unit that uses Exentia. I also did some testing/bug fixing in the 64 bits version.

I'M AFRAID THAT I HAVE AMAZING NEWS
With few tweaks, I've just been able to make Exentia to compile under LAZARUS. YAY!!!

I've attached an image with testing result.

This is the testing code:
Code: Pascal  [Select][+][-]
  1. procedure TestExentia;
  2. var
  3.   TestVector1, TestVector2, TestVector:TFVector;
  4. begin
  5.   TestVector1 := TSSEVector.Create(cTestVectorSize);
  6.   TestVector2 := TSSEVector.Create(cTestVectorSize);
  7.   TestVector  := TSSEVector.Create(cTestVectorSize);
  8.   TestLogicalOps(TestVector1, TestVector2, TestVector);
  9.   TestBinaryBasicOps(TestVector1, TestVector2, TestVector);
  10.   TestBinaryComplexOps(TestVector1, TestVector2, TestVector);
  11.   TestVector1.Free;
  12.   TestVector2.Free;
  13.   TestVector.Free;
  14. end;
  15.  

I'll try to contact Stephano to decide how to publish the updated code for everyone to have access.

 :) I love pascal :)

schuler

  • Full Member
  • ***
  • Posts: 223
Re: ? Fastest Way to Calculate Dot Product?
« Reply #10 on: September 16, 2017, 01:54:39 pm »
:) Hello Marcov :)

Super thanks for the reply. I'll surely give a try in WIN64. Good idea indeed.

:) Love FPC and LAZARUS :)


marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11351
  • FPC developer.
Re: ? Fastest Way to Calculate Dot Product?
« Reply #11 on: September 16, 2017, 02:03:15 pm »
32-bit works too, but generates modqu's. I'm not sure if that is so bad though on modern architectures.

schuler

  • Full Member
  • ***
  • Posts: 223
Re: ? Fastest Way to Calculate Dot Product?
« Reply #12 on: September 19, 2017, 05:08:17 pm »
 :) Thank you so much marcov  :)

Loved your "record" declaration. It's very useful. It now works here with size 4 only.

I'm now having a compile time error as described here (and forgot my bugs.freepascal login  :-\ )  :
http://forum.lazarus.freepascal.org/index.php/topic,32741.0.html




schuler

  • Full Member
  • ***
  • Posts: 223
Re: ? Fastest Way to Calculate Dot Product?
« Reply #13 on: September 20, 2017, 04:18:41 am »
:) Hello :)
I coded a 13x faster code as follows inspired on Exentia code. But there are 2 catches:
* It works in 32 bits only.
* I'll need to code a 64 bits version.

This is the code:
Code: Pascal  [Select][+][-]
  1. {$ASMMODE intel}
  2. {$DEFINE AVX32}
  3.  
  4.   TNeuralFloat = Single;
  5.   TNeuralFloatArr = array[0..0] of TNeuralFloat;
  6.   TNeuralFloatArrPtr = ^TNeuralFloatArr;
  7.  
  8. {$IFDEF AVX32}
  9. {$RANGECHECKS OFF}
  10. {$OVERFLOWCHECKS OFF}
  11.  
  12. function AVXDotProduct32(PtrA, PtrB: TNeuralFloatArrPtr; NumElements: integer): Single;
  13. var
  14.   I: integer;
  15.   vAux: array[0..7] of Single;
  16.   vRes: array[0..3] of Single;
  17.   localNumElements, MissedElements: integer;
  18. begin
  19.   localNumElements := (NumElements div 4) * 4;
  20.   MissedElements := NumElements - localNumElements;
  21.   //WriteLn(localNumElements,' ',MissedElements);
  22.   if localNumElements > 0 then
  23.   begin
  24.   asm
  25.   vzeroupper
  26.   mov ecx, localNumElements
  27.   mov eax, PtrA
  28.   mov edx, PtrB
  29.   vxorps ymm0, ymm0, ymm0
  30.   vxorps ymm1, ymm1, ymm1
  31.  
  32.   push ecx
  33.   shr ecx,4  // number of large iterations = number of elements / 16
  34.   jz @SkipLargeAddLoop
  35. @LargeAddLoop:
  36.  
  37.   vmovups ymm2, [eax]
  38.   vmovups ymm6, [edx]
  39.  
  40.   vmovups ymm3, [eax+32]
  41.   vmovups ymm7, [edx+32]
  42.  
  43.   vmulps  ymm2, ymm2, ymm6
  44.   vmulps  ymm3, ymm3, ymm7
  45.  
  46.   vaddps  ymm0, ymm0, ymm2
  47.   vaddps  ymm1, ymm1, ymm3
  48.  
  49.   add eax, 64
  50.   add edx, 64
  51.   dec ecx
  52.   jnz @LargeAddLoop
  53.  
  54. @SkipLargeAddLoop:
  55.   pop ecx
  56.   and ecx,$0000000F
  57.   jz @EndAdd
  58.   shr ecx, 2 // number of small iterations = (number of elements modulo 16) / 4
  59. @SmallAddLoop:
  60.  
  61.   movups xmm2, [eax]
  62.   movups xmm3, [edx]
  63.   mulps xmm2, xmm3
  64.   addps xmm0, xmm2
  65.  
  66.   add eax, 16
  67.   add edx, 16
  68.   dec ecx
  69.   jnz @SmallAddLoop
  70.  
  71. @EndAdd:
  72.   vaddps ymm0, ymm0, ymm1
  73.   vmovups vAux, ymm0
  74.   movups xmm2, [vAux+16]
  75.   addps xmm0, xmm2
  76.   movups vRes, xmm0
  77.   vzeroupper
  78.   end
  79.   [
  80.     'EAX', 'ECX', 'EDX',
  81.     'xmm0', 'xmm1', 'xmm2', 'xmm3',
  82.     'ymm0', 'ymm1', 'ymm2', 'ymm3', 'ymm6', 'ymm7'
  83.   ];
  84.  
  85.     Result := vRes[0] + vRes[1] + vRes[2] + vRes[3];
  86.   end else
  87.   begin
  88.     Result := 0;
  89.   end;
  90.  
  91.   if MissedElements>0 then
  92.   begin
  93.     for I := localNumElements to NumElements-1 do
  94.       Result += PtrA^[I] * PtrB^[I];
  95.   end;
  96. end;
  97. {$ENDIF}
  98.  

Gave a heavy try and it seems working.

:) I wish everyone happy coding. :)
« Last Edit: September 21, 2017, 09:08:45 am by schuler »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11351
  • FPC developer.
Re: ? Fastest Way to Calculate Dot Product?
« Reply #14 on: September 20, 2017, 07:36:21 am »
IIRC if you are done working with AVX, issue a vzeroupper instruction to clear the high halves of ymm registers. This avoids certain stalls when sse and avx are used together.

 

TinyPortal © 2005-2018