Recent

Author Topic: Parallel Code  (Read 9616 times)

rvk

  • Hero Member
  • *****
  • Posts: 4327
Re: Parallel Code
« Reply #30 on: February 19, 2018, 01:00:51 pm »
Ok, I kept it to standard downloaded Lazarus 1.8 32 and 64 bit.
Now I concentrated on the MOD (which is heavily used in isPrime()).

Quote
Running to 2147483647
64 bit laz1.8 (standard download): ModTest() took 20,297 seconds
32 bit laz1.8 (standard download): ModTest() took 6,093 seconds

(Not sure what to think of this)

Code: Pascal  [Select][+][-]
  1. program project1;
  2. {$MODE objfpc}
  3. uses SysUtils;
  4. const
  5.   // cMax = 2147483647; // MaxInt;
  6.   cMax = 50000000; // MaxInt;
  7. var
  8.   Tm: Double;
  9.   I: NativeInt;
  10.  
  11.   procedure ModTest(N: integer);
  12.   var
  13.     Test: NativeInt;
  14.     Q: NativeInt;
  15.   begin
  16.     for Test := 1 to 50 do
  17.     begin
  18.       Q := N mod Test;
  19.     end;
  20.   end;
  21.  
  22. begin
  23.   Writeln('Running to ', MaxInt);
  24.  
  25.   Tm := Now;
  26.  
  27.   for I := 1 to cMax do
  28.   begin
  29.     ModTest(I);
  30.   end;
  31.  
  32.   Tm := Now - Tm;
  33.   WriteLn(Format('ModTest() took %.3f seconds', [Tm * 24 * 60 * 60]));
  34.  
  35.   Writeln('Press enter key');
  36.   readln;
  37. end.

Nitorami

  • Sr. Member
  • ****
  • Posts: 378
Re: Parallel Code
« Reply #31 on: February 19, 2018, 02:43:32 pm »
Rather precisely the same result here, in 64 bit mode, modulo is factor 3 slower than in 32 bit mode, on an Intel CPU. What is even more surprising, if we force the 32bit-Compiler to use a 64 bit modulo, it calls fpc_mod_int64, a rather lengthy function, but is STILL not slower than in 64bit mode, where the division is a single idiv instruction...

Code: Pascal  [Select][+][-]
  1. Dump of assembler code for function fpc_mod_int64:
  2. :       push   %ebp
  3. $00405B11:      mov    %esp,%ebp
  4. $00405B13:      lea    -0x14(%esp),%esp
  5. $00405B17:      mov    %ebx,-0xc(%ebp)
  6. $00405B1A:      mov    %esi,-0x14(%ebp)
  7. $00405B1D:      mov    %edi,-0x10(%ebp)
  8. $00405B20:      mov    0x14(%ebp),%ecx
  9. $00405B23:      mov    0x10(%ebp),%ebx
  10. $00405B26:      mov    %ecx,%eax
  11. $00405B28:      or     %ebx,%eax
  12. $00405B2A:      jne    0x405b3d <fpc_mod_int64+45>
  13. $00405B2C:      mov    %ebp,%edx
  14. $00405B2E:      mov    $0xc8,%eax
  15. $00405B33:      call   0x408b20 <SYSTEM_$$_HANDLEERRORFRAME$LONGINT$POINTER>
  16. $00405B38:      jmp    0x405be6 <fpc_mod_int64+214>
  17. $00405B3D:      mov    0xc(%ebp),%edx
  18. $00405B40:      mov    0x8(%ebp),%eax
  19. $00405B43:      mov    %edx,%esi
  20. $00405B45:      sar    $0x1f,%esi
  21. $00405B48:      mov    %edx,%edi
  22. $00405B4A:      sar    $0x1f,%edi
  23. $00405B4D:      xor    %edi,%eax
  24. $00405B4F:      xor    %edi,%edx
  25. $00405B51:      sub    %edi,%eax
  26. $00405B53:      sbb    %edi,%edx
  27. $00405B55:      mov    %ecx,%edi
  28. $00405B57:      sar    $0x1f,%edi
  29. $00405B5A:      xor    %edi,%ebx
  30. $00405B5C:      xor    %edi,%ecx
  31. $00405B5E:      sub    %edi,%ebx
  32. $00405B60:      sbb    %edi,%ecx
  33. $00405B62:      jne    0x405b8a <fpc_mod_int64+122>
  34. $00405B64:      cmp    %ebx,%edx
  35. $00405B66:      jae    0x405b78 <fpc_mod_int64+104>
  36. $00405B68:      div    %ebx
  37. $00405B6A:      mov    %edx,%eax
  38. $00405B6C:      mov    %ecx,%edx
  39. $00405B6E:      xor    %esi,%eax
  40. $00405B70:      xor    %esi,%edx
  41. $00405B72:      sub    %esi,%eax
  42. $00405B74:      sbb    %esi,%edx
  43. $00405B76:      jmp    0x405be6 <fpc_mod_int64+214>
  44. $00405B78:      mov    %eax,%ecx
  45. $00405B7A:      mov    %edx,%eax
  46. $00405B7C:      xor    %edx,%edx
  47. $00405B7E:      div    %ebx
  48. $00405B80:      mov    %ecx,%eax
  49. $00405B82:      div    %ebx
  50. $00405B84:      mov    %edx,%eax
  51. $00405B86:      xor    %edx,%edx
  52. $00405B88:      jmp    0x405bde <fpc_mod_int64+206>
  53. $00405B8A:      sub    $0x10,%esp
  54. $00405B8D:      mov    %eax,(%esp)
  55. $00405B90:      mov    %ebx,0x4(%esp)
  56. $00405B94:      mov    %edx,0x8(%esp)
  57. $00405B98:      mov    %ecx,0xc(%esp)
  58. $00405B9C:      mov    %ecx,%edi
  59. $00405B9E:      shr    %edx
  60. .... etc etc.
  61.  

dieselfan

  • New Member
  • *
  • Posts: 16
Re: Parallel Code
« Reply #32 on: February 19, 2018, 07:25:12 pm »
I see 2 possible issues here

1. Huge difference between 32 and 64bit on Intel CPU's
2. The code on an AMD runs almost 3x slower in ST and if it weren't for the more cores would be very slow. Compared to other non-prime "benchmarks" I've run the AMD ST is roughly 105% the time taken vs a 7th gen i7. Here it's 3x vs a 3rd gen i7.

I have access to a couple of cpu's and generations and will test with Delphi too
AMD 1800X
Manjaro KDE
Rarely Windows 10 Pro

Phemtik

  • New Member
  • *
  • Posts: 19
Re: Parallel Code
« Reply #33 on: February 20, 2018, 09:15:21 pm »
I think i found a bit about the difference.

It's a mixture of a bad optimization and Intel. ::)
So first the optimization.

The difference between x86 and x64 for UInt32 mostly comes from an doubled multiplication.
x86 do [ lea (%edx,%edx,2),%edx ] but x64 do [ imul  $0x3,%edx,%edx ].
The Instruction Table document say, there should no speed difference but it seems there is little.

The difference between x86 Uint32 and x86 Int32 come from the needed IDIV instruction because of signed type

Now we come to the most Interesting point.
A combination of bad optimization and Intel.
Both belong together, because it is only an Intel thing.

Under X64 fpc generate code that use 64-bit registers (%rax and %rcx) that is not necessary. => Bad Optimization (for Intel)
Because Intel need after the document a lot more time and power if we use 64-bit registers.
On AMD it make no difference witch registers we use. Every register take the same time.

So it is a combination of bad optimization and Intel.


CPU: Intel i7-3610QM
FPC: 3.1.1@38232
Options: -O3 -OpCOREAVX -CpCOREAVX

Times
UInt32 x86: 4.897 s, 4.934 s, 4.873 s
Int32 x86   : 7.686 s, 7.675 s, 7.668 s
UInt32 x64: 5.150 s, 5.180 s, 5.228 s
Int32 x64   : 23.515 s, 23.409 s, 23.424 s

Instruction Tables:  http://www.agner.org/optimize/instruction_tables.pdf

The code i used:
Code: Pascal  [Select][+][-]
  1. program project1;
  2.  
  3. {$mode objfpc}
  4.  
  5. uses
  6.     {$IFDEF UNIX}{$IFDEF UseCThreads}
  7.     cthreads,
  8.     {$ENDIF}{$ENDIF}
  9.     SysUtils;
  10.  
  11. type
  12.   TTestType = UInt32;
  13.   //TTestType = Int32;
  14. const
  15.   cMax = High(Int32);
  16.   //cMax = 50000000;
  17.   //cMax = 5;
  18.  
  19. procedure ModTest(const N: TTestType);
  20.   var
  21.       Q: TTestType;
  22.   begin
  23.       Q := N mod 3;
  24.   end;
  25.  
  26. var
  27.   Tm: Double;
  28.   I: TTestType;
  29.  
  30. begin
  31.   Writeln('Running to ', cMax);
  32.   Tm := Now;
  33.   for I := 1 to cMax do
  34.   begin
  35.       ModTest(I);
  36.   end;
  37.   Tm := Now - Tm;
  38.   WriteLn(Format('ModTest() took %.3f seconds', [Tm * 24 * 60 * 60]));
  39. end.
  40.  

x86 UInt32:
Code: Pascal  [Select][+][-]
  1. project1.lpr:22                   Q := N mod 3;
  2. 08048112 b8abaaaaaa               mov    $0xaaaaaaab,%eax
  3. 08048117 f7e1                     mul    %ecx
  4. 08048119 d1ea                     shr    %edx
  5. 0804811B 8d1452                   lea    (%edx,%edx,2),%edx
  6. 0804811E 29d1                     sub    %edx,%ecx

x86 Int32:
Code: Pascal  [Select][+][-]
  1. project1.lpr:22                   Q := N mod 3;
  2. 08048110 99                       cltd  
  3. 08048111 b903000000               mov    $0x3,%ecx
  4. 08048116 f7f9                     idiv   %ecx

x64 UInt32:
Code: Pascal  [Select][+][-]
  1. project1.lpr:22                           Q := N mod 3;
  2. 0000000000400200 b8abaaaaaa               mov    $0xaaaaaaab,%eax
  3. 0000000000400205 f7e7                     mul    %edi
  4. 0000000000400207 d1ea                     shr    %edx
  5. 0000000000400209 6bd203                   imul   $0x3,%edx,%edx
  6. 000000000040020C 29d7                     sub    %edx,%edi

x64 Int32:
Code: Pascal  [Select][+][-]
  1. project1.lpr:22                           Q := N mod 3;
  2. 0000000000400202 4863c0                   movslq %eax,%rax
  3. 0000000000400205 4899                     cqto  
  4. 0000000000400207 b903000000               mov    $0x3,%ecx
  5. 000000000040020C 48f7f9                   idiv   %rcx

Intel i7-3610QM
Fedora 28

rvk

  • Hero Member
  • *****
  • Posts: 4327
Re: Parallel Code
« Reply #34 on: February 20, 2018, 09:45:40 pm »
O wow, that's some good information.
And yes, changing the tests to UInt32 does result in much much more speed.

Even the original isPrime() function is faster although it needs to be reworked somewhat. Because "if (N mod Test) = 0" internally still converts to using a Int32 (?) even if N and Test are UInt32s you need to first set A1 := N mod Test (where A1 is a UInt32) and use that to check against 0.

I hope some compiler experts/programmers are reading this and picking this up (or maybe a bugfix needs to be entered in the tracker). The small slowdown with UInt32 from x86 to x64 isn't such a big deal (I'm positive that will impact Delphi too) but the bad optimization could be fixed (Delphi doesn't suffer from this).

coolmyf

  • Newbie
  • Posts: 3
Re: Parallel Code
« Reply #35 on: September 08, 2019, 08:14:05 am »
In my program, I changed int64 to uint64. I found the program run 3 times faster than before.

 

TinyPortal © 2005-2018