Recent

Author Topic: Parallel Code  (Read 10603 times)

rvk

  • Hero Member
  • *****
  • Posts: 4465
Re: Parallel Code
« Reply #30 on: February 19, 2018, 01:00:51 pm »
Ok, I kept it to standard downloaded Lazarus 1.8 32 and 64 bit.
Now I concentrated on the MOD (which is heavily used in isPrime()).

Quote
Running to 2147483647
64 bit laz1.8 (standard download): ModTest() took 20,297 seconds
32 bit laz1.8 (standard download): ModTest() took 6,093 seconds

(Not sure what to think of this)

Code: Pascal  [Select][+][-]
  1. program project1;
  2. {$MODE objfpc}
  3. uses SysUtils;
  4. const
  5.   // cMax = 2147483647; // MaxInt;
  6.   cMax = 50000000; // MaxInt;
  7. var
  8.   Tm: Double;
  9.   I: NativeInt;
  10.  
  11.   procedure ModTest(N: integer);
  12.   var
  13.     Test: NativeInt;
  14.     Q: NativeInt;
  15.   begin
  16.     for Test := 1 to 50 do
  17.     begin
  18.       Q := N mod Test;
  19.     end;
  20.   end;
  21.  
  22. begin
  23.   Writeln('Running to ', MaxInt);
  24.  
  25.   Tm := Now;
  26.  
  27.   for I := 1 to cMax do
  28.   begin
  29.     ModTest(I);
  30.   end;
  31.  
  32.   Tm := Now - Tm;
  33.   WriteLn(Format('ModTest() took %.3f seconds', [Tm * 24 * 60 * 60]));
  34.  
  35.   Writeln('Press enter key');
  36.   readln;
  37. end.

Nitorami

  • Sr. Member
  • ****
  • Posts: 395
Re: Parallel Code
« Reply #31 on: February 19, 2018, 02:43:32 pm »
Rather precisely the same result here, in 64 bit mode, modulo is factor 3 slower than in 32 bit mode, on an Intel CPU. What is even more surprising, if we force the 32bit-Compiler to use a 64 bit modulo, it calls fpc_mod_int64, a rather lengthy function, but is STILL not slower than in 64bit mode, where the division is a single idiv instruction...

Code: Pascal  [Select][+][-]
  1. Dump of assembler code for function fpc_mod_int64:
  2. :       push   %ebp
  3. $00405B11:      mov    %esp,%ebp
  4. $00405B13:      lea    -0x14(%esp),%esp
  5. $00405B17:      mov    %ebx,-0xc(%ebp)
  6. $00405B1A:      mov    %esi,-0x14(%ebp)
  7. $00405B1D:      mov    %edi,-0x10(%ebp)
  8. $00405B20:      mov    0x14(%ebp),%ecx
  9. $00405B23:      mov    0x10(%ebp),%ebx
  10. $00405B26:      mov    %ecx,%eax
  11. $00405B28:      or     %ebx,%eax
  12. $00405B2A:      jne    0x405b3d <fpc_mod_int64+45>
  13. $00405B2C:      mov    %ebp,%edx
  14. $00405B2E:      mov    $0xc8,%eax
  15. $00405B33:      call   0x408b20 <SYSTEM_$$_HANDLEERRORFRAME$LONGINT$POINTER>
  16. $00405B38:      jmp    0x405be6 <fpc_mod_int64+214>
  17. $00405B3D:      mov    0xc(%ebp),%edx
  18. $00405B40:      mov    0x8(%ebp),%eax
  19. $00405B43:      mov    %edx,%esi
  20. $00405B45:      sar    $0x1f,%esi
  21. $00405B48:      mov    %edx,%edi
  22. $00405B4A:      sar    $0x1f,%edi
  23. $00405B4D:      xor    %edi,%eax
  24. $00405B4F:      xor    %edi,%edx
  25. $00405B51:      sub    %edi,%eax
  26. $00405B53:      sbb    %edi,%edx
  27. $00405B55:      mov    %ecx,%edi
  28. $00405B57:      sar    $0x1f,%edi
  29. $00405B5A:      xor    %edi,%ebx
  30. $00405B5C:      xor    %edi,%ecx
  31. $00405B5E:      sub    %edi,%ebx
  32. $00405B60:      sbb    %edi,%ecx
  33. $00405B62:      jne    0x405b8a <fpc_mod_int64+122>
  34. $00405B64:      cmp    %ebx,%edx
  35. $00405B66:      jae    0x405b78 <fpc_mod_int64+104>
  36. $00405B68:      div    %ebx
  37. $00405B6A:      mov    %edx,%eax
  38. $00405B6C:      mov    %ecx,%edx
  39. $00405B6E:      xor    %esi,%eax
  40. $00405B70:      xor    %esi,%edx
  41. $00405B72:      sub    %esi,%eax
  42. $00405B74:      sbb    %esi,%edx
  43. $00405B76:      jmp    0x405be6 <fpc_mod_int64+214>
  44. $00405B78:      mov    %eax,%ecx
  45. $00405B7A:      mov    %edx,%eax
  46. $00405B7C:      xor    %edx,%edx
  47. $00405B7E:      div    %ebx
  48. $00405B80:      mov    %ecx,%eax
  49. $00405B82:      div    %ebx
  50. $00405B84:      mov    %edx,%eax
  51. $00405B86:      xor    %edx,%edx
  52. $00405B88:      jmp    0x405bde <fpc_mod_int64+206>
  53. $00405B8A:      sub    $0x10,%esp
  54. $00405B8D:      mov    %eax,(%esp)
  55. $00405B90:      mov    %ebx,0x4(%esp)
  56. $00405B94:      mov    %edx,0x8(%esp)
  57. $00405B98:      mov    %ecx,0xc(%esp)
  58. $00405B9C:      mov    %ecx,%edi
  59. $00405B9E:      shr    %edx
  60. .... etc etc.
  61.  

dieselfan

  • New Member
  • *
  • Posts: 16
Re: Parallel Code
« Reply #32 on: February 19, 2018, 07:25:12 pm »
I see 2 possible issues here

1. Huge difference between 32 and 64bit on Intel CPU's
2. The code on an AMD runs almost 3x slower in ST and if it weren't for the more cores would be very slow. Compared to other non-prime "benchmarks" I've run the AMD ST is roughly 105% the time taken vs a 7th gen i7. Here it's 3x vs a 3rd gen i7.

I have access to a couple of cpu's and generations and will test with Delphi too
AMD 1800X
Manjaro KDE
Rarely Windows 10 Pro

Phemtik

  • New Member
  • *
  • Posts: 19
Re: Parallel Code
« Reply #33 on: February 20, 2018, 09:15:21 pm »
I think i found a bit about the difference.

It's a mixture of a bad optimization and Intel. ::)
So first the optimization.

The difference between x86 and x64 for UInt32 mostly comes from an doubled multiplication.
x86 do [ lea (%edx,%edx,2),%edx ] but x64 do [ imul  $0x3,%edx,%edx ].
The Instruction Table document say, there should no speed difference but it seems there is little.

The difference between x86 Uint32 and x86 Int32 come from the needed IDIV instruction because of signed type

Now we come to the most Interesting point.
A combination of bad optimization and Intel.
Both belong together, because it is only an Intel thing.

Under X64 fpc generate code that use 64-bit registers (%rax and %rcx) that is not necessary. => Bad Optimization (for Intel)
Because Intel need after the document a lot more time and power if we use 64-bit registers.
On AMD it make no difference witch registers we use. Every register take the same time.

So it is a combination of bad optimization and Intel.


CPU: Intel i7-3610QM
FPC: 3.1.1@38232
Options: -O3 -OpCOREAVX -CpCOREAVX

Times
UInt32 x86: 4.897 s, 4.934 s, 4.873 s
Int32 x86   : 7.686 s, 7.675 s, 7.668 s
UInt32 x64: 5.150 s, 5.180 s, 5.228 s
Int32 x64   : 23.515 s, 23.409 s, 23.424 s

Instruction Tables:  http://www.agner.org/optimize/instruction_tables.pdf

The code i used:
Code: Pascal  [Select][+][-]
  1. program project1;
  2.  
  3. {$mode objfpc}
  4.  
  5. uses
  6.     {$IFDEF UNIX}{$IFDEF UseCThreads}
  7.     cthreads,
  8.     {$ENDIF}{$ENDIF}
  9.     SysUtils;
  10.  
  11. type
  12.   TTestType = UInt32;
  13.   //TTestType = Int32;
  14. const
  15.   cMax = High(Int32);
  16.   //cMax = 50000000;
  17.   //cMax = 5;
  18.  
  19. procedure ModTest(const N: TTestType);
  20.   var
  21.       Q: TTestType;
  22.   begin
  23.       Q := N mod 3;
  24.   end;
  25.  
  26. var
  27.   Tm: Double;
  28.   I: TTestType;
  29.  
  30. begin
  31.   Writeln('Running to ', cMax);
  32.   Tm := Now;
  33.   for I := 1 to cMax do
  34.   begin
  35.       ModTest(I);
  36.   end;
  37.   Tm := Now - Tm;
  38.   WriteLn(Format('ModTest() took %.3f seconds', [Tm * 24 * 60 * 60]));
  39. end.
  40.  

x86 UInt32:
Code: Pascal  [Select][+][-]
  1. project1.lpr:22                   Q := N mod 3;
  2. 08048112 b8abaaaaaa               mov    $0xaaaaaaab,%eax
  3. 08048117 f7e1                     mul    %ecx
  4. 08048119 d1ea                     shr    %edx
  5. 0804811B 8d1452                   lea    (%edx,%edx,2),%edx
  6. 0804811E 29d1                     sub    %edx,%ecx

x86 Int32:
Code: Pascal  [Select][+][-]
  1. project1.lpr:22                   Q := N mod 3;
  2. 08048110 99                       cltd  
  3. 08048111 b903000000               mov    $0x3,%ecx
  4. 08048116 f7f9                     idiv   %ecx

x64 UInt32:
Code: Pascal  [Select][+][-]
  1. project1.lpr:22                           Q := N mod 3;
  2. 0000000000400200 b8abaaaaaa               mov    $0xaaaaaaab,%eax
  3. 0000000000400205 f7e7                     mul    %edi
  4. 0000000000400207 d1ea                     shr    %edx
  5. 0000000000400209 6bd203                   imul   $0x3,%edx,%edx
  6. 000000000040020C 29d7                     sub    %edx,%edi

x64 Int32:
Code: Pascal  [Select][+][-]
  1. project1.lpr:22                           Q := N mod 3;
  2. 0000000000400202 4863c0                   movslq %eax,%rax
  3. 0000000000400205 4899                     cqto  
  4. 0000000000400207 b903000000               mov    $0x3,%ecx
  5. 000000000040020C 48f7f9                   idiv   %rcx

Intel i7-3610QM
Fedora 28

rvk

  • Hero Member
  • *****
  • Posts: 4465
Re: Parallel Code
« Reply #34 on: February 20, 2018, 09:45:40 pm »
O wow, that's some good information.
And yes, changing the tests to UInt32 does result in much much more speed.

Even the original isPrime() function is faster although it needs to be reworked somewhat. Because "if (N mod Test) = 0" internally still converts to using a Int32 (?) even if N and Test are UInt32s you need to first set A1 := N mod Test (where A1 is a UInt32) and use that to check against 0.

I hope some compiler experts/programmers are reading this and picking this up (or maybe a bugfix needs to be entered in the tracker). The small slowdown with UInt32 from x86 to x64 isn't such a big deal (I'm positive that will impact Delphi too) but the bad optimization could be fixed (Delphi doesn't suffer from this).

coolmyf

  • Newbie
  • Posts: 3
Re: Parallel Code
« Reply #35 on: September 08, 2019, 08:14:05 am »
In my program, I changed int64 to uint64. I found the program run 3 times faster than before.

ASBzone

  • Hero Member
  • *****
  • Posts: 538
  • Automation leads to relaxation...
    • Free Console Utilities for Windows from BrainWaveCC
Re: Parallel Code
« Reply #36 on: January 25, 2021, 06:33:02 pm »
I changed the code (attached) and now I have the result as below.
Because the InterlockedIncrement can also be used to increase the CurrentIndex you see it's much faster.

I think this is the simplest code you're going to get with parallel threads in FPC. And no need for large library/underlying code like in Delphi.

@rvk, I really want to thank you for this code.  I looked at a number of Free Pascal compatible multiprocesssing/parallel_processing units and packages, and by far, this code was the simplest for me to adapt to my needs.

Ironically, I was also doing it for a prime number generator, but in my case it was a segmented sieve routine, so I applied the parallel FOR to the different segments.

Man, it is sweet.  :)

On an 8C/16T Ryzen 2700x, I can process 10 BILLION prime in 49.699 seconds using Segmented Sieve without multiprocessing. With 16 threads, however, that drops down to 4.944 seconds.  :D

Code: Text  [Select][+][-]
  1. >PrimeGenTest64.exe 10000000000
  2. PrimeGenTest64 v3.0.0.300e -- Current Time is 1/25/2021 12:25:25
  3.  
  4. Max Search ......... 10,000,000,000 (0x2540BE400)
  5. Total Primes .......    455,052,511 (0x1B1F8CDF)
  6. Highest Prime ......  9,999,999,967 (0x2540BE3DF)
  7. ---------------------------------------------------------------------------
  8. Start Time ......... 01/25/2021 12:25:25.829
  9. End Time ........... 01/25/2021 12:26:15.528
  10. Elapsed Time .......             0:00:49.699   <------------ NORMAL MODE (ONE THREAD)
  11. ---------------------------------------------------------------------------
  12.  
  13.  
  14. >PrimeGenTest64.exe 10000000000 /p:16
  15. PrimeGenTest64 v3.0.0.300e -- Current Time is 1/25/2021 12:26:15
  16.  
  17. Max Search ......... 10,000,000,000 (0x2540BE400)
  18. Total Primes .......    455,052,511 (0x1B1F8CDF)
  19. Highest Prime ......  9,999,999,967 (0x2540BE3DF)
  20. ---------------------------------------------------------------------------
  21. Start Time ......... 01/25/2021 12:26:15.660
  22. End Time ........... 01/25/2021 12:26:20.604
  23. Elapsed Time .......             0:00:04.944   <------------ PARALLEL MODE (16 THREADS)
  24. ---------------------------------------------------------------------------
  25.  
-ASB: https://www.BrainWaveCC.com

Lazarus v2.0.11 r64280 / FPC v3.2.1-r48417 (via FpcUpDeluxe) -- Windows 64-bit install w/32-bit cross-compile
Primary System: Windows 10 Pro x64, Version 2009 (Build 19042)
Other Systems: Windows 10 Pro x64, Version 2009 (Build 19042) or greater

Peter H

  • Full Member
  • ***
  • Posts: 195
Re: Parallel Code
« Reply #37 on: January 25, 2021, 08:58:01 pm »
I tried the program on my computer, it is an Athlon Phenom, 6 cores, 2.6 GHz.
When running single threaded the CPU runs overclocked at 3 GHz, when multithreaded it runs at 2.57 GHz. (when idle, it runs below 1GHz)
Running 6 threads is about 5 times faster than singlethreaded, this matches the expectations.
The difference 32 bit vs. 64 bit is minimal; 32 bit is 5% faster.
Optimization level O1 vs. O4 makes less than 3% improvement.
« Last Edit: January 25, 2021, 10:11:48 pm by Peter H »

ASBzone

  • Hero Member
  • *****
  • Posts: 538
  • Automation leads to relaxation...
    • Free Console Utilities for Windows from BrainWaveCC
Re: Parallel Code
« Reply #38 on: January 25, 2021, 10:58:42 pm »
I tried the program on my computer, it is an Athlon Phenom, 6 cores, 2.6 GHz.
When running single threaded the CPU runs overclocked at 3 GHz, when multithreaded it runs at 2.57 GHz. (when idle, it runs below 1GHz)
Running 6 threads is about 5 times faster than singlethreaded, this matches the expectations.
The difference 32 bit vs. 64 bit is minimal; 32 bit is 5% faster.
Optimization level O1 vs. O4 makes less than 3% improvement.

In segmented sieve, the performance of 64-bit vs 32-bit is substantial (double or better), whether single threaded or multi-threaded.
-ASB: https://www.BrainWaveCC.com

Lazarus v2.0.11 r64280 / FPC v3.2.1-r48417 (via FpcUpDeluxe) -- Windows 64-bit install w/32-bit cross-compile
Primary System: Windows 10 Pro x64, Version 2009 (Build 19042)
Other Systems: Windows 10 Pro x64, Version 2009 (Build 19042) or greater

 

TinyPortal © 2005-2018