### Bookstore

 Computer Math and Games in Pascal (preview) Lazarus Handbook

### Author Topic: Parallel Code  (Read 9616 times)

#### rvk

• Hero Member
• Posts: 4327
##### Re: Parallel Code
« Reply #30 on: February 19, 2018, 01:00:51 pm »
Ok, I kept it to standard downloaded Lazarus 1.8 32 and 64 bit.
Now I concentrated on the MOD (which is heavily used in isPrime()).

Quote
Running to 2147483647
64 bit laz1.8 (standard download): ModTest() took 20,297 seconds
32 bit laz1.8 (standard download): ModTest() took 6,093 seconds

(Not sure what to think of this)

Code: Pascal  [Select][+][-]
1. program project1;
2. {\$MODE objfpc}
3. uses SysUtils;
4. const
5.   // cMax = 2147483647; // MaxInt;
6.   cMax = 50000000; // MaxInt;
7. var
8.   Tm: Double;
9.   I: NativeInt;
10.
11.   procedure ModTest(N: integer);
12.   var
13.     Test: NativeInt;
14.     Q: NativeInt;
15.   begin
16.     for Test := 1 to 50 do
17.     begin
18.       Q := N mod Test;
19.     end;
20.   end;
21.
22. begin
23.   Writeln('Running to ', MaxInt);
24.
25.   Tm := Now;
26.
27.   for I := 1 to cMax do
28.   begin
29.     ModTest(I);
30.   end;
31.
32.   Tm := Now - Tm;
33.   WriteLn(Format('ModTest() took %.3f seconds', [Tm * 24 * 60 * 60]));
34.
35.   Writeln('Press enter key');
36.   readln;
37. end.

#### Nitorami

• Sr. Member
• Posts: 378
##### Re: Parallel Code
« Reply #31 on: February 19, 2018, 02:43:32 pm »
Rather precisely the same result here, in 64 bit mode, modulo is factor 3 slower than in 32 bit mode, on an Intel CPU. What is even more surprising, if we force the 32bit-Compiler to use a 64 bit modulo, it calls fpc_mod_int64, a rather lengthy function, but is STILL not slower than in 64bit mode, where the division is a single idiv instruction...

Code: Pascal  [Select][+][-]
1. Dump of assembler code for function fpc_mod_int64:
2. :       push   %ebp
3. \$00405B11:      mov    %esp,%ebp
4. \$00405B13:      lea    -0x14(%esp),%esp
5. \$00405B17:      mov    %ebx,-0xc(%ebp)
6. \$00405B1A:      mov    %esi,-0x14(%ebp)
7. \$00405B1D:      mov    %edi,-0x10(%ebp)
8. \$00405B20:      mov    0x14(%ebp),%ecx
9. \$00405B23:      mov    0x10(%ebp),%ebx
10. \$00405B26:      mov    %ecx,%eax
11. \$00405B28:      or     %ebx,%eax
12. \$00405B2A:      jne    0x405b3d <fpc_mod_int64+45>
13. \$00405B2C:      mov    %ebp,%edx
14. \$00405B2E:      mov    \$0xc8,%eax
15. \$00405B33:      call   0x408b20 <SYSTEM_\$\$_HANDLEERRORFRAME\$LONGINT\$POINTER>
16. \$00405B38:      jmp    0x405be6 <fpc_mod_int64+214>
17. \$00405B3D:      mov    0xc(%ebp),%edx
18. \$00405B40:      mov    0x8(%ebp),%eax
19. \$00405B43:      mov    %edx,%esi
20. \$00405B45:      sar    \$0x1f,%esi
21. \$00405B48:      mov    %edx,%edi
22. \$00405B4A:      sar    \$0x1f,%edi
23. \$00405B4D:      xor    %edi,%eax
24. \$00405B4F:      xor    %edi,%edx
25. \$00405B51:      sub    %edi,%eax
26. \$00405B53:      sbb    %edi,%edx
27. \$00405B55:      mov    %ecx,%edi
28. \$00405B57:      sar    \$0x1f,%edi
29. \$00405B5A:      xor    %edi,%ebx
30. \$00405B5C:      xor    %edi,%ecx
31. \$00405B5E:      sub    %edi,%ebx
32. \$00405B60:      sbb    %edi,%ecx
33. \$00405B62:      jne    0x405b8a <fpc_mod_int64+122>
34. \$00405B64:      cmp    %ebx,%edx
35. \$00405B66:      jae    0x405b78 <fpc_mod_int64+104>
36. \$00405B68:      div    %ebx
37. \$00405B6A:      mov    %edx,%eax
38. \$00405B6C:      mov    %ecx,%edx
39. \$00405B6E:      xor    %esi,%eax
40. \$00405B70:      xor    %esi,%edx
41. \$00405B72:      sub    %esi,%eax
42. \$00405B74:      sbb    %esi,%edx
43. \$00405B76:      jmp    0x405be6 <fpc_mod_int64+214>
44. \$00405B78:      mov    %eax,%ecx
45. \$00405B7A:      mov    %edx,%eax
46. \$00405B7C:      xor    %edx,%edx
47. \$00405B7E:      div    %ebx
48. \$00405B80:      mov    %ecx,%eax
49. \$00405B82:      div    %ebx
50. \$00405B84:      mov    %edx,%eax
51. \$00405B86:      xor    %edx,%edx
52. \$00405B88:      jmp    0x405bde <fpc_mod_int64+206>
53. \$00405B8A:      sub    \$0x10,%esp
54. \$00405B8D:      mov    %eax,(%esp)
55. \$00405B90:      mov    %ebx,0x4(%esp)
56. \$00405B94:      mov    %edx,0x8(%esp)
57. \$00405B98:      mov    %ecx,0xc(%esp)
58. \$00405B9C:      mov    %ecx,%edi
59. \$00405B9E:      shr    %edx
60. .... etc etc.
61.

#### dieselfan

• New Member
• Posts: 16
##### Re: Parallel Code
« Reply #32 on: February 19, 2018, 07:25:12 pm »
I see 2 possible issues here

1. Huge difference between 32 and 64bit on Intel CPU's
2. The code on an AMD runs almost 3x slower in ST and if it weren't for the more cores would be very slow. Compared to other non-prime "benchmarks" I've run the AMD ST is roughly 105% the time taken vs a 7th gen i7. Here it's 3x vs a 3rd gen i7.

I have access to a couple of cpu's and generations and will test with Delphi too
AMD 1800X
Manjaro KDE
Rarely Windows 10 Pro

#### Phemtik

• New Member
• Posts: 19
##### Re: Parallel Code
« Reply #33 on: February 20, 2018, 09:15:21 pm »
I think i found a bit about the difference.

It's a mixture of a bad optimization and Intel.
So first the optimization.

The difference between x86 and x64 for UInt32 mostly comes from an doubled multiplication.
x86 do [ lea (%edx,%edx,2),%edx ] but x64 do [ imul  \$0x3,%edx,%edx ].
The Instruction Table document say, there should no speed difference but it seems there is little.

The difference between x86 Uint32 and x86 Int32 come from the needed IDIV instruction because of signed type

Now we come to the most Interesting point.
A combination of bad optimization and Intel.
Both belong together, because it is only an Intel thing.

Under X64 fpc generate code that use 64-bit registers (%rax and %rcx) that is not necessary. => Bad Optimization (for Intel)
Because Intel need after the document a lot more time and power if we use 64-bit registers.
On AMD it make no difference witch registers we use. Every register take the same time.

So it is a combination of bad optimization and Intel.

CPU: Intel i7-3610QM
FPC: 3.1.1@38232
Options: -O3 -OpCOREAVX -CpCOREAVX

Times
UInt32 x86: 4.897 s, 4.934 s, 4.873 s
Int32 x86   : 7.686 s, 7.675 s, 7.668 s
UInt32 x64: 5.150 s, 5.180 s, 5.228 s
Int32 x64   : 23.515 s, 23.409 s, 23.424 s

Instruction Tables:  http://www.agner.org/optimize/instruction_tables.pdf

The code i used:
Code: Pascal  [Select][+][-]
1. program project1;
2.
3. {\$mode objfpc}
4.
5. uses
6.     {\$IFDEF UNIX}{\$IFDEF UseCThreads}
7.     cthreads,
8.     {\$ENDIF}{\$ENDIF}
9.     SysUtils;
10.
11. type
12.   TTestType = UInt32;
13.   //TTestType = Int32;
14. const
15.   cMax = High(Int32);
16.   //cMax = 50000000;
17.   //cMax = 5;
18.
19. procedure ModTest(const N: TTestType);
20.   var
21.       Q: TTestType;
22.   begin
23.       Q := N mod 3;
24.   end;
25.
26. var
27.   Tm: Double;
28.   I: TTestType;
29.
30. begin
31.   Writeln('Running to ', cMax);
32.   Tm := Now;
33.   for I := 1 to cMax do
34.   begin
35.       ModTest(I);
36.   end;
37.   Tm := Now - Tm;
38.   WriteLn(Format('ModTest() took %.3f seconds', [Tm * 24 * 60 * 60]));
39. end.
40.

x86 UInt32:
Code: Pascal  [Select][+][-]
1. project1.lpr:22                   Q := N mod 3;
2. 08048112 b8abaaaaaa               mov    \$0xaaaaaaab,%eax
3. 08048117 f7e1                     mul    %ecx
4. 08048119 d1ea                     shr    %edx
5. 0804811B 8d1452                   lea    (%edx,%edx,2),%edx
6. 0804811E 29d1                     sub    %edx,%ecx

x86 Int32:
Code: Pascal  [Select][+][-]
1. project1.lpr:22                   Q := N mod 3;
2. 08048110 99                       cltd
3. 08048111 b903000000               mov    \$0x3,%ecx
4. 08048116 f7f9                     idiv   %ecx

x64 UInt32:
Code: Pascal  [Select][+][-]
1. project1.lpr:22                           Q := N mod 3;
2. 0000000000400200 b8abaaaaaa               mov    \$0xaaaaaaab,%eax
3. 0000000000400205 f7e7                     mul    %edi
4. 0000000000400207 d1ea                     shr    %edx
5. 0000000000400209 6bd203                   imul   \$0x3,%edx,%edx
6. 000000000040020C 29d7                     sub    %edx,%edi

x64 Int32:
Code: Pascal  [Select][+][-]
1. project1.lpr:22                           Q := N mod 3;
2. 0000000000400202 4863c0                   movslq %eax,%rax
3. 0000000000400205 4899                     cqto
4. 0000000000400207 b903000000               mov    \$0x3,%ecx
5. 000000000040020C 48f7f9                   idiv   %rcx

Intel i7-3610QM
Fedora 28

#### rvk

• Hero Member
• Posts: 4327
##### Re: Parallel Code
« Reply #34 on: February 20, 2018, 09:45:40 pm »
O wow, that's some good information.
And yes, changing the tests to UInt32 does result in much much more speed.

Even the original isPrime() function is faster although it needs to be reworked somewhat. Because "if (N mod Test) = 0" internally still converts to using a Int32 (?) even if N and Test are UInt32s you need to first set A1 := N mod Test (where A1 is a UInt32) and use that to check against 0.

I hope some compiler experts/programmers are reading this and picking this up (or maybe a bugfix needs to be entered in the tracker). The small slowdown with UInt32 from x86 to x64 isn't such a big deal (I'm positive that will impact Delphi too) but the bad optimization could be fixed (Delphi doesn't suffer from this).

#### coolmyf

• Newbie
• Posts: 3
##### Re: Parallel Code
« Reply #35 on: September 08, 2019, 08:14:05 am »
In my program, I changed int64 to uint64. I found the program run 3 times faster than before.