Recent

Author Topic: Speed Matters - Bad Optimization of Mod/Div on x86/x64 with 32-Bit values  (Read 3500 times)

Phemtik

  • New Member
  • *
  • Posts: 19
This thread is an extension of the observation from the speed differences seen in the parallel code thread http://forum.lazarus.freepascal.org/index.php/topic,40099.30.html
But now under the name of the reason why there are differences.

It can hopefully used as reference for Bugfixes.
Also as background information, why right now are so huge gaps in speed with "div" and "mod" between some specific types.

After further tests, my result is that the 64-bit registers should be avoided wherever possible or only the 32-bit should be used.
Information material say that on all platforms (AMD/Intel) they are normally slower than the 32-bit registers and it is not necessary that the compiler use 64-bit registers with 32-bit values.

I found an exception from this rule and fpc already use that (sometimes) with UInt32 under "Div/Mod" but first some results.

System
CPU: i7-4770K@3.5Ghz
FPC: 3.1.1@r38365
Options: -O3 -CpCOREAVX -OpCOREAVX

loop 2,000,000,000 times         N mod/div 5  (floating dividend fixed divisior)
Times:   
"Mod" UInt32 x86 : 2,580 s      Inline: 1,553 s
"Mod" UInt32 x64 : 2,579 s      Inline: 1,550 s
"Mod" Int32 x86   : 5,685 s      Inline: 5,079 s
"Mod" Int32 x64   : 15,997 s    Inline: 15,503 s

"Div"  UInt32 x86 : 2,591 s      Inline: 1,207 s
"Div"  UInt32 x64 : 2,587 s      Inline: 1,208 s
"Div"  Int32 x86   : 2,591 s      Inline: 1,551 s
"Div"  Int32 x64   : 2,587 s      Inline: 1,558 s
Code: [Select]
procedure Test1(const N: TTestType); {Inline;}
  var
    Q: TTestType;
  begin
    Q := N mod/div 5;
  end;

loop 200,000,000 x 10 times      N mod/div I  (floating dividend and divisior)
Times:
"Mod" UInt32 x86 : 6,394 s      Inline: 6,290 s
"Mod" UInt32 x64 : 6,374 s      Inline: 5,528 s   -> EXCEPTION
"Mod" Int32 x86   : 4,850 s      Inline: 4,741 s
"Mod" Int32 x64   : 15,120 s    Inline: 14,073 s

"Div"  UInt32 x86 : 6,409 s      Inline: 6,294 s
"Div"  UInt32 x64 : 6,398 s      Inline: 5,545 s  -> EXCEPTION
"Div"  Int32 x86   : 4,871 s      Inline: 4,750 s
"Div"  Int32 x64   : 15,266 s    Inline: 14,143 s
Code: [Select]
procedure Test2(const N: TTestType); {Inline;}
  var
    I: TTestType;
    Q: TTestType;
  begin
    for I := 1 to 10 do
      Q := N mod/div I;
  end

Fixed Divisor

"Div" is really fast in this test and uses an optimized code that fpc generate because the divisor is an fixed value.
Code: Pascal  [Select][+][-]
  1. project1.pas:28                   begin
  2. 00401680 89c2                     mov    %eax,%edx
  3. project1.pas:29                   Q := N div 5;
  4. 00401682 b8cdcccccc               mov    $0xcccccccd,%eax
  5. 00401687 f7e2                     mul    %edx
  6. 00401689 c1ea02                   shr    $0x2,%edx

that is the assembly of the UInt32 x86 version x64 do the same but with 64-bit registers.
As we see fpc multiply and don't use a div opcode. multiplication is much faster than dividing.
And "Mod" look like the same.
Code: Pascal  [Select][+][-]
  1. project1.pas:28                   begin
  2. 00401680 89c1                     mov    %eax,%ecx
  3. project1.pas:29                   Q := N mod 5;
  4. 00401682 b8cdcccccc               mov    $0xcccccccd,%eax
  5. 00401687 f7e1                     mul    %ecx
  6. 00401689 c1ea02                   shr    $0x2,%edx
  7. 0040168C 8d1492                   lea    (%edx,%edx,4),%edx
  8. 0040168F 29d1                     sub    %edx,%ecx
  9. project1.pas:30                   end;

after the wiki "mod" is that formula:
 I mod J = I - (I div J ) * J

The UInt32 "mod" do exactly that. We can even say fpc only extend the "div" calculation.
But with Int32 he do that
Code: Pascal  [Select][+][-]
  1. project1.pas:28                   begin
  2. 00401680 ba55555555               mov    $0x55555555,%edx
  3. project1.pas:29                   Q := N mod 5;
  4. 00401685 99                       cltd  
  5. 00401686 b905000000               mov    $0x5,%ecx
  6. 0040168B f7f9                     idiv   %ecx
  7. project1.pas:30                   end;

He uses the slow IDIV opcode.
So i used that code to emulate the "mod".
Code: Pascal  [Select][+][-]
  1. procedure Test3(const N: TTestType); {Inline;}
  2.   var
  3.     Q: TTestType;
  4.   begin
  5.     Q := N div 5;
  6.     Q := Q * 5;
  7.     Q := N - Q;
  8.   end;
Times:
"Mod" Int32 x86   : 3,325 s      Inline: 1,821 s
"Mod" Int32 x64   : 3,106 s      Inline: 1,814 s

Now we have mostly expected times from a signed "mod" function.
It seems that the optimized "div" was forgotten to convert to "mod". ::)

Floating Dividend and Divisor

With both floating, fpc can not optimize any more and we have to use DIV/IDIV opcodes for the calculation of "mod" and "div".
there is no difference between "div" and "mod" in the assembly code (ok he copy a different register for the result).
as you can see there is also no speed difference.

why the Int32 is faster than UInt32 in x86 could only because of IDIV in Int32 and DIV in UInt32.
In the Handout it is written that IDIV is faster than DIV, but that is only true for Intel and i have no AMD for testing.   :(
Otherwise i have no idea why it is faster.

x86 Int32  "div"
Code: Pascal  [Select][+][-]
  1. project1.pas:20                   begin
  2. 00401680 53                       push   %ebx
  3. 00401681 56                       push   %esi
  4. 00401682 89c1                     mov    %eax,%ecx
  5. 00401684 b855555555               mov    $0x55555555,%eax
  6. 00401689 be55555555               mov    $0x55555555,%esi
  7. project1.pas:21                   for I := 1 to 10 do
  8. 0040168E 31db                     xor    %ebx,%ebx
  9. 00401690 83c301                   add    $0x1,%ebx
  10. project1.pas:22                   Q := N div I;
  11. 00401693 89c8                     mov    %ecx,%eax
  12. 00401695 99                       cltd  
  13. 00401696 f7fb                     idiv   %ebx
  14. 00401698 89c6                     mov    %eax,%esi
  15. project1.pas:21                   for I := 1 to 10 do
  16. 0040169A 83fb0a                   cmp    $0xa,%ebx
  17. 0040169D 7cf1                     jl     0x401690 <MODTEST+16>
  18. project1.pas:23                   end;

x86 UInt32 "div"
Code: Pascal  [Select][+][-]
  1. project1.pas:20                   begin
  2. 00401680 53                       push   %ebx
  3. 00401681 56                       push   %esi
  4. 00401682 89c1                     mov    %eax,%ecx
  5. 00401684 b855555555               mov    $0x55555555,%eax
  6. 00401689 be55555555               mov    $0x55555555,%esi
  7. project1.pas:21                   for I := 1 to 10 do
  8. 0040168E 31db                     xor    %ebx,%ebx
  9. 00401690 83c301                   add    $0x1,%ebx
  10. project1.pas:22                   Q := N div I;
  11. 00401693 89c8                     mov    %ecx,%eax
  12. 00401695 31d2                     xor    %edx,%edx
  13. 00401697 f7f3                     div    %ebx
  14. 00401699 89c6                     mov    %eax,%esi
  15. project1.pas:21                   for I := 1 to 10 do
  16. 0040169B 83fb0a                   cmp    $0xa,%ebx
  17. 0040169E 72f0                     jb     0x401690 <MODTEST+16>
  18. project1.pas:23                   end;

now the more interesting part why the x64 operation is faster than x86 and why Int32 is much slower.

first the assembly code

x64 Int32
Code: Pascal  [Select][+][-]
  1. project1.pas:21                           for I := 1 to 10 do
  2. 00000001000016A0 4531c9                   xor    %r9d,%r9d
  3. 00000001000016A3 0f1f440000               nopl   0x0(%rax,%rax,1)
  4. 00000001000016A8 4183c101                 add    $0x1,%r9d
  5. project1.pas:22                           Q := N div I;
  6. 00000001000016AC 4863c1                   movslq %ecx,%rax
  7. 00000001000016AF 4d63c1                   movslq %r9d,%r8
  8. 00000001000016B2 4899                     cqto  
  9. 00000001000016B4 49f7f8                   idiv   %r8
  10. 00000001000016B7 4189c2                   mov    %eax,%r10d
  11. project1.pas:21                           for I := 1 to 10 do
  12. 00000001000016BA 4183f90a                 cmp    $0xa,%r9d
  13. 00000001000016BE 7ce8                     jl     0x1000016a8 <MODTEST+8>
  14. project1.pas:23                           end;

x64 UInt32
Code: Pascal  [Select][+][-]
  1. project1.pas:21                           for I := 1 to 10 do
  2. 00000001000016A0 4531c9                   xor    %r9d,%r9d
  3. 00000001000016A3 0f1f440000               nopl   0x0(%rax,%rax,1)
  4. 00000001000016A8 4183c101                 add    $0x1,%r9d
  5. project1.pas:22                           Q := N div I;
  6. 00000001000016AC 89c8                     mov    %ecx,%eax
  7. 00000001000016AE 31d2                     xor    %edx,%edx
  8. 00000001000016B0 41f7f1                   div    %r9d
  9. 00000001000016B3 4189c0                   mov    %eax,%r8d
  10. project1.pas:21                           for I := 1 to 10 do
  11. 00000001000016B6 4183f90a                 cmp    $0xa,%r9d
  12. 00000001000016BA 72ec                     jb     0x1000016a8 <MODTEST+8>

Now we see why the UInt32 64-Bit code is faster.
no push (and pop) of the %ebx and %esi registers, because 64-bit can freely use the r8 to r15 registers.

UInt32 is faster than Int32 because he uses only 32-bit registers.
r8 to r15 are 64-bit register but can be used as 32-bit registers, if called with an d on the end (%r8d).
And so he have full speed even with 64-bit registers.

Intel i7-3610QM
Fedora 28

 

TinyPortal © 2005-2018