Forum > FPC development

Speed Matters - Bad Optimization of Mod/Div on x86/x64 with 32-Bit values

(1/1)

Phemtik:
This thread is an extension of the observation from the speed differences seen in the parallel code thread http://forum.lazarus.freepascal.org/index.php/topic,40099.30.html
But now under the name of the reason why there are differences.

It can hopefully used as reference for Bugfixes.
Also as background information, why right now are so huge gaps in speed with "div" and "mod" between some specific types.

After further tests, my result is that the 64-bit registers should be avoided wherever possible or only the 32-bit should be used.
Information material say that on all platforms (AMD/Intel) they are normally slower than the 32-bit registers and it is not necessary that the compiler use 64-bit registers with 32-bit values.

I found an exception from this rule and fpc already use that (sometimes) with UInt32 under "Div/Mod" but first some results.

System
CPU: i7-4770K@3.5Ghz
FPC: 3.1.1@r38365
Options: -O3 -CpCOREAVX -OpCOREAVX

loop 2,000,000,000 times         N mod/div 5  (floating dividend fixed divisior)
Times:   
"Mod" UInt32 x86 : 2,580 s      Inline: 1,553 s
"Mod" UInt32 x64 : 2,579 s      Inline: 1,550 s
"Mod" Int32 x86   : 5,685 s      Inline: 5,079 s
"Mod" Int32 x64   : 15,997 s    Inline: 15,503 s

"Div"  UInt32 x86 : 2,591 s      Inline: 1,207 s
"Div"  UInt32 x64 : 2,587 s      Inline: 1,208 s
"Div"  Int32 x86   : 2,591 s      Inline: 1,551 s
"Div"  Int32 x64   : 2,587 s      Inline: 1,558 s

--- Code: ---procedure Test1(const N: TTestType); {Inline;}
  var
    Q: TTestType;
  begin
    Q := N mod/div 5;
  end;
--- End code ---

loop 200,000,000 x 10 times      N mod/div I  (floating dividend and divisior)
Times:
"Mod" UInt32 x86 : 6,394 s      Inline: 6,290 s
"Mod" UInt32 x64 : 6,374 s      Inline: 5,528 s   -> EXCEPTION
"Mod" Int32 x86   : 4,850 s      Inline: 4,741 s
"Mod" Int32 x64   : 15,120 s    Inline: 14,073 s

"Div"  UInt32 x86 : 6,409 s      Inline: 6,294 s
"Div"  UInt32 x64 : 6,398 s      Inline: 5,545 s  -> EXCEPTION
"Div"  Int32 x86   : 4,871 s      Inline: 4,750 s
"Div"  Int32 x64   : 15,266 s    Inline: 14,143 s

--- Code: ---procedure Test2(const N: TTestType); {Inline;}
  var
    I: TTestType;
    Q: TTestType;
  begin
    for I := 1 to 10 do
      Q := N mod/div I;
  end
--- End code ---

Fixed Divisor

"Div" is really fast in this test and uses an optimized code that fpc generate because the divisor is an fixed value.

--- Code: Pascal  [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---project1.pas:28                   begin00401680 89c2                     mov    %eax,%edxproject1.pas:29                   Q := N div 5;00401682 b8cdcccccc               mov    $0xcccccccd,%eax00401687 f7e2                     mul    %edx00401689 c1ea02                   shr    $0x2,%edx
that is the assembly of the UInt32 x86 version x64 do the same but with 64-bit registers.
As we see fpc multiply and don't use a div opcode. multiplication is much faster than dividing.
And "Mod" look like the same.

--- Code: Pascal  [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---project1.pas:28                   begin00401680 89c1                     mov    %eax,%ecxproject1.pas:29                   Q := N mod 5;00401682 b8cdcccccc               mov    $0xcccccccd,%eax00401687 f7e1                     mul    %ecx00401689 c1ea02                   shr    $0x2,%edx0040168C 8d1492                   lea    (%edx,%edx,4),%edx0040168F 29d1                     sub    %edx,%ecxproject1.pas:30                   end;
after the wiki "mod" is that formula:
 I mod J = I - (I div J ) * J

The UInt32 "mod" do exactly that. We can even say fpc only extend the "div" calculation.
But with Int32 he do that

--- Code: Pascal  [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---project1.pas:28                   begin00401680 ba55555555               mov    $0x55555555,%edxproject1.pas:29                   Q := N mod 5;00401685 99                       cltd   00401686 b905000000               mov    $0x5,%ecx0040168B f7f9                     idiv   %ecxproject1.pas:30                   end;
He uses the slow IDIV opcode.
So i used that code to emulate the "mod".

--- Code: Pascal  [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---procedure Test3(const N: TTestType); {Inline;}  var    Q: TTestType;  begin    Q := N div 5;    Q := Q * 5;    Q := N - Q;  end;Times:
"Mod" Int32 x86   : 3,325 s      Inline: 1,821 s
"Mod" Int32 x64   : 3,106 s      Inline: 1,814 s

Now we have mostly expected times from a signed "mod" function.
It seems that the optimized "div" was forgotten to convert to "mod". ::)

Floating Dividend and Divisor

With both floating, fpc can not optimize any more and we have to use DIV/IDIV opcodes for the calculation of "mod" and "div".
there is no difference between "div" and "mod" in the assembly code (ok he copy a different register for the result).
as you can see there is also no speed difference.

why the Int32 is faster than UInt32 in x86 could only because of IDIV in Int32 and DIV in UInt32.
In the Handout it is written that IDIV is faster than DIV, but that is only true for Intel and i have no AMD for testing.   :(
Otherwise i have no idea why it is faster.

x86 Int32  "div"

--- Code: Pascal  [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---project1.pas:20                   begin00401680 53                       push   %ebx00401681 56                       push   %esi00401682 89c1                     mov    %eax,%ecx00401684 b855555555               mov    $0x55555555,%eax00401689 be55555555               mov    $0x55555555,%esiproject1.pas:21                   for I := 1 to 10 do0040168E 31db                     xor    %ebx,%ebx00401690 83c301                   add    $0x1,%ebxproject1.pas:22                   Q := N div I;00401693 89c8                     mov    %ecx,%eax00401695 99                       cltd   00401696 f7fb                     idiv   %ebx00401698 89c6                     mov    %eax,%esiproject1.pas:21                   for I := 1 to 10 do0040169A 83fb0a                   cmp    $0xa,%ebx0040169D 7cf1                     jl     0x401690 <MODTEST+16>project1.pas:23                   end;
x86 UInt32 "div"

--- Code: Pascal  [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---project1.pas:20                   begin00401680 53                       push   %ebx00401681 56                       push   %esi00401682 89c1                     mov    %eax,%ecx00401684 b855555555               mov    $0x55555555,%eax00401689 be55555555               mov    $0x55555555,%esiproject1.pas:21                   for I := 1 to 10 do0040168E 31db                     xor    %ebx,%ebx00401690 83c301                   add    $0x1,%ebxproject1.pas:22                   Q := N div I;00401693 89c8                     mov    %ecx,%eax00401695 31d2                     xor    %edx,%edx00401697 f7f3                     div    %ebx00401699 89c6                     mov    %eax,%esiproject1.pas:21                   for I := 1 to 10 do0040169B 83fb0a                   cmp    $0xa,%ebx0040169E 72f0                     jb     0x401690 <MODTEST+16>project1.pas:23                   end;
now the more interesting part why the x64 operation is faster than x86 and why Int32 is much slower.

first the assembly code

x64 Int32

--- Code: Pascal  [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---project1.pas:21                           for I := 1 to 10 do00000001000016A0 4531c9                   xor    %r9d,%r9d00000001000016A3 0f1f440000               nopl   0x0(%rax,%rax,1)00000001000016A8 4183c101                 add    $0x1,%r9dproject1.pas:22                           Q := N div I;00000001000016AC 4863c1                   movslq %ecx,%rax00000001000016AF 4d63c1                   movslq %r9d,%r800000001000016B2 4899                     cqto   00000001000016B4 49f7f8                   idiv   %r800000001000016B7 4189c2                   mov    %eax,%r10dproject1.pas:21                           for I := 1 to 10 do00000001000016BA 4183f90a                 cmp    $0xa,%r9d00000001000016BE 7ce8                     jl     0x1000016a8 <MODTEST+8>project1.pas:23                           end;
x64 UInt32

--- Code: Pascal  [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---project1.pas:21                           for I := 1 to 10 do00000001000016A0 4531c9                   xor    %r9d,%r9d00000001000016A3 0f1f440000               nopl   0x0(%rax,%rax,1)00000001000016A8 4183c101                 add    $0x1,%r9dproject1.pas:22                           Q := N div I;00000001000016AC 89c8                     mov    %ecx,%eax00000001000016AE 31d2                     xor    %edx,%edx00000001000016B0 41f7f1                   div    %r9d00000001000016B3 4189c0                   mov    %eax,%r8dproject1.pas:21                           for I := 1 to 10 do00000001000016B6 4183f90a                 cmp    $0xa,%r9d00000001000016BA 72ec                     jb     0x1000016a8 <MODTEST+8>
Now we see why the UInt32 64-Bit code is faster.
no push (and pop) of the %ebx and %esi registers, because 64-bit can freely use the r8 to r15 registers.

UInt32 is faster than Int32 because he uses only 32-bit registers.
r8 to r15 are 64-bit register but can be used as 32-bit registers, if called with an d on the end (%r8d).
And so he have full speed even with 64-bit registers.

Navigation

[0] Message Index

Go to full version