Speed Matters - Bad Optimization of Mod/Div on x86/x64 with 32-Bit values

Forum > FPC development

(1/1)

Phemtik:
This thread is an extension of the observation from the speed differences seen in the parallel code thread http://forum.lazarus.freepascal.org/index.php/topic,40099.30.html
But now under the name of the reason why there are differences.

It can hopefully used as reference for Bugfixes.
Also as background information, why right now are so huge gaps in speed with "div" and "mod" between some specific types.

After further tests, my result is that the 64-bit registers should be avoided wherever possible or only the 32-bit should be used.
Information material say that on all platforms (AMD/Intel) they are normally slower than the 32-bit registers and it is not necessary that the compiler use 64-bit registers with 32-bit values.

I found an exception from this rule and fpc already use that (sometimes) with UInt32 under "Div/Mod" but first some results.

System
CPU: i7-4770K@3.5Ghz
FPC: 3.1.1@r38365
Options: -O3 -CpCOREAVX -OpCOREAVX

loop 2,000,000,000 times N mod/div 5 (floating dividend fixed divisior)
Times:
"Mod" UInt32 x86 : 2,580 s Inline: 1,553 s
"Mod" UInt32 x64 : 2,579 s Inline: 1,550 s
"Mod" Int32 x86 : 5,685 s Inline: 5,079 s
"Mod" Int32 x64 : 15,997 s Inline: 15,503 s

"Div" UInt32 x86 : 2,591 s Inline: 1,207 s
"Div" UInt32 x64 : 2,587 s Inline: 1,208 s
"Div" Int32 x86 : 2,591 s Inline: 1,551 s
"Div" Int32 x64 : 2,587 s Inline: 1,558 s

--- Code: ---procedure Test1(const N: TTestType); {Inline;}
var
Q: TTestType;
begin
Q := N mod/div 5;
end;
--- End code ---

loop 200,000,000 x 10 times N mod/div I (floating dividend and divisior)
Times:
"Mod" UInt32 x86 : 6,394 s Inline: 6,290 s
"Mod" UInt32 x64 : 6,374 s Inline: 5,528 s -> EXCEPTION
"Mod" Int32 x86 : 4,850 s Inline: 4,741 s
"Mod" Int32 x64 : 15,120 s Inline: 14,073 s

"Div" UInt32 x86 : 6,409 s Inline: 6,294 s
"Div" UInt32 x64 : 6,398 s Inline: 5,545 s -> EXCEPTION
"Div" Int32 x86 : 4,871 s Inline: 4,750 s
"Div" Int32 x64 : 15,266 s Inline: 14,143 s

--- Code: ---procedure Test2(const N: TTestType); {Inline;}
var
I: TTestType;
Q: TTestType;
begin
for I := 1 to 10 do
Q := N mod/div I;
end
--- End code ---

Fixed Divisor

"Div" is really fast in this test and uses an optimized code that fpc generate because the divisor is an fixed value.

--- Code: Pascal [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---project1.pas:28 begin00401680 89c2 mov %eax,%edxproject1.pas:29 Q := N div 5;00401682 b8cdcccccc mov $0xcccccccd,%eax00401687 f7e2 mul %edx00401689 c1ea02 shr $0x2,%edx
that is the assembly of the UInt32 x86 version x64 do the same but with 64-bit registers.
As we see fpc multiply and don't use a div opcode. multiplication is much faster than dividing.
And "Mod" look like the same.

--- Code: Pascal [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---project1.pas:28 begin00401680 89c1 mov %eax,%ecxproject1.pas:29 Q := N mod 5;00401682 b8cdcccccc mov $0xcccccccd,%eax00401687 f7e1 mul %ecx00401689 c1ea02 shr $0x2,%edx0040168C 8d1492 lea (%edx,%edx,4),%edx0040168F 29d1 sub %edx,%ecxproject1.pas:30 end;
after the wiki "mod" is that formula:
I mod J = I - (I div J ) * J

The UInt32 "mod" do exactly that. We can even say fpc only extend the "div" calculation.
But with Int32 he do that

--- Code: Pascal [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---project1.pas:28 begin00401680 ba55555555 mov $0x55555555,%edxproject1.pas:29 Q := N mod 5;00401685 99 cltd 00401686 b905000000 mov $0x5,%ecx0040168B f7f9 idiv %ecxproject1.pas:30 end;
He uses the slow IDIV opcode.
So i used that code to emulate the "mod".

--- Code: Pascal [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---procedure Test3(const N: TTestType); {Inline;} var Q: TTestType; begin Q := N div 5; Q := Q * 5; Q := N - Q; end;Times:
"Mod" Int32 x86 : 3,325 s Inline: 1,821 s
"Mod" Int32 x64 : 3,106 s Inline: 1,814 s

Now we have mostly expected times from a signed "mod" function.
It seems that the optimized "div" was forgotten to convert to "mod". ::)

Floating Dividend and Divisor

With both floating, fpc can not optimize any more and we have to use DIV/IDIV opcodes for the calculation of "mod" and "div".
there is no difference between "div" and "mod" in the assembly code (ok he copy a different register for the result).
as you can see there is also no speed difference.

why the Int32 is faster than UInt32 in x86 could only because of IDIV in Int32 and DIV in UInt32.
In the Handout it is written that IDIV is faster than DIV, but that is only true for Intel and i have no AMD for testing. :(
Otherwise i have no idea why it is faster.

x86 Int32 "div"

--- Code: Pascal [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---project1.pas:20 begin00401680 53 push %ebx00401681 56 push %esi00401682 89c1 mov %eax,%ecx00401684 b855555555 mov $0x55555555,%eax00401689 be55555555 mov $0x55555555,%esiproject1.pas:21 for I := 1 to 10 do0040168E 31db xor %ebx,%ebx00401690 83c301 add $0x1,%ebxproject1.pas:22 Q := N div I;00401693 89c8 mov %ecx,%eax00401695 99 cltd 00401696 f7fb idiv %ebx00401698 89c6 mov %eax,%esiproject1.pas:21 for I := 1 to 10 do0040169A 83fb0a cmp $0xa,%ebx0040169D 7cf1 jl 0x401690 <MODTEST+16>project1.pas:23 end;
x86 UInt32 "div"

--- Code: Pascal [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---project1.pas:20 begin00401680 53 push %ebx00401681 56 push %esi00401682 89c1 mov %eax,%ecx00401684 b855555555 mov $0x55555555,%eax00401689 be55555555 mov $0x55555555,%esiproject1.pas:21 for I := 1 to 10 do0040168E 31db xor %ebx,%ebx00401690 83c301 add $0x1,%ebxproject1.pas:22 Q := N div I;00401693 89c8 mov %ecx,%eax00401695 31d2 xor %edx,%edx00401697 f7f3 div %ebx00401699 89c6 mov %eax,%esiproject1.pas:21 for I := 1 to 10 do0040169B 83fb0a cmp $0xa,%ebx0040169E 72f0 jb 0x401690 <MODTEST+16>project1.pas:23 end;
now the more interesting part why the x64 operation is faster than x86 and why Int32 is much slower.

first the assembly code

x64 Int32

--- Code: Pascal [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---project1.pas:21 for I := 1 to 10 do00000001000016A0 4531c9 xor %r9d,%r9d00000001000016A3 0f1f440000 nopl 0x0(%rax,%rax,1)00000001000016A8 4183c101 add $0x1,%r9dproject1.pas:22 Q := N div I;00000001000016AC 4863c1 movslq %ecx,%rax00000001000016AF 4d63c1 movslq %r9d,%r800000001000016B2 4899 cqto 00000001000016B4 49f7f8 idiv %r800000001000016B7 4189c2 mov %eax,%r10dproject1.pas:21 for I := 1 to 10 do00000001000016BA 4183f90a cmp $0xa,%r9d00000001000016BE 7ce8 jl 0x1000016a8 <MODTEST+8>project1.pas:23 end;
x64 UInt32

--- Code: Pascal [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---project1.pas:21 for I := 1 to 10 do00000001000016A0 4531c9 xor %r9d,%r9d00000001000016A3 0f1f440000 nopl 0x0(%rax,%rax,1)00000001000016A8 4183c101 add $0x1,%r9dproject1.pas:22 Q := N div I;00000001000016AC 89c8 mov %ecx,%eax00000001000016AE 31d2 xor %edx,%edx00000001000016B0 41f7f1 div %r9d00000001000016B3 4189c0 mov %eax,%r8dproject1.pas:21 for I := 1 to 10 do00000001000016B6 4183f90a cmp $0xa,%r9d00000001000016BA 72ec jb 0x1000016a8 <MODTEST+8>
Now we see why the UInt32 64-Bit code is faster.
no push (and pop) of the %ebx and %esi registers, because 64-bit can freely use the r8 to r15 registers.

UInt32 is faster than Int32 because he uses only 32-bit registers.
r8 to r15 are 64-bit register but can be used as 32-bit registers, if called with an d on the end (%r8d).
And so he have full speed even with 64-bit registers.

Navigation

[0] Message Index