This thread is an extension of the observation from the speed differences seen in the parallel code thread
http://forum.lazarus.freepascal.org/index.php/topic,40099.30.htmlBut now under the name of the reason why there are differences.
It can hopefully used as reference for Bugfixes.
Also as background information, why right now are so huge gaps in speed with "div" and "mod" between some specific types.
After further tests, my result is that the 64-bit registers should be avoided wherever possible or only the 32-bit should be used.
Information material say that on all platforms (AMD/Intel) they are normally slower than the 32-bit registers and it is not necessary that the compiler use 64-bit registers with 32-bit values.
I found an exception from this rule and fpc already use that (sometimes) with UInt32 under "Div/Mod" but first some results.
System
CPU: i7-4770K@3.5Ghz
FPC: 3.1.1@r38365
Options: -O3 -CpCOREAVX -OpCOREAVX
loop 2,000,000,000 times N mod/div 5 (floating dividend fixed divisior)
Times:
"Mod" UInt32 x86 : 2,580 s Inline: 1,553 s
"Mod" UInt32 x64 : 2,579 s Inline: 1,550 s
"Mod" Int32 x86 : 5,685 s Inline: 5,079 s
"Mod" Int32 x64 : 15,997 s Inline: 15,503 s
"Div" UInt32 x86 : 2,591 s Inline: 1,207 s
"Div" UInt32 x64 : 2,587 s Inline: 1,208 s
"Div" Int32 x86 : 2,591 s Inline: 1,551 s
"Div" Int32 x64 : 2,587 s Inline: 1,558 s
procedure Test1(const N: TTestType); {Inline;}
var
Q: TTestType;
begin
Q := N mod/div 5;
end;
loop 200,000,000 x 10 times N mod/div I (floating dividend and divisior)
Times:
"Mod" UInt32 x86 : 6,394 s Inline: 6,290 s
"Mod" UInt32 x64 : 6,374 s Inline: 5,528 s -> EXCEPTION
"Mod" Int32 x86 : 4,850 s Inline: 4,741 s
"Mod" Int32 x64 : 15,120 s Inline: 14,073 s
"Div" UInt32 x86 : 6,409 s Inline: 6,294 s
"Div" UInt32 x64 : 6,398 s Inline: 5,545 s -> EXCEPTION
"Div" Int32 x86 : 4,871 s Inline: 4,750 s
"Div" Int32 x64 : 15,266 s Inline: 14,143 s
procedure Test2(const N: TTestType); {Inline;}
var
I: TTestType;
Q: TTestType;
begin
for I := 1 to 10 do
Q := N mod/div I;
end
Fixed Divisor"Div" is really fast in this test and uses an optimized code that fpc generate because the divisor is an fixed value.
project1.pas:28 begin
00401680 89c2 mov %eax,%edx
project1.pas:29 Q := N div 5;
00401682 b8cdcccccc mov $0xcccccccd,%eax
00401687 f7e2 mul %edx
00401689 c1ea02 shr $0x2,%edx
that is the assembly of the UInt32 x86 version x64 do the same but with 64-bit registers.
As we see fpc multiply and don't use a div opcode. multiplication is much faster than dividing.
And "Mod" look like the same.
project1.pas:28 begin
00401680 89c1 mov %eax,%ecx
project1.pas:29 Q := N mod 5;
00401682 b8cdcccccc mov $0xcccccccd,%eax
00401687 f7e1 mul %ecx
00401689 c1ea02 shr $0x2,%edx
0040168C 8d1492 lea (%edx,%edx,4),%edx
0040168F 29d1 sub %edx,%ecx
project1.pas:30 end;
after the wiki "mod" is that formula:
I mod J = I - (I div J ) * J
The UInt32 "mod" do exactly that. We can even say fpc only extend the "div" calculation.
But with Int32 he do that
project1.pas:28 begin
00401680 ba55555555 mov $0x55555555,%edx
project1.pas:29 Q := N mod 5;
00401685 99 cltd
00401686 b905000000 mov $0x5,%ecx
0040168B f7f9 idiv %ecx
project1.pas:30 end;
He uses the slow IDIV opcode.
So i used that code to emulate the "mod".
procedure Test3(const N: TTestType); {Inline;}
var
Q: TTestType;
begin
Q := N div 5;
Q := Q * 5;
Q := N - Q;
end;
Times:
"Mod" Int32 x86 : 3,325 s Inline: 1,821 s
"Mod" Int32 x64 : 3,106 s Inline: 1,814 s
Now we have mostly expected times from a signed "mod" function.
It seems that the optimized "div" was forgotten to convert to "mod".
Floating Dividend and Divisor With both floating, fpc can not optimize any more and we have to use DIV/IDIV opcodes for the calculation of "mod" and "div".
there is no difference between "div" and "mod" in the assembly code (ok he copy a different register for the result).
as you can see there is also no speed difference.
why the Int32 is faster than UInt32 in x86 could only because of IDIV in Int32 and DIV in UInt32.
In the Handout it is written that IDIV is faster than DIV, but that is only true for Intel and i have no AMD for testing.
Otherwise i have no idea why it is faster.
x86 Int32 "div"
project1.pas:20 begin
00401680 53 push %ebx
00401681 56 push %esi
00401682 89c1 mov %eax,%ecx
00401684 b855555555 mov $0x55555555,%eax
00401689 be55555555 mov $0x55555555,%esi
project1.pas:21 for I := 1 to 10 do
0040168E 31db xor %ebx,%ebx
00401690 83c301 add $0x1,%ebx
project1.pas:22 Q := N div I;
00401693 89c8 mov %ecx,%eax
00401695 99 cltd
00401696 f7fb idiv %ebx
00401698 89c6 mov %eax,%esi
project1.pas:21 for I := 1 to 10 do
0040169A 83fb0a cmp $0xa,%ebx
0040169D 7cf1 jl 0x401690 <MODTEST+16>
project1.pas:23 end;
x86 UInt32 "div"
project1.pas:20 begin
00401680 53 push %ebx
00401681 56 push %esi
00401682 89c1 mov %eax,%ecx
00401684 b855555555 mov $0x55555555,%eax
00401689 be55555555 mov $0x55555555,%esi
project1.pas:21 for I := 1 to 10 do
0040168E 31db xor %ebx,%ebx
00401690 83c301 add $0x1,%ebx
project1.pas:22 Q := N div I;
00401693 89c8 mov %ecx,%eax
00401695 31d2 xor %edx,%edx
00401697 f7f3 div %ebx
00401699 89c6 mov %eax,%esi
project1.pas:21 for I := 1 to 10 do
0040169B 83fb0a cmp $0xa,%ebx
0040169E 72f0 jb 0x401690 <MODTEST+16>
project1.pas:23 end;
now the more interesting part why the x64 operation is faster than x86 and why Int32 is much slower.
first the assembly code
x64 Int32
project1.pas:21 for I := 1 to 10 do
00000001000016A0 4531c9 xor %r9d,%r9d
00000001000016A3 0f1f440000 nopl 0x0(%rax,%rax,1)
00000001000016A8 4183c101 add $0x1,%r9d
project1.pas:22 Q := N div I;
00000001000016AC 4863c1 movslq %ecx,%rax
00000001000016AF 4d63c1 movslq %r9d,%r8
00000001000016B2 4899 cqto
00000001000016B4 49f7f8 idiv %r8
00000001000016B7 4189c2 mov %eax,%r10d
project1.pas:21 for I := 1 to 10 do
00000001000016BA 4183f90a cmp $0xa,%r9d
00000001000016BE 7ce8 jl 0x1000016a8 <MODTEST+8>
project1.pas:23 end;
x64 UInt32
project1.pas:21 for I := 1 to 10 do
00000001000016A0 4531c9 xor %r9d,%r9d
00000001000016A3 0f1f440000 nopl 0x0(%rax,%rax,1)
00000001000016A8 4183c101 add $0x1,%r9d
project1.pas:22 Q := N div I;
00000001000016AC 89c8 mov %ecx,%eax
00000001000016AE 31d2 xor %edx,%edx
00000001000016B0 41f7f1 div %r9d
00000001000016B3 4189c0 mov %eax,%r8d
project1.pas:21 for I := 1 to 10 do
00000001000016B6 4183f90a cmp $0xa,%r9d
00000001000016BA 72ec jb 0x1000016a8 <MODTEST+8>
Now we see why the UInt32 64-Bit code is faster.
no push (and pop) of the %ebx and %esi registers, because 64-bit can freely use the r8 to r15 registers.
UInt32 is faster than Int32 because he uses only 32-bit registers.
r8 to r15 are 64-bit register but can be used as 32-bit registers, if called with an d on the end (%r8d).
And so he have full speed even with 64-bit registers.