64 bit oddities

Madoc:
The 64-bit compiler does something rather inefficient when you write code like this:


--- Code: Pascal ---
var
  x, a: single;

  a := x * x * (x * (x * (x * (x * (x * -0.1337503 + 1.2015647) -
       4.0146944) + 5.7006375) - 2.1262625) - 1.6274950) + 1.0;
--- End code ---

The compiler will load each one of those constant values as a double, and then immediately convert it to a single. This is obviously pointless and expensive. It's not like we can use 1.0f like in C, and writing Single(1.0) for every value is rather cumbersome.

I've noticed a pattern: the compiler seems to assume that because the platform is 64-bit, all types should suddenly be 64-bit, not just pointers. This also leads to some weird implicit conversions and sometimes nasty sign extension bugs. With pointers being the obvious exception, types are chosen by the programmer based on usage and requirements. Nobody switches between single and double precision floating point math because of the platform.

In the Lazarus math unit there is a Float type defined as single or double precision depending on the platform. Floats are universally known (including in my large code base) as single precision. Again, the requirement for single or double precision depends on the application, not the platform. I have my own math functions so this isn't a big deal for me, but it still seems bad.

There is also the question of "integer" being the commonly used integer type (e.g. for loops) in Pascal code, which unlike C's int is not platform specific (though I think C's int is always 32 bit in practice anyway). FPC seems to convert to 64 bit anyway, with some bugs arising with negative values in loop bounds. I'm not sure what the FPC developers envisioned for these types. So far I've been rewriting loops in funky ways and avoiding signed math to work around the bugs.

Anyway, personally I'd like to see this gross inefficiency with constant float values fixed, or if there is a fix I'm unaware of please do tell. I'm unfamiliar with FPC's 32 bit code, so I might be missing something here, but it seems to me there's a notion that common basic types should change and function differently based on the platform, and if so it should be dispelled.


Thaddy:
Can you show us a bit of assembler output of what you mean?
Because the behavior you describe is NOT what is documented, nor what my code shows. What is documented is that the compiler may expand single values to double to keep precision and avoid overflows. It then performs the calculations as double and only then converts the result back to single. Most decent compilers do that. The generated code is very fast and the conversions take hardly any cycles.
I compiled like this:
fpc -CX -XXs -al -CfAVX2 -O4 untitled.pas (your calculation);

--- Code: ASM ---
# [4] begin
        leaq    -56(%rsp),%rsp
.Lc3:
.seh_stackalloc 56
        vmovdqa %xmm6,32(%rsp)
.seh_savexmm %xmm6, 32
.seh_endprologue
# Var x located in register xmm0
        call    fpc_initializeunits
.Ll2:
        vpxor   %xmm0,%xmm0,%xmm0
.Ll3:
# [5] a := x * x * (x * (x * (x * (x * (x * -0.1337503 + 1.2015647) -
        vmulss  %xmm0,%xmm0,%xmm1
.Ll4:
# [6] 4.0146944) + 5.7006375) - 2.1262625) - 1.6274950) + 1.0;
        vcvtss2sd       %xmm1,%xmm1,%xmm5
.Ll5:
        vcvtss2sd       %xmm0,%xmm0,%xmm4
        vcvtss2sd       %xmm0,%xmm0,%xmm3
        vcvtss2sd       %xmm0,%xmm0,%xmm2
        vcvtss2sd       %xmm0,%xmm0,%xmm1
        vcvtss2sd       %xmm0,%xmm0,%xmm0
        vmovsd  _$PROGRAM$_Ld1(%rip),%xmm6
        vfmadd213sd     _$PROGRAM$_Ld2(%rip),%xmm6,%xmm0
        vfmadd213sd     _$PROGRAM$_Ld3(%rip),%xmm0,%xmm1
        vfmadd213sd     _$PROGRAM$_Ld4(%rip),%xmm1,%xmm2
        vfmadd213sd     _$PROGRAM$_Ld5(%rip),%xmm2,%xmm3
.Ll6:
        vfmadd213sd     _$PROGRAM$_Ld6(%rip),%xmm3,%xmm4
        vfmadd213sd     _$PROGRAM$_Ld7(%rip),%xmm4,%xmm5
# Var a located in register xmm0
.Ll7:
        vcvtsd2ss       %xmm5,%xmm5,%xmm0
.Ll8:
# [7] end.
        call    fpc_do_exit
.seh_endproc
--- End code ---

In default mode (-O2) I can't see that either.
I compiled like this:
fpc -CX -XXs -al untitled.pas

--- Code: ASM ---
# [5] a := x * x * (x * (x * (x * (x * (x * -0.1337503 + 1.2015647) -
        movss   U_$P$PROGRAM_$$_X(%rip),%xmm0
        mulss   U_$P$PROGRAM_$$_X(%rip),%xmm0
        cvtss2sd        %xmm0,%xmm2
        cvtss2sd        U_$P$PROGRAM_$$_X(%rip),%xmm0
        mulsd   _$PROGRAM$_Ld1(%rip),%xmm0
        addsd   _$PROGRAM$_Ld2(%rip),%xmm0
        cvtss2sd        U_$P$PROGRAM_$$_X(%rip),%xmm1
        mulsd   %xmm1,%xmm0
        subsd   _$PROGRAM$_Ld3(%rip),%xmm0
        cvtss2sd        U_$P$PROGRAM_$$_X(%rip),%xmm1
        mulsd   %xmm1,%xmm0
        addsd   _$PROGRAM$_Ld4(%rip),%xmm0
        cvtss2sd        U_$P$PROGRAM_$$_X(%rip),%xmm1
        mulsd   %xmm1,%xmm0
        subsd   _$PROGRAM$_Ld5(%rip),%xmm0
        cvtss2sd        U_$P$PROGRAM_$$_X(%rip),%xmm1
        mulsd   %xmm1,%xmm0
        subsd   _$PROGRAM$_Ld6(%rip),%xmm0
        mulsd   %xmm2,%xmm0
.Ll3:
# [6] 4.0146944) + 5.7006375) - 2.1262625) - 1.6274950) + 1.0;
        addsd   _$PROGRAM$_Ld7(%rip),%xmm0
.Ll4:
        cvtsd2ss        %xmm0,%xmm0
        movss   %xmm0,U_$P$PROGRAM_$$_A(%rip)
.Ll5:
# [7] end.
--- End code ---

Compiled for x86_64 on Windows. Output on Linux64 is almost equal.


BildatBoffin:

--- Quote ---The compiler will load each one of those constant values as a double, and then immediately convert it to a single
--- End quote ---

AFAIK one problem is that immediate values do not exist in the SSE or x87 instruction sets. Instructions that allow operand combinations like REG,IMM are the classic ALU ones, so a floating point constant has to be loaded from memory.

Thaddy:
As you can see from my disassembly his statement is simply not true.
Both the optimized one and the default.
Here's the GNU C code output with default settings, shiver and cry:

--- Code: C ---
#include <stdio.h>

int main(int argc, char **argv)
{
_Float32 a;
_Float32 x;
       a = x * x * (x * (x * (x * (x * (x * -0.1337503 + 1.2015647) - \
       4.0146944) + 5.7006375) - 2.1262625) - 1.6274950) + 1.0;
 return 0;
}
--- End code ---

--- Code: ASM ---
main:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movl    %edi, -20(%rbp)
        movq    %rsi, -32(%rbp)
        movss   -4(%rbp), %xmm0
        mulss   %xmm0, %xmm0
        pxor    %xmm1, %xmm1
        cvtss2sd        %xmm0, %xmm1
        pxor    %xmm2, %xmm2
        cvtss2sd        -4(%rbp), %xmm2
        pxor    %xmm3, %xmm3
        cvtss2sd        -4(%rbp), %xmm3
        pxor    %xmm4, %xmm4
        cvtss2sd        -4(%rbp), %xmm4
        pxor    %xmm5, %xmm5
        cvtss2sd        -4(%rbp), %xmm5
        pxor    %xmm6, %xmm6
        cvtss2sd        -4(%rbp), %xmm6
        movsd   .LC0(%rip), %xmm0
        mulsd   %xmm0, %xmm6
        movsd   .LC1(%rip), %xmm0
        addsd   %xmm6, %xmm0
        mulsd   %xmm5, %xmm0
        movsd   .LC2(%rip), %xmm5
        subsd   %xmm5, %xmm0
        mulsd   %xmm0, %xmm4
        movsd   .LC3(%rip), %xmm0
        addsd   %xmm4, %xmm0
        mulsd   %xmm3, %xmm0
        movsd   .LC4(%rip), %xmm3
        subsd   %xmm3, %xmm0
        mulsd   %xmm2, %xmm0
        movsd   .LC5(%rip), %xmm2
        subsd   %xmm2, %xmm0
        mulsd   %xmm0, %xmm1
        movsd   .LC6(%rip), %xmm0
        addsd   %xmm1, %xmm0
        cvtsd2ss        %xmm0, %xmm0
        movss   %xmm0, -8(%rbp)
        movl    $0, %eax
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
--- End code ---

The fpc code is magnitudes better. See for yourself. But you can also see that both compilers take the same approach.

BildatBoffin:
Yeah, the optimized version gets rid of most of the `vcvtss2sd`. My answer was actually almost fully off topic... OP did not complain about the loads... they are unavoidable.
