* * *

Topic: A little ASM output for anyone doubting FPC's optimization capabilities...

Akira1364

  • Sr. Member
  • ****
  • Posts: 327
Just thought I'd post this here, as I've been very impressed with the trunk compiler's code optimizer recently.

The relevant Pascal code:

Code: Pascal  [Select]
type
  TVertex = record
    X, Y, Z: Single;
  end;

function VertexAdd(constref V1, V2: TVertex): TVertex;
begin
  with Result do
  begin
    X := V1.X + V2.X;
    Y := V1.Y + V2.Y;
    Z := V1.Z + V2.Z;
  end;
end;

The relevant C code (using the JavaScript tags because they have the closest syntax highlighter):

Code: Javascript  [Select]
struct TVertex
{
    float X, Y, Z;
};

struct TVertex VertexAdd(const struct TVertex V1, const struct TVertex V2)
{
    struct TVertex Result;
    Result.X = V1.X + V2.X;
    Result.Y = V1.Y + V2.Y;
    Result.Z = V1.Z + V2.Z;
    return Result;
}

Trunk FPC ASM output (compiled with -CfAVX2 -CpCOREAVX2 -g- -O4 -OpCOREAVX2 -Sv):

Code: Pascal  [Select]
.section .text.n_unit1_$$_vertexadd$tvertex$tvertex$$tvertex,"x"
        .balign 16,0x90
.globl  UNIT1_$$_VERTEXADD$TVERTEX$TVERTEX$$TVERTEX
UNIT1_$$_VERTEXADD$TVERTEX$TVERTEX$$TVERTEX:
.Lc1:
        movq    %rcx,%rax
        vmovss  (%rdx),%xmm0
        vaddss  (%r8),%xmm0,%xmm0
        vmovss  %xmm0,(%rax)
        vmovss  4(%rdx),%xmm0
        vaddss  4(%r8),%xmm0,%xmm0
        vmovss  %xmm0,4(%rax)
        vmovss  8(%rdx),%xmm0
        vaddss  8(%r8),%xmm0,%xmm0
        vmovss  %xmm0,8(%rax)
        ret
.Lc2:

GCC 7.1 ASM output (compiled with -g0 -march=haswell -mtune=haswell -mavx2 -Ofast -s):

Code: Pascal  [Select]
        .globl  VertexAdd
        .def    VertexAdd;      .scl    2;      .type   32;     .endef
        .seh_proc       VertexAdd
VertexAdd:
        .seh_endprologue
        vmovss  4(%rdx), %xmm0
        vmovss  (%rdx), %xmm2
        vaddss  4(%r8), %xmm0, %xmm1
        vaddss  (%r8), %xmm2, %xmm2
        vmovss  8(%rdx), %xmm0
        vaddss  8(%r8), %xmm0, %xmm0
        movq    %rcx, %rax
        vmovss  %xmm2, (%rcx)
        vmovss  %xmm1, 4(%rcx)
        vmovss  %xmm0, 8(%rcx)
        ret
        .seh_endproc

Essentially the same code, just in a slightly different order, which demonstrates that FPC can and should be perfectly comparable with C/C++ when given input that maps directly one-to-one onto theirs. Very cool IMO!
« Last Edit: July 03, 2017, 12:39:45 am by Akira1364 »

ASerge

  • Sr. Member
  • ****
  • Posts: 454
Quote from Akira1364: ...which demonstrates that FPC can and should be perfectly comparable with C/C++ when given directly one-to-one-mappable input. Very cool IMO!
There is still room for improvement.
Suppose the CPU has only two execution pipelines. In the listings below, 1 marks an instruction that starts a new execution cycle, and + marks an instruction that executes in parallel with the previous one.
FPC
Code: [Select]
1        movq    %rcx,%rax
+        vmovss  (%rdx),%xmm0
1        vaddss  (%r8),%xmm0,%xmm0
1        vmovss  %xmm0,(%rax)
1        vmovss  4(%rdx),%xmm0
1        vaddss  4(%r8),%xmm0,%xmm0
1        vmovss  %xmm0,4(%rax)
1        vmovss  8(%rdx),%xmm0
1        vaddss  8(%r8),%xmm0,%xmm0
1        vmovss  %xmm0,8(%rax)
Total: 9 cycles.

GCC
Code: [Select]
1        vmovss  4(%rdx), %xmm0
+        vmovss  (%rdx), %xmm2
1        vaddss  4(%r8), %xmm0, %xmm1
+        vaddss  (%r8), %xmm2, %xmm2
1        vmovss  8(%rdx), %xmm0
1        vaddss  8(%r8), %xmm0, %xmm0
1        movq    %rcx, %rax
+        vmovss  %xmm2, (%rcx)
1        vmovss  %xmm1, 4(%rcx)
+        vmovss  %xmm0, 8(%rcx)
Total: 6 cycles.
FPC does use fewer registers, of course, but does that matter so much?
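
The counting rule above can be sketched as a small C program. This is a toy model, not a description of any real CPU: it assumes strictly in-order dual issue, single-cycle instructions, no register renaming (so a write-after-read on %xmm0 also blocks pairing), and no aliasing between the stores and the later loads. The names Insn, conflicts and count_cycles are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One bit per register that appears in the two listings. */
enum {
    RAX  = 1u << 0, RCX  = 1u << 1, RDX  = 1u << 2, R8 = 1u << 3,
    XMM0 = 1u << 4, XMM1 = 1u << 5, XMM2 = 1u << 6
};

/* Hypothetical instruction summary: registers read and registers written. */
typedef struct {
    unsigned reads;
    unsigned writes;
} Insn;

/* Nonzero if b cannot issue in the same cycle as the earlier instruction a. */
static int conflicts(Insn a, Insn b)
{
    return (a.writes & b.reads)  != 0   /* read-after-write  */
        || (a.writes & b.writes) != 0   /* write-after-write */
        || (a.reads  & b.writes) != 0;  /* write-after-read (no renaming assumed) */
}

/* Greedy in-order pairing: an instruction shares a cycle with its predecessor
   only if that predecessor opened the cycle and the two do not conflict. */
int count_cycles(const Insn *seq, size_t n)
{
    int cycles = 0;
    int slot_free = 0;  /* 1 while the current cycle still has its second slot */

    assert(seq != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (slot_free && !conflicts(seq[i - 1], seq[i])) {
            slot_free = 0;          /* fills the second pipeline */
        } else {
            cycles++;               /* starts a new cycle */
            slot_free = 1;
        }
    }
    return cycles;
}

/* FPC output, in program order (ret omitted, as in the annotated listing). */
static const Insn fpc_seq[] = {
    { RCX,        RAX  },  /* movq   %rcx,%rax          */
    { RDX,        XMM0 },  /* vmovss (%rdx),%xmm0       */
    { R8 | XMM0,  XMM0 },  /* vaddss (%r8),%xmm0,%xmm0  */
    { XMM0 | RAX, 0    },  /* vmovss %xmm0,(%rax)       */
    { RDX,        XMM0 },  /* vmovss 4(%rdx),%xmm0      */
    { R8 | XMM0,  XMM0 },  /* vaddss 4(%r8),%xmm0,%xmm0 */
    { XMM0 | RAX, 0    },  /* vmovss %xmm0,4(%rax)      */
    { RDX,        XMM0 },  /* vmovss 8(%rdx),%xmm0      */
    { R8 | XMM0,  XMM0 },  /* vaddss 8(%r8),%xmm0,%xmm0 */
    { XMM0 | RAX, 0    },  /* vmovss %xmm0,8(%rax)      */
};

/* GCC output, in program order. */
static const Insn gcc_seq[] = {
    { RDX,        XMM0 },  /* vmovss 4(%rdx), %xmm0        */
    { RDX,        XMM2 },  /* vmovss (%rdx), %xmm2         */
    { R8 | XMM0,  XMM1 },  /* vaddss 4(%r8), %xmm0, %xmm1  */
    { R8 | XMM2,  XMM2 },  /* vaddss (%r8), %xmm2, %xmm2   */
    { RDX,        XMM0 },  /* vmovss 8(%rdx), %xmm0        */
    { R8 | XMM0,  XMM0 },  /* vaddss 8(%r8), %xmm0, %xmm0  */
    { RCX,        RAX  },  /* movq   %rcx, %rax            */
    { XMM2 | RCX, 0    },  /* vmovss %xmm2, (%rcx)         */
    { XMM1 | RCX, 0    },  /* vmovss %xmm1, 4(%rcx)        */
    { XMM0 | RCX, 0    },  /* vmovss %xmm0, 8(%rcx)        */
};
```

Under these assumptions, count_cycles gives 9 for fpc_seq and 6 for gcc_seq, matching the totals above. The greedy rule pairs the movq in the GCC listing with the preceding vaddss rather than with the following store, so the individual pairings differ slightly from the hand annotation, but the total is the same. A real out-of-order core with register renaming would likely narrow the gap, since the write-after-read stalls on %xmm0 would disappear.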

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 5826
One can check cycle counts with the Intel IACA tool.

prino

  • Newbie
  • Posts: 2
So I installed FPC again, for yet another try, as I'm getting a bit tired of DB coding MMX/SSE/AVX instructions in Virtual Pascal.

Very, very, very obviously, my version of LIFT (in lift32bit.rar), which is nearly fully recoded into inline assembler, does not compile. But after changing eight variable definitions in hhcommon.pas from longint to word (for the DOS GetTime function), the "Pure Pascal" version compiles without problems.

However, it ABENDs (z/OS parlance for crash) in "readfile" with

RunTime Error 2130567168
Error address $00000000

Which is less than useful...

So let's just look a bit at the generated code...

My benchmark is a routine that does a Shellsort on an array of pointers, comparing two variables in the list items the pointers point to. The code generated for this with all optimizations enabled in the IDE makes me very sad, as it's hardly better than what Turbo Pascal used to generate more than three decades ago! Obviously I don't expect the code to come anywhere close to manually optimized code, but the generated code doesn't even try:

Code: [Select]
; [256] sort_wptr := wait^[_i];
mov eax,dword ptr [ebp-544]
mov edx,dword ptr [eax]
mov eax,dword ptr [ebp-524]
mov eax,dword ptr [edx+eax*4-4]
mov dword ptr [ebp-540],eax
; [258] sort_trip := wait^[_i]^.trip;
mov eax,dword ptr [ebp-544]
mov edx,dword ptr [eax]
mov eax,dword ptr [ebp-524]
mov edx,dword ptr [edx+eax*4-4]
mov eax,dword ptr [edx+8]
mov dword ptr [ebp-532],eax
; [259] sort_cnty := wait^[_i]^.s_cnty;
mov eax,dword ptr [ebp-544]
mov ecx,dword ptr [eax]
mov edx,dword ptr [ebp-524]
mov eax,dword ptr [ecx+edx*4-4]
mov eax,dword ptr [eax+64]
mov dword ptr [ebp-4],eax
; [260] sort_year := wait^[_i]^.date.dyear;
mov eax,dword ptr [ebp-544]
mov edx,dword ptr [eax]
mov eax,dword ptr [ebp-524]
mov edx,dword ptr [edx+eax*4-4]
mov eax,dword ptr [edx+104]
mov dword ptr [ebp-536],eax
; [261] sort_wtime:= wait^[_i]^.wtime;
mov eax,dword ptr [ebp-544]
mov edx,dword ptr [eax]
mov eax,dword ptr [ebp-524]
mov edx,dword ptr [edx+eax*4-4]
mov eax,dword ptr [edx+68]
mov dword ptr [ebp-528],eax

My equivalent hand-optimized code:

Code: [Select]
mov   edx, [ebx * 4 + esi]
mov   sort_wptr, edx

mov   eax, [edx + offset lift_list.trip]
mov   sort_trip, eax

mov   eax, [edx + offset lift_list.s_cnty]
mov   sort_cnty, eax

mov   eax, [edx + offset lift_list.date.dyear]
mov   sort_year, eax

mov   eax, [edx + offset lift_list.wtime]
mov   sort_wtime, eax

Common sub-expression elimination? No...
Register variables? No...
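
At source level, the hand-optimized version corresponds to loading the list element once into a local pointer and reading every field through it. A minimal C sketch of that idea follows (C only because the thread already compares against it). The record layout, the field types and the names LiftItem, SortKey and load_sort_key are assumptions; only the field names trip, s_cnty, wtime and date.dyear come from the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record: field names follow the listing, types and layout are guesses. */
typedef struct {
    int32_t dyear;
} Date;

typedef struct {
    int32_t trip;
    int32_t s_cnty;
    int32_t wtime;
    Date    date;
} LiftItem;

/* The five sort_* temporaries from the listing, gathered in one struct. */
typedef struct {
    const LiftItem *wptr;
    int32_t trip;
    int32_t cnty;
    int32_t year;
    int32_t wtime;
} SortKey;

/* Reads element i (1-based, like the Pascal array) exactly once,
   then takes every field through the same local pointer. */
SortKey load_sort_key(const LiftItem *const *wait, int i)
{
    assert(wait != 0 && i >= 1);

    const LiftItem *p = wait[i - 1];  /* the shared subexpression wait^[_i] */
    SortKey k;
    k.wptr  = p;
    k.trip  = p->trip;
    k.cnty  = p->s_cnty;
    k.year  = p->date.dyear;
    k.wtime = p->wtime;
    return k;
}
```

Writing the Pascal the same way (one local pointer assigned from wait^[_i], then field reads through it) should hand the compiler the common subexpression explicitly instead of relying on CSE to find it.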

Other missed optimizations?

Using FISTTP for truncation? No...
Using FWAIT AD 2017 on a CPU that supports AVX? Ouch...

and code like this

Code: [Select]
; [715] inc(_minmax[0].max.km,   ltd_ptr^.dtv.km);
mov eax,dword ptr [dword ptr TC_$HHCOMMON_$$_LTD_PTR]
mov eax,dword ptr [eax+20]
add dword ptr [dword ptr U_$HHCOMMON_$$__MINMAX+8],eax
; [716] inc(_minmax[0].max.time, ltd_ptr^.dtv.time);
mov eax,dword ptr [dword ptr TC_$HHCOMMON_$$_LTD_PTR]
mov eax,dword ptr [eax+24]
add dword ptr [dword ptr U_$HHCOMMON_$$__MINMAX+12],eax
is screaming out for MMX/XMM conversion.

Sigh... or in other words, unless I'm doing something very wrong, I'm pretty disappointed in FPC's optimizing capabilities...


Thaddy

  • Hero Member
  • *****
  • Posts: 4651
FPC is just conservative in its default settings. You have to specify the optimizations you want, although -O4 turns on most, but not all, of them.
If you want AVX code, for example, you need to ask for it (-CfAVX or -CfAVX2). If you want vector code, ask for that as well (-Sv).
A general-purpose compiler usually optimizes for the worst-case target by default, because the output also has to run on the lowest-spec computer.

So back to testing: research the optimization options, compile with -s and examine the output. >:D
Btw: first check out the docs on CSE... that is working... So before you accuse, first read the ff'ing docs, like here https://www.freepascal.org/docs-html/prog/progsu58.html or here https://www.freepascal.org/docs-html/user/userap1.html or....
The rest can also be found in the docs.

Provided you specify and optimize for the right processor and FPU instruction set, FPC can do a very good job. Just not by default, for the reasons mentioned.
There is room for improvement, but half of what you want can already be done.
« Last Edit: July 02, 2017, 10:26:02 pm by Thaddy »

Akira1364

  • Sr. Member
  • ****
  • Posts: 327
Quote from prino: Very, very,very obviously my version of LIFT (in lift32bit.rar)

That file doesn't exist in the Dropbox folder.

ykot

  • Full Member
  • ***
  • Posts: 136
You might want to check the following discussion. For vector/matrix operations, unless you use -Sv and manually tune the math code, GCC/LLVM produces much more efficient output.

Also, for C++ comparisons, you might want to use an online compiler that includes GCC and LLVM: https://godbolt.org

By the way, Clang/LLVM produces a very nice vector sum:

Code: [Select]
struct Vector
{
  float x;
  float y;
  float z;

  Vector operator + (Vector const& vector) const;
};

Vector Vector::operator + (Vector const& vector) const
{
  return {x + vector.x, y + vector.y, z + vector.z};
}

The resulting compiled code is:
Code: [Select]
Vector::operator+(Vector const&) const:                     # @Vector::operator+(Vector const&) const
        movsd   xmm1, qword ptr [rdi]   # xmm1 = mem[0],zero
        movsd   xmm0, qword ptr [rsi]   # xmm0 = mem[0],zero
        addps   xmm0, xmm1
        movss   xmm1, dword ptr [rdi + 8] # xmm1 = mem[0],zero,zero,zero
        addss   xmm1, dword ptr [rsi + 8]
        ret

As you can see, it's quite different from what you posted initially.

Here's a link to try it yourself. So yes, it would be great to implement such compiler optimizations in FPC some day, perhaps with its LLVM backend?

Akira1364

  • Sr. Member
  • ****
  • Posts: 327
Quote from ykot: For vector/matrix operations, unless you use -Sv and manually tune the math code, GCC/LLVM produces much more efficient output.

I realized later that I actually did have -Sv set as well (I almost always do) and edited my initial post to reflect that. I still stand by my point that FPC is really "getting there" as far as optimization these days, though.  :)

 
