### Author Topic: A little ASM output for anyone doubting FPCs optimization capabilities...  (Read 3306 times)

#### Akira1364

• Hero Member
• Posts: 530
##### A little ASM output for anyone doubting FPCs optimization capabilities...
« on: July 01, 2017, 10:04:24 pm »
Just thought I'd post this here, as I've been very impressed with the trunk compiler's code optimizer recently.

The relevant Pascal code:

```pascal
TVertex = record
  X, Y, Z: Single;
end;

function VertexAdd(constref V1, V2: TVertex): TVertex;
begin
  with Result do
  begin
    X := V1.X + V2.X;
    Y := V1.Y + V2.Y;
    Z := V1.Z + V2.Z;
  end;
end;
```

The relevant C code (originally posted with the forum's JavaScript tags, because they have the closest syntax highlighter):

```c
struct TVertex
{
    float X, Y, Z;
};

struct TVertex VertexAdd(const struct TVertex V1, const struct TVertex V2)
{
    struct TVertex Result;
    Result.X = V1.X + V2.X;
    Result.Y = V1.Y + V2.Y;
    Result.Z = V1.Z + V2.Z;
    return Result;
}
```

Trunk FPC ASM output (compiled with -CfAVX2 -CpCOREAVX2 -g- -O4 -OpCOREAVX2 -Sv):

```asm
        .balign 16,0x90
.Lc1:
        movq    %rcx,%rax
        vmovss  (%rdx),%xmm0
        vaddss  (%r8),%xmm0,%xmm0
        vmovss  %xmm0,(%rax)
        vmovss  4(%rdx),%xmm0
        vaddss  4(%r8),%xmm0,%xmm0
        vmovss  %xmm0,4(%rax)
        vmovss  8(%rdx),%xmm0
        vaddss  8(%r8),%xmm0,%xmm0
        vmovss  %xmm0,8(%rax)
        ret
.Lc2:
```

GCC 7.1 ASM output (compiled with -g0 -march=haswell -mtune=haswell -mavx2 -Ofast -s):

```asm
        .def    VertexAdd;      .scl    2;      .type   32;     .endef
        .seh_endprologue
        vmovss  4(%rdx), %xmm0
        vmovss  (%rdx), %xmm2
        vaddss  4(%r8), %xmm0, %xmm1
        vaddss  (%r8), %xmm2, %xmm2
        vmovss  8(%rdx), %xmm0
        vaddss  8(%r8), %xmm0, %xmm0
        movq    %rcx, %rax
        vmovss  %xmm2, (%rcx)
        vmovss  %xmm1, 4(%rcx)
        vmovss  %xmm0, 8(%rcx)
        ret
        .seh_endproc
```

Essentially exactly the same code, just in a slightly different order, which demonstrates that FPC can and should be perfectly comparable with C/C++ when given input that is directly one-to-one-mappable with them. Very cool IMO!
« Last Edit: July 03, 2017, 12:39:45 am by Akira1364 »

#### ASerge

• Hero Member
• Posts: 1406
##### Re: A little ASM output for anyone doubting FPCs optimization capabilities...
« Reply #1 on: July 01, 2017, 11:30:03 pm »
> ...which demonstrates that FPC can and should be perfectly comparable with C/C++ when given directly one-to-one-mappable input. Very cool IMO!

There are still challenges for development.
Suppose there are only two execution pipelines. In the listings below, `1` marks an instruction that costs one execution cycle, and `+` marks an instruction that executes in parallel with the previous one.
FPC

```
1        movq    %rcx,%rax
+        vmovss  (%rdx),%xmm0
1        vaddss  (%r8),%xmm0,%xmm0
1        vmovss  %xmm0,(%rax)
1        vmovss  4(%rdx),%xmm0
1        vaddss  4(%r8),%xmm0,%xmm0
1        vmovss  %xmm0,4(%rax)
1        vmovss  8(%rdx),%xmm0
1        vaddss  8(%r8),%xmm0,%xmm0
1        vmovss  %xmm0,8(%rax)
```

Total: 9 cycles.

GCC

```
1        vmovss  4(%rdx), %xmm0
+        vmovss  (%rdx), %xmm2
1        vaddss  4(%r8), %xmm0, %xmm1
+        vaddss  (%r8), %xmm2, %xmm2
1        vmovss  8(%rdx), %xmm0
1        vaddss  8(%r8), %xmm0, %xmm0
1        movq    %rcx, %rax
+        vmovss  %xmm2, (%rcx)
1        vmovss  %xmm1, 4(%rcx)
+        vmovss  %xmm0, 8(%rcx)
```

Total: 6 cycles.
FPC uses fewer registers, of course, but is that so important?
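
The two totals can be reproduced mechanically. The C sketch below only illustrates the simplified two-pipeline model from this post. `count_cycles` is a hypothetical helper name, and the mark strings are transcribed from the two listings (`1` starts a new cycle, `+` runs alongside the previous instruction). It says nothing about real hardware timing.

```c
#include <assert.h>
#include <stddef.h>

/* Counts cycles under the simplified model used in this post:
   '1' means the instruction starts a new cycle,
   '+' means it shares the cycle of the previous instruction. */
static int count_cycles(const char *marks)
{
    int cycles = 0;
    size_t i;

    assert(marks != NULL);
    for (i = 0; marks[i] != '\0'; ++i) {
        if (marks[i] == '1') {
            ++cycles;           /* a new cycle begins here */
        }
        /* a '+' instruction is paired, so it adds no cycle */
    }
    return cycles;
}
```

Under this model, `count_cycles("1+11111111")` gives 9 for the FPC sequence and `count_cycles("1+1+111+1+")` gives 6 for the GCC sequence.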

#### marcov

• Global Moderator
• Hero Member
• Posts: 7438
##### Re: A little ASM output for anyone doubting FPCs optimization capabilities...
« Reply #2 on: July 01, 2017, 11:35:14 pm »
One can check cycle counts with the Intel IACA tool.

#### prino

• Newbie
• Posts: 2
##### Re: A little ASM output for anyone doubting FPCs optimization capabilities...
« Reply #3 on: July 02, 2017, 07:45:14 pm »
So I installed FPC again, for yet another try, as I'm getting a bit tired of DB coding MMX/SSE/AVX instructions in Virtual Pascal.

Very, very, very obviously, my version of LIFT (in lift32bit.rar), which is nearly fully recoded into inline assembler, does not compile. But after changing eight definitions of variables in hhcommon.pas from longint to word (for the DOS GetTime function), the "Pure Pascal" version compiles without problems.

However, it ABENDs (z/OS parlance for crash) in "readfile" with

RunTime Error 2130567168

Which is less than useful...

So let's just look a bit at the generated code...

My benchmark is a routine that does a Shellsort on an array of pointers, comparing two variables in the list items the pointers point to. The code generated for this with all optimizations enabled in the IDE makes me very sad, as it's hardly better than what Turbo Pascal used to generate more than three decades ago! Obviously I don't expect the code to come anywhere close to manually optimized code, but the generated code doesn't even try:

```asm
; [256] sort_wptr := wait^[_i];
 mov eax,dword ptr [ebp-544]
 mov edx,dword ptr [eax]
 mov eax,dword ptr [ebp-524]
 mov eax,dword ptr [edx+eax*4-4]
 mov dword ptr [ebp-540],eax
; [258] sort_trip := wait^[_i]^.trip;
 mov eax,dword ptr [ebp-544]
 mov edx,dword ptr [eax]
 mov eax,dword ptr [ebp-524]
 mov edx,dword ptr [edx+eax*4-4]
 mov eax,dword ptr [edx+8]
 mov dword ptr [ebp-532],eax
; [259] sort_cnty := wait^[_i]^.s_cnty;
 mov eax,dword ptr [ebp-544]
 mov ecx,dword ptr [eax]
 mov edx,dword ptr [ebp-524]
 mov eax,dword ptr [ecx+edx*4-4]
 mov eax,dword ptr [eax+64]
 mov dword ptr [ebp-4],eax
; [260] sort_year := wait^[_i]^.date.dyear;
 mov eax,dword ptr [ebp-544]
 mov edx,dword ptr [eax]
 mov eax,dword ptr [ebp-524]
 mov edx,dword ptr [edx+eax*4-4]
 mov eax,dword ptr [edx+104]
 mov dword ptr [ebp-536],eax
; [261] sort_wtime:= wait^[_i]^.wtime;
 mov eax,dword ptr [ebp-544]
 mov edx,dword ptr [eax]
 mov eax,dword ptr [ebp-524]
 mov edx,dword ptr [edx+eax*4-4]
 mov eax,dword ptr [edx+68]
 mov dword ptr [ebp-528],eax
```

My equivalent, hand-optimized code:

```asm
mov   edx, [ebx * 4 + esi]
mov   sort_wptr, edx
mov   eax, [edx + offset lift_list.trip]
mov   sort_trip, eax
mov   eax, [edx + offset lift_list.s_cnty]
mov   sort_cnty, eax
mov   eax, [edx + offset lift_list.date.dyear]
mov   sort_year, eax
mov   eax, [edx + offset lift_list.wtime]
mov   sort_wtime, eax
```
Common sub-expression elimination? No...
Register variables? No...
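
For readers unfamiliar with the term, this is roughly what common sub-expression elimination would do to those five statements at source level. It is sketched in C to match the C listing earlier in the thread. The record layout and the names `lift_item`, `sort_keys`, `load_keys_naive` and `load_keys_hoisted` are hypothetical stand-ins; only the field names come from the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the list record; only the four fields read
   in the listing are modelled, and the real layout is not known. */
typedef struct {
    int32_t trip;
    int32_t s_cnty;
    int32_t wtime;
    struct { int32_t dyear; } date;
} lift_item;

typedef struct {
    const lift_item *wptr;
    int32_t trip, cnty, year, wtime;
} sort_keys;

/* Naive form: each statement re-evaluates wait[i - 1] (1-based index,
   as the "-4" displacement in the generated code suggests). */
static sort_keys load_keys_naive(const lift_item *const *wait, int i)
{
    sort_keys k;
    assert(i >= 1);
    k.wptr  = wait[i - 1];
    k.trip  = wait[i - 1]->trip;
    k.cnty  = wait[i - 1]->s_cnty;
    k.year  = wait[i - 1]->date.dyear;
    k.wtime = wait[i - 1]->wtime;
    return k;
}

/* Hoisted form: the shared sub-expression wait[i - 1] is loaded once
   and reused, which is what the hand-written assembler does with edx. */
static sort_keys load_keys_hoisted(const lift_item *const *wait, int i)
{
    const lift_item *p;
    sort_keys k;
    assert(i >= 1);
    p = wait[i - 1];
    k.wptr  = p;
    k.trip  = p->trip;
    k.cnty  = p->s_cnty;
    k.year  = p->date.dyear;
    k.wtime = p->wtime;
    return k;
}
```

Both functions return the same values; they differ only in how often `wait[i - 1]` is evaluated in the source. Whether a given compiler performs this rewrite itself depends on its optimization settings.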

Other missed optimizations?

Using FISTTP for truncation? No...
Using FWAIT AD 2017 on a CPU that supports AVX? Ouch...

and code like this

```asm
; [715] inc(_minmax[0].max.km,   ltd_ptr^.dtv.km);
 mov eax,dword ptr [dword ptr TC_$HHCOMMON_$$_LTD_PTR]
 mov eax,dword ptr [eax+20]
 add dword ptr [dword ptr U_$HHCOMMON_$$__MINMAX+8],eax
; [716] inc(_minmax[0].max.time, ltd_ptr^.dtv.time);
 mov eax,dword ptr [dword ptr TC_$HHCOMMON_$$_LTD_PTR]
 mov eax,dword ptr [eax+24]
 add dword ptr [dword ptr U_$HHCOMMON_$$__MINMAX+12],eax
```

is screaming out for MMX/XMM conversion.

Sigh... or in other words, unless I'm doing something very wrong, I'm pretty disappointed in FPC's optimizing capabilities...

#### Thaddy

• Hero Member
• Posts: 8912
##### Re: A little ASM output for anyone doubting FPCs optimization capabilities...
« Reply #4 on: July 02, 2017, 10:11:35 pm »
FPC is just conservative in its default settings. You have to specify the optimizations you want, although -O4 turns on most of them, but not all.
If you want AVX code, for example, you need to ask for it, and if you want vectorized code, you need to ask for that too (-CfAVX or -CfAVX2, and -Sv, respectively).
A general-purpose compiler usually generates code for the worst-case target unless told otherwise, because the output needs to run on a worst-case-specced computer too.

So back to testing: research the optimization options, test with -s and examine the output.
Btw: first check out the docs on CSE... that is working... So before you accuse, first read the ff'ing docs, like here https://www.freepascal.org/docs-html/prog/progsu58.html or here https://www.freepascal.org/docs-html/user/userap1.html or....
The rest can also be found in the docs.

Provided you specify and optimize for the right processor and FPU instruction set, FPC can do a very good job. Just not by default, for the reasons mentioned.
There is room for improvement, but half of what you want can already be done.
« Last Edit: July 02, 2017, 10:26:02 pm by Thaddy »

#### Akira1364

• Hero Member
• Posts: 530
##### Re: A little ASM output for anyone doubting FPCs optimization capabilities...
« Reply #5 on: July 02, 2017, 10:46:47 pm »
> Very, very,very obviously my version of LIFT (in lift32bit.rar)

That file doesn't exist in the dropbox folder.

#### ykot

• Full Member
• Posts: 141
##### Re: A little ASM output for anyone doubting FPCs optimization capabilities...
« Reply #6 on: July 02, 2017, 11:02:59 pm »
You might want to check the following discussion. For vector/matrix operations, unless you use -Sv and manually tune the math code, GCC/LLVM produce much more efficient output.

Also, for C++ comparisons, you might want to use an online compiler that includes GCC and LLVM: https://godbolt.org

By the way, Clang/LLVM produces a very nice vector sum:

```cpp
struct Vector
{
  float x;
  float y;
  float z;

  Vector operator + (Vector const& vector) const;
};

Vector Vector::operator + (Vector const& vector) const
{
  return {x + vector.x, y + vector.y, z + vector.z};
}
```

Resulting compiled code is:

```asm
Vector::operator+(Vector const&) const:                     # @Vector::operator+(Vector const&) const
        movsd   xmm1, qword ptr [rdi]   # xmm1 = mem[0],zero
        movsd   xmm0, qword ptr [rsi]   # xmm0 = mem[0],zero
        addps   xmm0, xmm1
        movss   xmm1, dword ptr [rdi + 8] # xmm1 = mem[0],zero,zero,zero
        addss   xmm1, dword ptr [rsi + 8]
        ret
```
As you see, it's much different than what you've posted initially.
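
To make the difference concrete: the Clang code loads `x` and `y` as a single 64-bit chunk and adds them with one packed `addps`, then handles `z` with a scalar `addss`. A rough C sketch of the same idea with SSE intrinsics follows. The name `vector_add_packed` is made up for illustration, and unlike the Clang output (which returns the result in registers) this version stores the result to memory. On targets without SSE it falls back to plain scalar adds.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics, x86/x86-64 only */
#endif

typedef struct { float x, y, z; } Vector;

/* Adds two vectors the way the Clang listing does: x and y in one
   packed add, z in a separate scalar add. */
static Vector vector_add_packed(const Vector *a, const Vector *b)
{
    Vector r;
    assert(a != NULL && b != NULL);
#if defined(__SSE__)
    {
        /* Load x,y of each operand into the low 64 bits of a register;
           the upper two lanes are zero. The load has no alignment
           requirement. */
        __m128 a_xy = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)&a->x);
        __m128 b_xy = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)&b->x);
        __m128 xy   = _mm_add_ps(a_xy, b_xy);               /* addps */

        /* z is handled as a lone scalar. */
        __m128 z = _mm_add_ss(_mm_load_ss(&a->z),
                              _mm_load_ss(&b->z));          /* addss */

        _mm_storel_pi((__m64 *)&r.x, xy);  /* write x and y together */
        _mm_store_ss(&r.z, z);
    }
#else
    /* Scalar fallback for targets without SSE. */
    r.x = a->x + b->x;
    r.y = a->y + b->y;
    r.z = a->z + b->z;
#endif
    return r;
}
```

The packed path needs two adds instead of three, which is the saving visible in the Clang listing.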

Here's a clicky to try it yourself. So yes, it would be great to implement such compiler optimizations in FPC some day, perhaps with its LLVM backend?

#### Akira1364

• Hero Member
• Posts: 530
##### Re: A little ASM output for anyone doubting FPCs optimization capabilities...
« Reply #7 on: July 03, 2017, 12:43:22 am »
> For vector/matrix operations, unless you use -Sv and manually tune the math code, GCC/LLVM produces much more efficient output.

I realized later that I actually did have -Sv set as well (I almost always do) and edited my initial post to reflect that. I still stand by my point that FPC is really "getting there" as far as optimization these days, though.