A little ASM output for anyone doubting FPCs optimization capabilities...

Akira1364

Hero Member
Posts: 561

« on: July 01, 2017, 10:04:24 pm »

Just thought I'd post this here, as I've been very impressed with the trunk compiler's code optimizer recently.

The relevant Pascal code:

Code: Pascal [Select][+]

TVertex = record
  X, Y, Z: Single;
end;
 
function VertexAdd(constref V1, V2: TVertex): TVertex;
begin
  with Result do
  begin
    X := V1.X + V2.X;
    Y := V1.Y + V2.Y;
    Z := V1.Z + V2.Z;
  end;
end;

The relevant C code (using the JavaScript tags because they have the closest syntax highlighter):

Code: Javascript [Select][+]

struct TVertex
{
    float X, Y, Z;
};
 
struct TVertex VertexAdd(const struct TVertex V1, const struct TVertex V2)
{
    struct TVertex Result;
    Result.X = V1.X + V2.X;
    Result.Y = V1.Y + V2.Y;
    Result.Z = V1.Z + V2.Z;
    return Result;
}

Trunk FPC ASM output (compiled with -CfAVX2 -CpCOREAVX2 -g- -O4 -OpCOREAVX2 -Sv):

Code: Pascal [Select][+]

.section .text.n_unit1_$$_vertexadd$tvertex$tvertex$$tvertex,"x"
        .balign 16,0x90
.globl  UNIT1_$$_VERTEXADD$TVERTEX$TVERTEX$$TVERTEX
UNIT1_$$_VERTEXADD$TVERTEX$TVERTEX$$TVERTEX:
.Lc1:
        movq    %rcx,%rax
        vmovss  (%rdx),%xmm0
        vaddss  (%r8),%xmm0,%xmm0
        vmovss  %xmm0,(%rax)
        vmovss  4(%rdx),%xmm0
        vaddss  4(%r8),%xmm0,%xmm0
        vmovss  %xmm0,4(%rax)
        vmovss  8(%rdx),%xmm0
        vaddss  8(%r8),%xmm0,%xmm0
        vmovss  %xmm0,8(%rax)
        ret
.Lc2:

GCC 7.1 ASM output (compiled with -g0 -march=haswell -mtune=haswell -mavx2 -Ofast -s):

Code: Pascal [Select][+]

        .globl  VertexAdd
        .def    VertexAdd;      .scl    2;      .type   32;     .endef
        .seh_proc       VertexAdd
VertexAdd:
        .seh_endprologue
        vmovss  4(%rdx), %xmm0
        vmovss  (%rdx), %xmm2
        vaddss  4(%r8), %xmm0, %xmm1
        vaddss  (%r8), %xmm2, %xmm2
        vmovss  8(%rdx), %xmm0
        vaddss  8(%r8), %xmm0, %xmm0
        movq    %rcx, %rax
        vmovss  %xmm2, (%rcx)
        vmovss  %xmm1, 4(%rcx)
        vmovss  %xmm0, 8(%rcx)
        ret
        .seh_endproc

Essentially exactly the same code, just in a slightly different order, which demonstrates that FPC can and should be perfectly comparable with C/C++ when given input that is directly one-to-one-mappable with them. Very cool IMO!
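
One detail worth spelling out: the C version takes its arguments by value while the Pascal version uses constref, yet both listings read the inputs through pointers in %rdx and %r8 and write the result through %rcx. On 64-bit Windows that is expected, since the calling convention passes a 12-byte struct by reference to a caller-made copy and returns it through a hidden pointer. A C signature that mirrors constref more literally might look like the sketch below. VertexAddRef is a made-up name for illustration only, not part of the code that was compiled above.

```c
struct TVertex
{
    float X, Y, Z;
};

/* Hypothetical by-reference variant: the const pointers play the role
   of Pascal's constref parameters. */
struct TVertex VertexAddRef(const struct TVertex *V1, const struct TVertex *V2)
{
    struct TVertex Result;
    Result.X = V1->X + V2->X;
    Result.Y = V1->Y + V2->Y;
    Result.Z = V1->Z + V2->Z;
    return Result;
}
```

Called as VertexAddRef(&A, &B), this should compute the same sums as the by-value version.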

« Last Edit: July 03, 2017, 12:39:45 am by Akira1364 »

ASerge

Hero Member
Posts: 2246

Re: A little ASM output for anyone doubting FPCs optimization capabilities...

« Reply #1 on: July 01, 2017, 11:30:03 pm »

Quote from: Akira1364 on July 01, 2017, 10:04:24 pm

...which demonstrates that FPC can and should be perfectly comparable with C/C++ when given directly one-to-one-mappable input. Very cool IMO!

There are still challenges for development.
Suppose the CPU has only two execution pipelines. In the listings below, 1 marks an instruction that starts a new execution cycle, and + marks one that executes in parallel with the previous instruction.
FPC

Code: [Select]

1        movq    %rcx,%rax
+        vmovss  (%rdx),%xmm0
1        vaddss  (%r8),%xmm0,%xmm0
1        vmovss  %xmm0,(%rax)
1        vmovss  4(%rdx),%xmm0
1        vaddss  4(%r8),%xmm0,%xmm0
1        vmovss  %xmm0,4(%rax)
1        vmovss  8(%rdx),%xmm0
1        vaddss  8(%r8),%xmm0,%xmm0
1        vmovss  %xmm0,8(%rax)

Total: 9 cycles.

GCC

Code: [Select]

1        vmovss  4(%rdx), %xmm0
+        vmovss  (%rdx), %xmm2
1        vaddss  4(%r8), %xmm0, %xmm1
+        vaddss  (%r8), %xmm2, %xmm2
1        vmovss  8(%rdx), %xmm0
1        vaddss  8(%r8), %xmm0, %xmm0
1        movq    %rcx, %rax
+        vmovss  %xmm2, (%rcx)
1        vmovss  %xmm1, 4(%rcx)
+        vmovss  %xmm0, 8(%rcx)

Total: 6 cycles.
FPC uses fewer registers, of course, but is that so important?
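
The counting above can be reproduced with a toy model. The sketch below is only that: it assumes an in-order CPU that issues at most two neighbouring instructions per cycle, a latency of one cycle for everything, and no memory aliasing, none of which describes a real out-of-order core. The register read/write sets are transcribed by hand from the two listings, and all names (Insn, conflicts, count_cycles, FPC_LISTING, GCC_LISTING) are invented for this example.

```c
#include <stddef.h>

/* One bit per register that appears in the two listings. */
enum { RAX = 1, RCX = 2, RDX = 4, R8 = 8, XMM0 = 16, XMM1 = 32, XMM2 = 64 };

struct Insn
{
    unsigned reads;   /* registers the instruction reads  */
    unsigned writes;  /* registers the instruction writes */
};

/* FPC listing, in program order. */
const struct Insn FPC_LISTING[10] = {
    { RCX,        RAX  },  /* movq   %rcx,%rax           */
    { RDX,        XMM0 },  /* vmovss (%rdx),%xmm0        */
    { R8 | XMM0,  XMM0 },  /* vaddss (%r8),%xmm0,%xmm0   */
    { XMM0 | RAX, 0    },  /* vmovss %xmm0,(%rax)        */
    { RDX,        XMM0 },  /* vmovss 4(%rdx),%xmm0       */
    { R8 | XMM0,  XMM0 },  /* vaddss 4(%r8),%xmm0,%xmm0  */
    { XMM0 | RAX, 0    },  /* vmovss %xmm0,4(%rax)       */
    { RDX,        XMM0 },  /* vmovss 8(%rdx),%xmm0       */
    { R8 | XMM0,  XMM0 },  /* vaddss 8(%r8),%xmm0,%xmm0  */
    { XMM0 | RAX, 0    },  /* vmovss %xmm0,8(%rax)       */
};

/* GCC listing, in program order. */
const struct Insn GCC_LISTING[10] = {
    { RDX,        XMM0 },  /* vmovss 4(%rdx), %xmm0        */
    { RDX,        XMM2 },  /* vmovss (%rdx), %xmm2         */
    { R8 | XMM0,  XMM1 },  /* vaddss 4(%r8), %xmm0, %xmm1  */
    { R8 | XMM2,  XMM2 },  /* vaddss (%r8), %xmm2, %xmm2   */
    { RDX,        XMM0 },  /* vmovss 8(%rdx), %xmm0        */
    { R8 | XMM0,  XMM0 },  /* vaddss 8(%r8), %xmm0, %xmm0  */
    { RCX,        RAX  },  /* movq   %rcx, %rax            */
    { XMM2 | RCX, 0    },  /* vmovss %xmm2, (%rcx)         */
    { XMM1 | RCX, 0    },  /* vmovss %xmm1, 4(%rcx)        */
    { XMM0 | RCX, 0    },  /* vmovss %xmm0, 8(%rcx)        */
};

/* Two neighbouring instructions may share a cycle only if neither one
   writes a register that the other reads or writes. */
int conflicts(const struct Insn *a, const struct Insn *b)
{
    return (a->writes & (b->reads | b->writes)) != 0
        || (a->reads & b->writes) != 0;
}

/* Greedy in-order pairing: each cycle issues one instruction, plus the
   next one if the two do not conflict. */
size_t count_cycles(const struct Insn *code, size_t n)
{
    size_t cycles = 0, i = 0;
    while (i < n) {
        cycles++;
        if (i + 1 < n && !conflicts(&code[i], &code[i + 1]))
            i += 2;
        else
            i += 1;
    }
    return cycles;
}
```

Under these assumptions count_cycles(FPC_LISTING, 10) gives 9 and count_cycles(GCC_LISTING, 10) gives 6. The greedy pairing picks slightly different pairs for the GCC listing than the hand annotation does (it pairs the last vaddss with the movq), but the totals come out the same. The FPC code loses out in this model purely because every step reuses %xmm0.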

marcov

Administrator
Hero Member
Posts: 11455
FPC developer.

Re: A little ASM output for anyone doubting FPCs optimization capabilities...

« Reply #2 on: July 01, 2017, 11:35:14 pm »

One can check cycle counts with the Intel IACA tool.

prino

New Member
Posts: 14

Re: A little ASM output for anyone doubting FPCs optimization capabilities...

« Reply #3 on: July 02, 2017, 07:45:14 pm »

So I installed FPC again, for yet another try, as I'm getting a bit tired of DB coding MMX/SSE/AVX instructions in Virtual Pascal.

Very, very, very obviously my version of LIFT (in lift32bit.rar) that's nearly fully recoded into inline assembler does not compile, but after changing eight definitions of variables in hhcommon.pas from longint to word (for the DOS GetTime function), the "Pure Pascal" version compiles without problems.

However, it ABENDs (z/OS parlance for crash) in "readfile" with

RunTime Error 2130567168
Error address $00000000

Which is less than useful...

So let's just look a bit at the generated code...

My benchmark is a routine that does a Shellsort on an array of pointers, comparing two variables in the list items the pointers point to. The code generated for this with all optimizations enabled in the IDE makes me very sad, as it's hardly better than what Turbo Pascal used to generate more than three decades ago! Obviously I don't expect the code to come anywhere close to manually optimized code, but the generated code doesn't even try:

Code: [Select]

; [256] sort_wptr := wait^[_i];
		mov	eax,dword ptr [ebp-544]
		mov	edx,dword ptr [eax]
		mov	eax,dword ptr [ebp-524]
		mov	eax,dword ptr [edx+eax*4-4]
		mov	dword ptr [ebp-540],eax
; [258] sort_trip := wait^[_i]^.trip;
		mov	eax,dword ptr [ebp-544]
		mov	edx,dword ptr [eax]
		mov	eax,dword ptr [ebp-524]
		mov	edx,dword ptr [edx+eax*4-4]
		mov	eax,dword ptr [edx+8]
		mov	dword ptr [ebp-532],eax
; [259] sort_cnty := wait^[_i]^.s_cnty;
		mov	eax,dword ptr [ebp-544]
		mov	ecx,dword ptr [eax]
		mov	edx,dword ptr [ebp-524]
		mov	eax,dword ptr [ecx+edx*4-4]
		mov	eax,dword ptr [eax+64]
		mov	dword ptr [ebp-4],eax
; [260] sort_year := wait^[_i]^.date.dyear;
		mov	eax,dword ptr [ebp-544]
		mov	edx,dword ptr [eax]
		mov	eax,dword ptr [ebp-524]
		mov	edx,dword ptr [edx+eax*4-4]
		mov	eax,dword ptr [edx+104]
		mov	dword ptr [ebp-536],eax
; [261] sort_wtime:= wait^[_i]^.wtime;
		mov	eax,dword ptr [ebp-544]
		mov	edx,dword ptr [eax]
		mov	eax,dword ptr [ebp-524]
		mov	edx,dword ptr [edx+eax*4-4]
		mov	eax,dword ptr [edx+68]
		mov	dword ptr [ebp-528],eax

My equivalent, hand-optimized code:

Code: [Select]

mov   edx, [ebx * 4 + esi]
mov   sort_wptr, edx

mov   eax, [edx + offset lift_list.trip]
mov   sort_trip, eax

mov   eax, [edx + offset lift_list.s_cnty]
mov   sort_cnty, eax

mov   eax, [edx + offset lift_list.date.dyear]
mov   sort_year, eax

mov   eax, [edx + offset lift_list.wtime]
mov   sort_wtime, eax

Common sub-expression elimination? No...
Register variables? No...
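
For readers who don't read x86 assembler, the difference between the two listings can be sketched at source level. The C below is only an illustration: the field names come from the listings, but the field types, the struct layout, the sort_keys struct, and both function names are assumptions made for this sketch, and the index is zero-based rather than one-based as in the Pascal code.

```c
#include <stddef.h>
#include <stdint.h>

/* Field names follow the listings; types and layout are assumed. */
struct lift_date { int32_t dyear; };

struct lift_list
{
    int32_t trip;
    int32_t s_cnty;
    int32_t wtime;
    struct lift_date date;
};

/* Hypothetical container for the sort_* variables. */
struct sort_keys
{
    struct lift_list *wptr;
    int32_t trip, cnty, year, wtime;
};

/* Written the way the compiler-generated listing behaves:
   every statement evaluates wait[i] again. */
void load_keys_naive(struct lift_list **wait, size_t i, struct sort_keys *k)
{
    k->wptr  = wait[i];
    k->trip  = wait[i]->trip;
    k->cnty  = wait[i]->s_cnty;
    k->year  = wait[i]->date.dyear;
    k->wtime = wait[i]->wtime;
}

/* Written the way the hand-optimized listing behaves:
   wait[i] is loaded once and the pointer is reused. */
void load_keys_hoisted(struct lift_list **wait, size_t i, struct sort_keys *k)
{
    struct lift_list *p = wait[i];
    k->wptr  = p;
    k->trip  = p->trip;
    k->cnty  = p->s_cnty;
    k->year  = p->date.dyear;
    k->wtime = p->wtime;
}
```

Both functions store the same values. The second one simply does by hand what common sub-expression elimination would be expected to do with the first.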

Other missed optimizations?

Using FISTTP for truncation? No...
Using FWAIT AD 2017 on a CPU that supports AVX? Ouch...

and code like this

Code: [Select]

; [715] inc(_minmax[0].max.km,   ltd_ptr^.dtv.km);
		mov	eax,dword ptr [dword ptr TC_$HHCOMMON_$$_LTD_PTR]
		mov	eax,dword ptr [eax+20]
		add	dword ptr [dword ptr U_$HHCOMMON_$$__MINMAX+8],eax
; [716] inc(_minmax[0].max.time, ltd_ptr^.dtv.time);
		mov	eax,dword ptr [dword ptr TC_$HHCOMMON_$$_LTD_PTR]
		mov	eax,dword ptr [eax+24]
		add	dword ptr [dword ptr U_$HHCOMMON_$$__MINMAX+12],eax

is screaming out for MMX/XMM conversion.
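
To make that concrete: the km and time fields sit next to each other on both sides (offsets 20/24 in the source record, 8/12 in the destination), so the two 32-bit additions could be done as one packed add on a 64-bit pair. A rough sketch in C with SSE2 intrinsics follows. The km_time struct and both function names are invented for this example, and it assumes an x86 target with SSE2.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical stand-in for the adjacent km/time pair. */
struct km_time
{
    int32_t km;
    int32_t time;
};

/* What the generated code does: two separate 32-bit additions. */
void accumulate_scalar(struct km_time *max, const struct km_time *dtv)
{
    max->km   += dtv->km;
    max->time += dtv->time;
}

/* One packed addition: load both fields as a 64-bit pair, add the two
   32-bit lanes at once, and store the pair back. */
void accumulate_sse2(struct km_time *max, const struct km_time *dtv)
{
    __m128i acc = _mm_loadl_epi64((const __m128i *)max);
    __m128i add = _mm_loadl_epi64((const __m128i *)dtv);
    acc = _mm_add_epi32(acc, add);
    _mm_storel_epi64((__m128i *)max, acc);
}
```

Whether this is actually faster for a single pair of additions would need measuring, as the loads and the store still dominate.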

Sigh... or in other words, unless I'm doing something very wrong, I'm pretty disappointed in FPC's optimizing capabilities...

Thaddy

Hero Member
Posts: 14377
Sensorship about opinions does not belong here.

Re: A little ASM output for anyone doubting FPCs optimization capabilities...

« Reply #4 on: July 02, 2017, 10:11:35 pm »

FPC is just conservative in its default settings. You have to specify the optimizations you want, although -O4 will turn on most, but not all, of them.
If you want AVX code, for example, you need to ask for it; if you want vectorized code, you need to ask for that too (-CfAVX or -CfAVX2, and -Sv, respectively).
By default, a general-purpose compiler optimizes for the worst-case target, because the output has to run on the lowest-specced computer too.

So, back to testing: research the optimization options, test with -s, and examine the output.

Btw: first check out the docs on CSE... that is working... So before you accuse, first read the ff'ing docs, like here https://www.freepascal.org/docs-html/prog/progsu58.html or here https://www.freepascal.org/docs-html/user/userap1.html or....
The rest can also be found in the docs.

Provided you specify and optimize for the right processor and FPU instruction set, FPC can do a very good job. Just not by default, for the reasons mentioned.
There is room for improvement, but half of what you want can be done, and is done, already.

« Last Edit: July 02, 2017, 10:26:02 pm by Thaddy »

Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

Akira1364

Hero Member
Posts: 561

Re: A little ASM output for anyone doubting FPCs optimization capabilities...

« Reply #5 on: July 02, 2017, 10:46:47 pm »

Quote from: prino on July 02, 2017, 07:45:14 pm

Very, very, very obviously my version of LIFT (in lift32bit.rar)

That file doesn't exist in the dropbox folder.

ykot

Full Member
Posts: 141

Re: A little ASM output for anyone doubting FPCs optimization capabilities...

« Reply #6 on: July 02, 2017, 11:02:59 pm »

You might want to check the following discussion. For vector/matrix operations, unless you use -Sv and manually tune the math code, GCC/LLVM produces much more efficient output.

Also, for C++ comparisons, you might want to use an online compiler that includes GCC and LLVM: https://godbolt.org

By the way, Clang/LLVM produces a very nice vector sum:

Code: [Select]

struct Vector
{
  float x;
  float y;
  float z;

  Vector operator + (Vector const& vector) const;
};

Vector Vector::operator + (Vector const& vector) const
{
  return {x + vector.x, y + vector.y, z + vector.z};
}

Resulting compiled code is:

Code: [Select]

Vector::operator+(Vector const&) const:                     # @Vector::operator+(Vector const&) const
        movsd   xmm1, qword ptr [rdi]   # xmm1 = mem[0],zero
        movsd   xmm0, qword ptr [rsi]   # xmm0 = mem[0],zero
        addps   xmm0, xmm1
        movss   xmm1, dword ptr [rdi + 8] # xmm1 = mem[0],zero,zero,zero
        addss   xmm1, dword ptr [rsi + 8]
        ret

As you can see, it's quite different from what you posted initially.

Here's a clicky to try it yourself. So yes, it would be great to implement such compiler optimizations in FPC some day, perhaps with its LLVM backend?
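
For anyone curious what that Clang output amounts to, here is a rough hand-written equivalent in C with SSE intrinsics: x and y are loaded as one 64-bit pair and added with a single packed add, and z is added separately as a scalar. This is only a sketch of the idea, assuming an x86 target. The function name vector_add_packed is invented, and it is not claimed to compile to exactly the listing above.

```c
#include <xmmintrin.h>

struct Vector
{
    float x;
    float y;
    float z;
};

struct Vector vector_add_packed(const struct Vector *a, const struct Vector *b)
{
    struct Vector r;

    /* Load x and y of each operand into the low two lanes;
       the upper two lanes stay zero. */
    __m128 axy = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)a);
    __m128 bxy = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)b);

    /* One packed add covers x and y together, then the low
       64 bits are written to r.x and r.y. */
    _mm_storel_pi((__m64 *)&r, _mm_add_ps(axy, bxy));

    /* z is added on its own as a scalar. */
    _mm_store_ss(&r.z, _mm_add_ss(_mm_load_ss(&a->z), _mm_load_ss(&b->z)));

    return r;
}
```

The point is that two of the three additions share one instruction, which is what neither the FPC nor the GCC listing in the first post does.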

Akira1364

Hero Member
Posts: 561

Re: A little ASM output for anyone doubting FPCs optimization capabilities...

« Reply #7 on: July 03, 2017, 12:43:22 am »

Quote from: ykot on July 02, 2017, 11:02:59 pm

For vector/matrix operations, unless you use -Sv and manually tune the math code, GCC/LLVM produces much more efficient output.

I realized later that I actually did have -Sv set as well (I almost always do) and edited my initial post to reflect that. I still stand by my point that FPC is really "getting there" as far as optimization these days, though.
