
Author Topic: Compare code optimization of C and FreePascal !  (Read 17016 times)

ykot

  • Full Member
  • ***
  • Posts: 141
Re: Compare code optimization of C and FreePascal !
« Reply #30 on: March 22, 2017, 02:57:21 am »
Martin_fr, many thanks for the directions to look into. I'm definitely going to try this over the weekend and see if I can get the resulting code smaller.

Graeme

  • Hero Member
  • *****
  • Posts: 1430
    • Graeme on the web
Re: Compare code optimization of C and FreePascal !
« Reply #31 on: March 22, 2017, 10:24:56 am »
The equivalent C++ code produces around 53 instructions vs 224 from FPC (more than 4 times smaller and likely similarly faster):
From my experience (using FPC since 2005), FPC is a cross-platform compiler, and it is apparently relatively easy to extend to new platforms due to the way it is designed and written. It is also designed to be easy to maintain. With all that comes a trade-off, and that trade-off is performance. It is clear from many examples that FPC doesn't generate very well-performing code, but it does give you consistent cross-platform compilation support. Delphi, Kylix, GCC, Clang, etc. all run circles around FPC. This should not come as a surprise considering how small the development team is compared to GCC's or Clang's, and that Delphi and Kylix were designed with one CPU target in mind.

I'm not saying it is not possible for FPC to generate well optimised code - just don't hold your breath thinking it is coming soon. Other than that, enjoy the beautiful Object Pascal language and FPC's ability to support so many platforms and CPU targets. :)
--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8720
  • FPC developer.
Re: Compare code optimization of C and FreePascal !
« Reply #32 on: March 22, 2017, 10:55:16 am »

Aforementioned FreePascal assembly output has 16 multiplication instructions versus 8 multiplications in Clang output or only 4 multiplications in GCC output. If there would be a way to tweak the code to improve the assembly, it would be great.

There are 16 multiplications in the HLL code; that is the problem you give the compiler. So anything less has to come from the compiler reformulating the problem by vectorizing it, something that FPC doesn't do. This should be evident from the documentation, which doesn't mention that FPC vectorizes, while Clang and GCC are documented to do so.
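For comparison, here is an illustrative scalar C version of the transform under discussion (the names and layout are mine, not from the original source): all 16 multiplications are spelled out in the source exactly as the compiler receives them, so emitting fewer multiplies requires the compiler to reformulate the problem, i.e. vectorize.

```c
#include <assert.h>

/* Hypothetical scalar equivalent of the Pascal Transform routine:
   out[i] = sum over j of v[j] * m[j][i], 16 multiplies in total. */
typedef struct { float m[4][4]; } Matrix;

void transform(const float v[4], const Matrix *mat, float out[4])
{
    for (int i = 0; i < 4; i++)          /* 4 results...            */
        out[i] = v[0] * mat->m[0][i]     /* ...4 multiplies each    */
               + v[1] * mat->m[1][i]
               + v[2] * mat->m[2][i]
               + v[3] * mat->m[3][i];
}
```

A vectorizing compiler (GCC/Clang at -O2/-O3) may collapse these into packed multiplies; FPC emits them as written.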

The only hope of improvement is to rewrite the code to use SSE registers explicitly (unit mmx/sse). I know FPC has some functionality there, but I haven't played with it (since writing the asm directly is usually better, and it requires rewriting the routines using some vector syntax anyway).

Or do I misunderstand your question and you want to start working on adding autovectorizing in FPC?

Thaddy

  • Hero Member
  • *****
  • Posts: 10446
Re: Compare code optimization of C and FreePascal !
« Reply #33 on: March 22, 2017, 11:06:43 am »
My experience is that with the -Sv option and a suitable -Cf, -Cp and -Op setting, current FPC does try. At least on ARM and x86_64. OTOH it is still trial and error and needs examining the assembler output (-a) to see what is actually going on. Sometimes I am pleasantly surprised.
When you ask a question that is actually answered in the documentation, you are either lazy or a moron.

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8720
  • FPC developer.
Re: Compare code optimization of C and FreePascal !
« Reply #34 on: March 22, 2017, 02:15:40 pm »
I tried ykot's source on 64-bit with -Sv and haven't seen any vectorization. I also tried changing his vector record to an array (in case it only handles arrays).

It is possible of course that it only works for x op y where x and y are simple array types, not for matrix transformations like this.

(fpc -al -Sd -O4 -Sv -Cfavx2 -Cpcoreavx2 -Opcoreavx1 <example>.pp gives about 14 instructions per source line, so 60-65 instructions in total.)

Code:
# [12] Result[0] := V[0] * M.M[0, 0] + V[1] * M.M[1, 0] + V[2] * M.M[2, 0] + V[3] * M.M[3, 0];
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0

vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1

vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,(%rax)
« Last Edit: March 22, 2017, 02:20:39 pm by marcov »

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8720
  • FPC developer.
Re: Compare code optimization of C and FreePascal !
« Reply #35 on: March 22, 2017, 04:14:06 pm »
You can simply multiply vectors: 4 muls using a transposed matrix, but it is still not optimal since there is no sum(vector). I'll ask on core if I can get a list of the operators supported by -Sv.

Code: Pascal
//64-bit win64 cmdline:  fpc -al -Sd -O4 -Sv -Cfavx2 -Cpcoreavx2 -Opcoreavx2 bla.pp

type
  TVector = array[0..3] of single;

  TMatrix = record
    M: array[0..3] of TVector;
  end;

function Transform(const V: TVector; const M: TMatrix): TVector;
var
  r1: TVector;
begin
  r1 := v*m.m[0];
  Result[0] := r1[0]+r1[1]+r1[2]+r1[3];
  r1 := v*m.m[1];
  Result[1] := r1[0]+r1[1]+r1[2]+r1[3];
  r1 := v*m.m[2];
  Result[2] := r1[0]+r1[1]+r1[2]+r1[3];
  r1 := v*m.m[3];
  Result[3] := r1[0]+r1[1]+r1[2]+r1[3];
end;

var
  vx, vy: TVector;
  vm: TMatrix;
begin
  vy := Transform(vx, vm);
  writeln(vy[0]);
end.

ykot

  • Full Member
  • ***
  • Posts: 141
Re: Compare code optimization of C and FreePascal !
« Reply #36 on: March 22, 2017, 05:28:16 pm »
Yes, the -Sv option, this is what I was looking for; it is something I read about some time ago, thanks! Changing records to arrays is not a problem, as it can be done via typecasting anyway. Also, it would be interesting to see what code FPC produces for arm/aarch64 (at the moment I don't have these targets installed), as these might be more straightforward than x86.

Thaddy

  • Hero Member
  • *****
  • Posts: 10446
Re: Compare code optimization of C and FreePascal !
« Reply #37 on: March 22, 2017, 05:43:10 pm »
I tested only armhf with vfpv4 and the Intel families. Note my remarks: it is not really mature, but sometimes I am surprised. I hope Marco's question can shed some light on what is actually supported.

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8720
  • FPC developer.
Re: Compare code optimization of C and FreePascal !
« Reply #38 on: March 23, 2017, 03:27:38 pm »
I have no answer yet, but in the source of the compiler (nadd.pas) I find:

Code: Pascal
else if (cs_support_vectors in current_settings.globalswitches) and
        is_vector(ld) and
        is_vector(rd) and
        equal_defs(ld,rd) then
  begin
    if not(nodetype in [addn,subn,xorn,orn,andn,muln,slashn]) then
      CGMessage3(type_e_operator_not_supported_for_types,node2opstr(nodetype),ld.typename,rd.typename);
    { both defs must be equal, so taking left or right as resultdef doesn't matter }
    resultdef:=left.resultdef;
  end

leads me to believe that it is + - * / xor/or/and (but not not?)

Btw, maybe the code I posted yesterday can be rearranged so that one could use something like:
Code: Pascal
r1 := v*m.m[0];
r2 := v*m.m[1];
r3 := v*m.m[2];
r4 := v*m.m[3];
r1 := r1+r2;
r3 := r3+r4;
r1 := r1+r3;

Note that this is indicative only; it could be that it requires transposing the incoming matrix.
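The transposition caveat can be checked concretely. A small C sketch (my own names, not from the thread's sources): the column-indexed transform from earlier in the thread and the row-wise "v * m[i]" formulation agree exactly when the row-wise version is fed the transposed matrix.

```c
#include <assert.h>

typedef struct { float m[4][4]; } Matrix;

/* out[i] = sum over j of v[j] * m[j][i]  (column-indexed, as in the original) */
void transform_cols(const float v[4], const Matrix *m, float out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = v[0]*m->m[0][i] + v[1]*m->m[1][i]
               + v[2]*m->m[2][i] + v[3]*m->m[3][i];
}

/* out[i] = dot(v, row i)  (what the elementwise vector multiply + sum computes) */
void transform_rows(const float v[4], const Matrix *m, float out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = v[0]*m->m[i][0] + v[1]*m->m[i][1]
               + v[2]*m->m[i][2] + v[3]*m->m[i][3];
}

Matrix transpose(const Matrix *m)
{
    Matrix t;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            t.m[i][j] = m->m[j][i];
    return t;
}
```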

Thaddy

  • Hero Member
  • *****
  • Posts: 10446
Re: Compare code optimization of C and FreePascal !
« Reply #39 on: March 23, 2017, 04:09:52 pm »
Good news for ykot is that nadd is part of the hlcg and therefore it is likely to work on his platforms too  8-)
Thanks for the pointer, Marco.

PascalDragon

  • Hero Member
  • *****
  • Posts: 2104
  • Compiler Developer
Re: Compare code optimization of C and FreePascal !
« Reply #40 on: March 25, 2017, 02:37:13 pm »
It's been a year, but returning to this original topic, I'm trying some different compiler options.

I'm compiling 4x4 matrix multiplication code with FreePascal 3.1.1 from the trunk, compiler flags (-a, -O4, -CpCoreI, -CfSSE42, -OpCoreI, -OoFASTMATH). For the following code:
Code: Pascal
type
  TMatrix = record
    M: array[0..3, 0..3] of Single;

    class operator Multiply(const A, B: TMatrix): TMatrix;
  end;

class operator TMatrix.Multiply(const A, B: TMatrix): TMatrix;
var
  I, J: Integer;
begin
  for J := 0 to 3 do
    for I := 0 to 3 do
      Result.M[J, I] := (A.M[J, 0] * B.M[0, I]) + (A.M[J, 1] * B.M[1, I]) +
        (A.M[J, 2] * B.M[2, I]) + (A.M[J, 3] * B.M[3, I]);
end;

The resulting assembly is:
Code:
.Lc1:
.seh_proc MAINFM$_$TMATRIX_$__$$_star$TMATRIX$TMATRIX$$TMATRIX
pushq %rbx
.seh_pushreg %rbx
.seh_endprologue
movl $-1,%ebx
.balign 8,0x90
.Lj5:
addl $1,%ebx
movl $-1,%r10d
.balign 8,0x90
.Lj8:
addl $1,%r10d
movq %r8,%rax
movl %r10d,%r9d
movl %ebx,%r11d
shlq $4,%r11
leaq (%rdx,%r11),%r11
movss (%r11),%xmm1
mulss (%rax,%r9,4),%xmm1
movl %r10d,%r9d
movss 4(%r11),%xmm0
mulss 16(%rax,%r9,4),%xmm0
addss %xmm1,%xmm0
movl %r10d,%r9d
movss 8(%r11),%xmm1
mulss 32(%rax,%r9,4),%xmm1
addss %xmm0,%xmm1
movl %r10d,%r9d
movss 12(%r11),%xmm0
mulss 48(%rax,%r9,4),%xmm0
addss %xmm1,%xmm0
movl %ebx,%eax
shlq $4,%rax
movl %r10d,%r9d
leaq (%rcx,%rax),%rax
movss %xmm0,(%rax,%r9,4)
cmpl $3,%r10d
jnge .Lj8
cmpl $3,%ebx
jnge .Lj5
popq %rbx
ret
.seh_endproc
.Lc2:
It appears that in the aforementioned code the loops were not unrolled, so the assembly has two loops and elements are multiplied one at a time. I'm quite unhappy with that code.

Please note that the compiler does not unroll loops by default (except on MIPS with -O3 it seems, though I don't know why); you need to enable it manually with -OoLoopUnroll. But even then, the contents of your loop are considered too complex: only rather small loop bodies will be unrolled (though one might argue that the compiler currently considers this loop body more complex than it probably should).
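As a generic illustration of what loop unrolling does (a C sketch with my own names, not FPC's exact transformation): the inner reduction of the 4x4 multiply written once as a loop and once unrolled by hand. Both compute the same result; unrolling removes the loop counter and branch overhead and exposes independent operations to the instruction scheduler.

```c
#include <assert.h>

typedef struct { float m[4][4]; } Matrix;

/* Inner reduction kept as a loop: counter, compare and branch per step. */
Matrix mul_loop(const Matrix *a, const Matrix *b)
{
    Matrix r;
    for (int j = 0; j < 4; j++)
        for (int i = 0; i < 4; i++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += a->m[j][k] * b->m[k][i];
            r.m[j][i] = s;
        }
    return r;
}

/* Same reduction fully unrolled: straight-line multiply-adds. */
Matrix mul_unrolled(const Matrix *a, const Matrix *b)
{
    Matrix r;
    for (int j = 0; j < 4; j++)
        for (int i = 0; i < 4; i++)
            r.m[j][i] = a->m[j][0]*b->m[0][i] + a->m[j][1]*b->m[1][i]
                      + a->m[j][2]*b->m[2][i] + a->m[j][3]*b->m[3][i];
    return r;
}
```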

ykot

  • Full Member
  • ***
  • Posts: 141
Re: Compare code optimization of C and FreePascal !
« Reply #41 on: March 26, 2017, 01:57:23 am »
Please note that the compiler does not unroll loops by default (except on MIPS with -O3 it seems, though I don't know why); you need to enable it manually with -OoLoopUnroll. But even then, the contents of your loop are considered too complex: only rather small loop bodies will be unrolled (though one might argue that the compiler currently considers this loop body more complex than it probably should).

-OoLoopUnroll does seem to do the job just fine, thanks for the tip!

Btw, maybe the code I posted yesterday can be rearranged so that one could use something like:
Code: Pascal
r1 := v*m.m[0];
r2 := v*m.m[1];
r3 := v*m.m[2];
r4 := v*m.m[3];
r1 := r1+r2;
r3 := r3+r4;
r1 := r1+r3;

In the library the matrices are typically stored in row-major order, though column-major is common in legacy OpenGL; in either case, using transposed matrices is not a big issue.
The resulting assembly, however, looks amazing:

Code:
movq %rcx,%rax
movdqa (%rdx),%xmm0
mulps (%r8),%xmm0
movdqa (%rdx),%xmm0
mulps 16(%r8),%xmm0
movdqa (%rdx),%xmm0
mulps 32(%r8),%xmm0
movdqa (%rdx),%xmm0
mulps 48(%r8),%xmm0
movdqa (%rsp),%xmm0
addps 16(%rsp),%xmm0
movdqa 32(%rsp),%xmm0
addps 48(%rsp),%xmm0
movdqa (%rsp),%xmm0
addps 32(%rsp),%xmm0
vmovups (%rsp),%xmm0
vmovups %xmm0,(%rax)
leaq 72(%rsp),%rsp
ret

I have yet to check if the actual math is right, but it looks great, many thanks!

By the way, I'm not sure if "-Sv" (with the appropriate example that multiplies 4-sized arrays as if they were vectors) and "-OoLoopUnroll" are documented somewhere, because they seem to be very useful options.

Akira1364

  • Hero Member
  • *****
  • Posts: 542
Re: Compare code optimization of C and FreePascal !
« Reply #42 on: March 26, 2017, 03:24:58 am »
Personally I always use the following set of flags in the "custom options" menu of Lazarus whenever I want what I call a super-release build:
-CfAVX2 -CpCOREAVX2 -g- -OpCOREAVX2 -Si -Sv
I also usually set -O4, -CX and -XX in the "Compilation And Linking" menu, and -Xs in the "Debugging" menu.

The resulting assembly code I typically get from all of that (when I bother to look at it, which isn't that often) is not too shabby, honestly. The compiler is pretty good at knowing when to use the AVX/AVX2 vector versions of various math instructions (i.e. VMULSS instead of MULSS, etc.). Also, I find that passing as many record parameters as possible as "constref" (not just "const", as that doesn't always work) helps knock down the number of lines of generated asm quite a bit as well.
« Last Edit: April 09, 2017, 08:42:06 pm by Akira1364 »

 
