Author Topic: Compare code optimization of C and FreePascal !  (Read 15267 times)

rtusrghsdfhsfdhsdfhsfdhs

Re: Compare code optimization of C and FreePascal !
« Reply #15 on: May 16, 2016, 01:43:39 pm »
Quote
If you want C performance and the VCL library, use C++ Builder.

Quote
That's bad advice. You mean any C/C++ compiler other than C++ Builder when performance is an issue: MS Visual Studio C/C++ (free), the C/C++ parts of the GNU compiler suite (free), you name them...

C++ Builder now uses Clang for both Win32 and Win64. 8-)

marcov

« Reply #16 on: May 16, 2016, 01:45:35 pm »
Quote
Based on your posts, I think we have two solutions to speed up applications.
1. Add a part to fpc that compiles code to C instead of assembly, to use the power of the gcc compiler.

That would not be a bad thing to have, actually. Not that I think it would work well for everything, but it would be a good thing for specific cases. Please do!

Quote
2. Write the speed-critical parts of the application in C, build them as a static library, and link them with the other parts.

Well, first find out if it really matters, because C is considerably more unwieldy and laborious than Object Pascal.

Moreover, there are also downsides: e.g. you can't inline C code into Pascal or vice versa, while you can inline Pascal code into Pascal and C into C.

Quote
We can use these solutions until fpc's optimization is improved.

The best way to power up is to identify which bits hurt most and improve them in fpc. :P

ykot

« Reply #17 on: May 16, 2016, 04:47:44 pm »
Quote
Templates are macros in C++. So you can't ever call them. Unless you take the intermediate step of calling a C++ preprocessor. ...
Huh? Templates and macros are two different concepts and, typically, templates are processed by the compiler itself.

Thaddy

« Reply #18 on: May 16, 2016, 04:49:03 pm »
Quote
If you want C performance and the VCL library, use C++ Builder.

Quote
That's bad advice. You mean any C/C++ compiler other than C++ Builder when performance is an issue: MS Visual Studio C/C++ (free), the C/C++ parts of the GNU compiler suite (free), you name them...

Quote
C++ Builder now uses Clang for both Win32 and Win64. 8-)

So you are not referring to C++ Builder as a compiler, but actually mean it as an IDE (they used to do that), with Clang as the compiler? So you need to learn some more? As I told you... Stop being stupid. Don't mention IDEs out of context. Geany might be the perfect compiler for you in that case :) >:D 8-)
Most people that want to use threading should learn to patch their jeans first: use a needle.

Thaddy

« Reply #19 on: May 16, 2016, 04:52:58 pm »
Quote
Templates are macros in C++. So you can't ever call them. Unless you take the intermediate step of calling a C++ preprocessor. ...

Quote
Huh? Templates and macros are two different concepts and, typically, templates are processed by the compiler itself.

That's not true (different concepts, that is). E.g. at least GNU C++ calls out to CPP to resolve templates. They are never resolved at runtime. And what's more: they are macros per the definition of the standard up until now.

The proof, btw, is that templates reside in header files. No code is generated. Think of a kind of LISP macro: it isn't exactly like a standard macro, but a second-pass macro, because types need to be resolved.
C++ still can't do that at runtime by means of RTTI. (Maybe in the future it can, but that was shot down by the committee that gave us C++11.)
« Last Edit: May 16, 2016, 05:09:21 pm by Thaddy »

ykot

« Reply #20 on: May 16, 2016, 05:10:19 pm »
Quote
That's not true. E.g. at least GNU C++ calls out to CPP to resolve templates. They are never resolved at runtime. And what's more: they are macros per the definition of the standard up until now.
What do you mean by CPP? The C preprocessor? So you are saying the C preprocessor handles C++ templates? That sounds like something that would be quite difficult to achieve, so do you have any links to support your statement? :) You might want to do the same for the second proposition too.
The mere fact that neither templates nor macros are resolved at runtime does not prove that "Templates are macros in C++". (And actually templates, unless used for meta-programming, typically do involve compilation and runtime execution, whereas macros are handled by the preprocessor and are conceptually gone by the time actual compilation starts.) So unless you provide some references, I'm calling out your BS. :)

Edit after your edit:
Quote
The proof is btw that templates reside in header files. No code is generated.
Nope, sorry. Templates don't necessarily need to be placed in headers; they can go in ".cpp" files too. And even when they are in headers, that doesn't mean no code is generated. For instance, "std::string" is a class made entirely via templates, but being a fully functional string class, it does involve some code generation, you know. :) And I doubt it's something that goes through the C preprocessor...
« Last Edit: May 16, 2016, 05:15:20 pm by ykot »

Leledumbo

« Reply #21 on: May 16, 2016, 05:25:28 pm »
Quote
That's not true (different concepts, that is). E.g. at least GNU C++ calls out to CPP to resolve templates. They are never resolved at runtime. And what's more: they are macros per the definition of the standard up until now.
This is half true. CPP is not called to resolve templates; g++ does it internally. You can check by running g++ -E against a source file containing a template definition. However, you're correct that templates are not resolved at runtime: they get resolved at compile time, by g++ rather than by cpp. Perhaps cpp was involved in the past; I remember they were having problems with multiple template instantiations from the same set of template parameters. Probably they moved the processing stage to overcome that problem.

ykot

« Reply #22 on: May 16, 2016, 05:31:18 pm »
Thaddy, you probably meant that templates are similar to macros in that they are processed before code generation begins. But they are not the same thing, and unless we are talking about a C++ compiler that generates C code as an intermediate step, templates are typically handled as part of the compilation stage, especially when dealing with functional and meta-programming. Therefore, your proposition "Templates are macros in C++" doesn't make sense. That's all there is to it.

marcov

« Reply #23 on: May 16, 2016, 06:27:31 pm »
I've heard the term "token replay" for this kind of technique. Note that FPC's native generics are, afaik, also token replay.

ykot

« Reply #24 on: March 21, 2017, 08:03:11 pm »
It's been a year, but returning to the original topic, I've been trying some different compiler options.

I'm compiling 4x4 matrix multiplication code with FreePascal 3.1.1 from trunk, using the compiler flags -a, -O4, -CpCoreI, -CfSSE42, -OpCoreI and -OoFASTMATH. For the following code:
Code: Pascal
type
  TMatrix = record
    M: array[0..3, 0..3] of Single;

    class operator Multiply(const A, B: TMatrix): TMatrix;
  end;

// Result[J, I] is the dot product of row J of A and column I of B.
class operator TMatrix.Multiply(const A, B: TMatrix): TMatrix;
var
  I, J: Integer;
begin
  for J := 0 to 3 do
    for I := 0 to 3 do
      Result.M[J, I] := (A.M[J, 0] * B.M[0, I]) + (A.M[J, 1] * B.M[1, I]) + (A.M[J, 2] * B.M[2, I]) +
        (A.M[J, 3] * B.M[3, I]);
end;

The resulting assembly is:
Code:
.Lc1:
.seh_proc MAINFM$_$TMATRIX_$__$$_star$TMATRIX$TMATRIX$$TMATRIX
pushq %rbx
.seh_pushreg %rbx
.seh_endprologue
movl $-1,%ebx
.balign 8,0x90
.Lj5:
addl $1,%ebx
movl $-1,%r10d
.balign 8,0x90
.Lj8:
addl $1,%r10d
movq %r8,%rax
movl %r10d,%r9d
movl %ebx,%r11d
shlq $4,%r11
leaq (%rdx,%r11),%r11
movss (%r11),%xmm1
mulss (%rax,%r9,4),%xmm1
movl %r10d,%r9d
movss 4(%r11),%xmm0
mulss 16(%rax,%r9,4),%xmm0
addss %xmm1,%xmm0
movl %r10d,%r9d
movss 8(%r11),%xmm1
mulss 32(%rax,%r9,4),%xmm1
addss %xmm0,%xmm1
movl %r10d,%r9d
movss 12(%r11),%xmm0
mulss 48(%rax,%r9,4),%xmm0
addss %xmm1,%xmm0
movl %ebx,%eax
shlq $4,%rax
movl %r10d,%r9d
leaq (%rcx,%rax),%rax
movss %xmm0,(%rax,%r9,4)
cmpl $3,%r10d
jnge .Lj8
cmpl $3,%ebx
jnge .Lj5
popq %rbx
ret
.seh_endproc
.Lc2:
It appears that in the code above the loops were not unrolled, so the assembly has two nested loops and elements are multiplied one at a time. I'm quite unhappy with that code.

I've tried a different version, with the loops already unrolled by hand:
Code: Pascal
class operator TMatrix.Multiply(const A, B: TMatrix): TMatrix;
begin
  Result.M[0, 0] := A.M[0, 0] * B.M[0, 0] + A.M[0, 1] * B.M[1, 0] + A.M[0, 2] * B.M[2, 0] + A.M[0, 3] * B.M[3, 0];
  Result.M[0, 1] := A.M[0, 0] * B.M[0, 1] + A.M[0, 1] * B.M[1, 1] + A.M[0, 2] * B.M[2, 1] + A.M[0, 3] * B.M[3, 1];
  Result.M[0, 2] := A.M[0, 0] * B.M[0, 2] + A.M[0, 1] * B.M[1, 2] + A.M[0, 2] * B.M[2, 2] + A.M[0, 3] * B.M[3, 2];
  Result.M[0, 3] := A.M[0, 0] * B.M[0, 3] + A.M[0, 1] * B.M[1, 3] + A.M[0, 2] * B.M[2, 3] + A.M[0, 3] * B.M[3, 3];
  Result.M[1, 0] := A.M[1, 0] * B.M[0, 0] + A.M[1, 1] * B.M[1, 0] + A.M[1, 2] * B.M[2, 0] + A.M[1, 3] * B.M[3, 0];
  Result.M[1, 1] := A.M[1, 0] * B.M[0, 1] + A.M[1, 1] * B.M[1, 1] + A.M[1, 2] * B.M[2, 1] + A.M[1, 3] * B.M[3, 1];
  Result.M[1, 2] := A.M[1, 0] * B.M[0, 2] + A.M[1, 1] * B.M[1, 2] + A.M[1, 2] * B.M[2, 2] + A.M[1, 3] * B.M[3, 2];
  Result.M[1, 3] := A.M[1, 0] * B.M[0, 3] + A.M[1, 1] * B.M[1, 3] + A.M[1, 2] * B.M[2, 3] + A.M[1, 3] * B.M[3, 3];
  Result.M[2, 0] := A.M[2, 0] * B.M[0, 0] + A.M[2, 1] * B.M[1, 0] + A.M[2, 2] * B.M[2, 0] + A.M[2, 3] * B.M[3, 0];
  Result.M[2, 1] := A.M[2, 0] * B.M[0, 1] + A.M[2, 1] * B.M[1, 1] + A.M[2, 2] * B.M[2, 1] + A.M[2, 3] * B.M[3, 1];
  Result.M[2, 2] := A.M[2, 0] * B.M[0, 2] + A.M[2, 1] * B.M[1, 2] + A.M[2, 2] * B.M[2, 2] + A.M[2, 3] * B.M[3, 2];
  Result.M[2, 3] := A.M[2, 0] * B.M[0, 3] + A.M[2, 1] * B.M[1, 3] + A.M[2, 2] * B.M[2, 3] + A.M[2, 3] * B.M[3, 3];
  Result.M[3, 0] := A.M[3, 0] * B.M[0, 0] + A.M[3, 1] * B.M[1, 0] + A.M[3, 2] * B.M[2, 0] + A.M[3, 3] * B.M[3, 0];
  Result.M[3, 1] := A.M[3, 0] * B.M[0, 1] + A.M[3, 1] * B.M[1, 1] + A.M[3, 2] * B.M[2, 1] + A.M[3, 3] * B.M[3, 1];
  Result.M[3, 2] := A.M[3, 0] * B.M[0, 2] + A.M[3, 1] * B.M[1, 2] + A.M[3, 2] * B.M[2, 2] + A.M[3, 3] * B.M[3, 2];
  Result.M[3, 3] := A.M[3, 0] * B.M[0, 3] + A.M[3, 1] * B.M[1, 3] + A.M[3, 2] * B.M[2, 3] + A.M[3, 3] * B.M[3, 3];
end;

The resulting assembly is:
Code:
movq %rcx,%rax
movq %rdx,%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss (%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 16(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 32(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 48(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,(%rax)
movq %rdx,%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 4(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 20(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 36(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 52(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,4(%rax)
movq %rdx,%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 8(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 24(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 40(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 56(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,8(%rax)
movq %rdx,%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 12(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 28(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 44(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 60(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,12(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss (%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 16(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 32(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 48(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,16(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 4(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 20(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 36(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 52(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,20(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 8(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 24(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 40(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 56(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,24(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 12(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 28(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 44(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 60(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,28(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss (%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 16(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 32(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 48(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,32(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 4(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 20(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 36(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 52(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,36(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 8(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 24(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 40(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 56(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,40(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 12(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 28(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 44(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 60(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,44(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss (%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 16(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 32(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 48(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,48(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 4(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 20(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 36(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 52(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,52(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 8(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 24(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 40(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 56(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,56(%rax)
leaq 48(%rdx),%rdx
movss (%rdx),%xmm1
mulss 12(%r8),%xmm1
movss 4(%rdx),%xmm0
mulss 28(%r8),%xmm0
addss %xmm1,%xmm0
movss 8(%rdx),%xmm1
mulss 44(%r8),%xmm1
addss %xmm0,%xmm1
movss 12(%rdx),%xmm0
mulss 60(%r8),%xmm0
addss %xmm1,%xmm0
movss %xmm0,60(%rax)
ret
The assembly above is very long and painful to read. Given the title of this topic, the question is how to get code optimization comparable to what can be achieved in C++.

For the equivalent C++ code, the compiler (Clang) correctly unrolls the loops and multiplies several elements at a time, with around 71 instructions vs 200+ from FPC.
C++ version with SSE4.2
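
The linked C++ source is not reproduced in this text. As a rough idea of what an equivalent routine looks like, here is a sketch under my own assumptions: `Matrix4` and its row-major layout are chosen to mirror `TMatrix`, and this is not necessarily the exact code behind the link. Compiling it with something like `clang++ -O3 -msse4.2 -ffast-math -S` should show the generated assembly, though the instruction count will vary with the compiler version.

```cpp
#include <cassert>

// Hypothetical counterpart of TMatrix: 4x4 single-precision floats, row-major.
struct Matrix4 {
    float m[4][4];
};

// Same row-by-column loops as the Pascal operator. The trip counts are
// compile-time constants, which lets an optimizing compiler unroll the
// loops and process several elements of a row per instruction.
Matrix4 operator*(const Matrix4& a, const Matrix4& b) {
    Matrix4 r;
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            r.m[j][i] = a.m[j][0] * b.m[0][i] + a.m[j][1] * b.m[1][i] +
                        a.m[j][2] * b.m[2][i] + a.m[j][3] * b.m[3][i];
    return r;
}
```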

Changing the FPC compiler options to use AVX instructions (which is what I tried first) doesn't improve things much (-CpCoreAVX, -CfAVX, -OpCoreAVX, -OoFASTMATH, -O4):

Code:
movq %rcx,%rax
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,4(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,8(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 12(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 28(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 44(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 60(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,12(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,16(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,20(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,24(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 12(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 28(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 44(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 60(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,28(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,32(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,36(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,40(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 12(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 28(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 44(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 60(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,44(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,48(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,52(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,56(%rax)
leaq 48(%rdx),%rdx
vmovss (%rdx),%xmm0
vmulss 12(%r8),%xmm0,%xmm1
vmovss 4(%rdx),%xmm0
vmulss 28(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rdx),%xmm0
vmulss 44(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rdx),%xmm0
vmulss 60(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,60(%rax)

The equivalent C++ code produces around 53 instructions vs 224 from FPC (more than 4 times fewer, and likely faster by a similar margin):
C++ version compiled with AVX

Is there a compiler flag that I can try to improve the aforementioned results?
« Last Edit: March 21, 2017, 08:05:15 pm by ykot »

Thaddy

« Reply #25 on: March 21, 2017, 08:37:01 pm »
It is all about a SPECIFIC C++ compiler versus a SPECIFIC Pascal compiler.
That means nothing in the context of a language. Plz compare like for like.

E.g. the FPC JVM backend is just as good as the Java backend, because it is the same...
Which means that FPC compiled for the JVM is just as fast as Java...

Get it out of your silly heads that code execution speed has ANYTHING at all to do with the higher-level language. It just means that there is room for improvement in the FPC native backend compared to SOME but not ALL C++ compilers...

But cross-compiling to another high-level language isn't a bad idea, albeit an intermediate one.
« Last Edit: March 21, 2017, 08:48:43 pm by Thaddy »

marcov

« Reply #26 on: March 21, 2017, 08:46:20 pm »
If you care so much about this specific routine, simply convert the generated C++ routine to inline Pascal?

That's what it is there for.

Thaddy

« Reply #27 on: March 21, 2017, 08:50:22 pm »
Quote
If you care so much about this specific routine, simply convert the generated C++ routine to inline Pascal?

That's what it is there for.
I tend to examine the assembler output and optimize based on what the compiler thinks... I am often wrong... More often than not.

ykot

« Reply #28 on: March 21, 2017, 09:52:42 pm »
Quote
It is all about a SPECIFIC C++ compiler versus a SPECIFIC pascal compiler.
That means nothing in the context of a language. Plz compare like for likes.
I've asked a concrete question, backed up by evidence and on-topic for this thread. Note that the Clang results are consistent with GCC and MSVC in the sense that all three are considerably more optimized than the output produced by FreePascal; on the links I've provided you can change compilers/platforms (Clicky for MSVC 2017). Blind fanboyism in saying that "this is nothing" doesn't really help anyone. If it's any relief to you, the aforementioned code compiled for x64 in Delphi 10.1 is actually quite a bit bigger than what FreePascal produces, but that's off-topic.

Quote
If you care so much about this specific routine, simply convert the generated C++ routine to inline Pascal?
I've got a whole cross-platform library full of routines like this for 2D, 3D and 4D vectors, 3x3 and 4x4 matrices and quaternions. Rewriting all the methods in assembly for all platforms and target CPUs is way too much effort, so I was hoping there would be a compiler flag I missed, or some way of re-ordering the code to help the compiler optimize it. Any suggestions are definitely appreciated.

For instance, the following "vector by matrix" transformation code:
Code: Pascal
type
  TVector = record
    X, Y, Z, W: Single;
  end;

  TMatrix = record
    M: array[0..3, 0..3] of Single;
  end;

// Row vector times matrix: each result component is V dotted with one column of M.
function Transform(const V: TVector; const M: TMatrix): TVector;
begin
  Result.X := V.X * M.M[0, 0] + V.Y * M.M[1, 0] + V.Z * M.M[2, 0] + V.W * M.M[3, 0];
  Result.Y := V.X * M.M[0, 1] + V.Y * M.M[1, 1] + V.Z * M.M[2, 1] + V.W * M.M[3, 1];
  Result.Z := V.X * M.M[0, 2] + V.Y * M.M[1, 2] + V.Z * M.M[2, 2] + V.W * M.M[3, 2];
  Result.W := V.X * M.M[0, 3] + V.Y * M.M[1, 3] + V.Z * M.M[2, 3] + V.W * M.M[3, 3];
end;

Results in the following assembly:
Code:
movq %rcx,%rax
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,4(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,8(%rax)
vmovss (%rdx),%xmm0
vmulss 12(%r8),%xmm0,%xmm1
vmovss 4(%rdx),%xmm0
vmulss 28(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rdx),%xmm0
vmulss 44(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rdx),%xmm0
vmulss 60(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,12(%rax)
ret
The FreePascal assembly output above has 16 multiplication instructions, versus 8 multiplications in the Clang output and only 4 in the GCC output. If there were a way to tweak the code to improve the assembly, that would be great.
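
One re-ordering that tends to suit vectorizing compilers is to accumulate whole matrix rows, each scaled by one vector component, instead of computing one dot product per output component. The sketch below shows both shapes in C++, since that is what the Clang/GCC comparison was built from. `Vec4`, `Mat4`, `transform_dots` and `transform_rows` are illustrative names of my own, and whether FPC's optimizer gains anything from the same rewrite is untested.

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; };

// Reference form: one dot product per output component (16 scalar multiplies).
Vec4 transform_dots(const Vec4& v, const Mat4& a) {
    return {
        v.x * a.m[0][0] + v.y * a.m[1][0] + v.z * a.m[2][0] + v.w * a.m[3][0],
        v.x * a.m[0][1] + v.y * a.m[1][1] + v.z * a.m[2][1] + v.w * a.m[3][1],
        v.x * a.m[0][2] + v.y * a.m[1][2] + v.z * a.m[2][2] + v.w * a.m[3][2],
        v.x * a.m[0][3] + v.y * a.m[1][3] + v.z * a.m[2][3] + v.w * a.m[3][3]};
}

// Re-ordered form: result = x*row0 + y*row1 + z*row2 + w*row3.
// The inner loop touches four contiguous floats of one row, so a
// vectorizing compiler can turn it into one 4-wide multiply per row.
Vec4 transform_rows(const Vec4& v, const Mat4& a) {
    const float c[4] = {v.x, v.y, v.z, v.w};
    float r[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int k = 0; k < 4; ++k)
        for (int i = 0; i < 4; ++i)
            r[i] += c[k] * a.m[k][i];
    return {r[0], r[1], r[2], r[3]};
}
```

Both functions compute the same values; only the order of the work differs, which is the kind of change that can be tried without dropping to assembly.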
« Last Edit: March 21, 2017, 09:57:59 pm by ykot »

Martin_fr

« Reply #29 on: March 22, 2017, 01:18:14 am »
Quote
so I was hoping there would be a compiler flag I missed, or maybe re-ordering the code somehow to help the compiler optimizing it. Any suggestions are definitely appreciated.

Not aware of any flags...

Also, all of the below may depend on the specific fpc version, and results may change.

Look at http://bugs.freepascal.org/view.php?id=10275
I used all the flags that made a difference.
I think I enabled SSE, as some instructions could use those registers, but I'm not sure.

As for register usage: I found that code like
Code: Pascal
for i ..... do begin
end;
// next loop
for i := ...
got better results if each loop was in a function of its own (but again, this is very dependent on the fpc version).

All the other "optimizations" I made in that issue were based on coding experience from the 80s...
That is, instead of indexing the array,
a[x][y]
calculate a pointer to the first entry,
p = @a[x][y]
and then increment it, for fields and rows (look at the rewritten code).
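
The shape of that rewrite, sketched in C++ only as a stand-in for the Pascal in the bug report (which is not reproduced here): `sum_indexed` and `sum_pointer` are made-up names, and the matrix is assumed to be stored row by row in one flat buffer. Modern C and C++ compilers usually perform this transformation themselves, so the sketch illustrates the idea and is not a speed-up recipe.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t ROWS = 4;
const std::size_t COLS = 4;

// Indexed form: every access recomputes the element offset from (y, x).
float sum_indexed(const float* a) {
    float s = 0.0f;
    for (std::size_t y = 0; y < ROWS; ++y)
        for (std::size_t x = 0; x < COLS; ++x)
            s += a[y * COLS + x];
    return s;
}

// Pointer form: take the address of the first entry once, then only
// increment it while walking through all fields of all rows.
float sum_pointer(const float* a) {
    const float* p = a;
    float s = 0.0f;
    for (std::size_t n = 0; n < ROWS * COLS; ++n)
        s += *p++;
    return s;
}
```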

Also some effort went into using as few variables in a loop as possible...

This will be some work to undertake.... Not sure if it helps you.

------------------
OFF TOPIC (and you may already know/do this)
If you do matrix operations, there is a second factor: the algorithm, and the order in which elements are accessed (you'll have to google that).
E.g. if you access values from different rows, your CPU may incur a lot of cache misses, and that costs time.
So you want to organize the order in which you access cells so that the data is found in the CPU cache as often as possible (again, google it; I do not recall the details, and it depends on what you do with the matrix).
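
A small sketch of what "access order" means here, assuming a square matrix stored row by row in one contiguous buffer (the function names are made up for illustration). Both functions return the same sum; they differ only in which loop is innermost. How large the timing gap is depends on the matrix size and the CPU, and it has not been measured here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inner loop walks consecutive addresses within one row,
// so neighbouring accesses share cache lines.
double sum_row_order(const std::vector<double>& a, std::size_t n) {
    double s = 0.0;
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            s += a[r * n + c];
    return s;
}

// Same elements, but the inner loop jumps n elements per step (a new row
// each time), so for large n successive accesses hit different cache lines.
double sum_column_order(const std::vector<double>& a, std::size_t n) {
    double s = 0.0;
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t r = 0; r < n; ++r)
            s += a[r * n + c];
    return s;
}
```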

Also on that topic, google "memory oriented coding" if interested.