/* Original benchmark: count the primes between 2 and 1999 (isPrime is defined elsewhere). */
int main() {
    int count = 0;
    for (int i = 2; i < 2000; i++)
        if (isPrime(i)) {
            count++;
        }
    return 0;
}
/* count is never used, so the optimizer may drop it and keep only the calls: */
int main() {
    for (int i = 2; i < 2000; i++) isPrime(i);
    return 0;
}
/* isPrime has no side effects, so the calls can go as well: */
int main() {
    for (int i = 2; i < 2000; i++) ;
    return 0;
}
/* ...and finally the empty loop is removed too: */
int main() {
    return 0;
}
I wrote some code to find the prime numbers between 2 and 1999 in C and in Free Pascal and compiled both; the execution times were the same, but when I compile the C code with -Ofast and the Pascal code with all optimization options enabled, the C version is far faster than the Free Pascal one.
Why?
Yes, I know.
But why doesn't the Free Pascal compiler optimize the code down to
begin end.
?
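As the successive main() bodies above illustrate, GCC with -Ofast is allowed to discard the whole computation because count is never used, which is why the C binary looks "too fast". To compare the two compilers fairly, the benchmark has to consume its result. A minimal Free Pascal sketch of such a benchmark (the isPrime implementation here is only an illustration, it is not code from this thread):

{$mode objfpc}
program CountPrimes;

function isPrime(n: LongInt): Boolean;
var
  d: LongInt;
begin
  Result := n >= 2;
  d := 2;
  while Result and (d * d <= n) do
  begin
    if n mod d = 0 then
      Result := False;
    Inc(d);
  end;
end;

var
  i, count: LongInt;
begin
  count := 0;
  for i := 2 to 1999 do
    if isPrime(i) then
      Inc(count);
  { Printing the count forces every compiler to actually do the work. }
  WriteLn('primes below 2000: ', count);
end.

The same applies to the C version: print count (or return it from main) and the comparison measures real work instead of dead-code elimination.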
IMHO it would be better to improve interop with GCC. So we can call Classes, Templates etc.
FPC already uses some parts from GCC like linker I think. So might as well do the rest.
I wouldn't call binutils a part of GCC...
FPC already uses some parts from GCC like linker I think.
If you want C performance and VCL library use C++ Builder.
IMHO it would be better to improve interop with GCC. So we can call Classes, Templates etc.
Templates are macros in C++. So you can't ever call them. Unless you take the intermediate step of calling a C++ preprocessor. Which would kill the performance of the compiler. Again, you have some silly ideas about how some things work. I suggest you take up my earlier advice and study a bit more. Frankly, you are silly. >:D 8-) O:-)
If you want C performance and VCL library use C++ Builder.
That's bad advice. You mean any C/C++ compiler other than C++Builder when performance is an issue: MS Visual Studio C/C++ (free), the GNU compiler suite's C/C++ parts (free), you name them...
Based on your posts, I think we have two solutions to speed up applications.
1. Add a part to FPC that compiles code to C instead of assembly, to use the power of the GCC compiler.
2. Write the speed-critical parts of the application in C, build them as a static library, and link them with the other parts (a sketch follows below).
We can use these solutions until FPC's optimization catches up.
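A hedged sketch of option 2 (the file names and the isPrime signature are assumptions, not taken from this thread): compile the hot routine with GCC into an object file and link it into the Pascal program.

{ prime.c contains a plain C function "int isPrime(int n)" and is built with
  "gcc -O2 -c prime.c". The Pascal side links the resulting object file and
  declares the routine as external: }
{$mode objfpc}
program UsesCPart;

{$L prime.o}  { pull in the GCC-compiled object file }

function isPrime(n: LongInt): LongInt; cdecl; external name 'isPrime';

var
  i, count: LongInt;
begin
  count := 0;
  for i := 2 to 1999 do
    if isPrime(i) <> 0 then
      Inc(count);
  WriteLn(count);
end.

Depending on what the C code uses internally, linking against the C runtime (for example with {$LINKLIB c}) may also be needed.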
Templates are macros in C++. So you can't ever call them. Unless you take the intermediate step of calling a C++ preprocessor. ...
Huh? Templates and macros are two different concepts and, typically, templates are processed by the compiler itself.
If you want C performance and VCL library use C++ Builder.
That's bad advice. You mean any C/C++ compiler other than C++Builder when performance is an issue: MS Visual Studio C/C++ (free), the GNU compiler suite's C/C++ parts (free), you name them...
C++ Builder now uses Clang for both win32 and 64. 8-)
Templates are macros in C++. So you can't ever call them. Unless you take the intermediate step of calling a C++ preprocessor. ...
Huh? Templates and macros are two different concepts and, typically, templates are processed by the compiler itself.
That's not true. E.g. at least GNU C++ calls out to CPP to resolve templates. They are never resolved at runtime. And even: they are macros per definition of the standard up until now.
What do you mean by CPP? The C preprocessor? So you are saying the C preprocessor handles C++ templates? This sounds like something that would be quite difficult to achieve, so any links to support your statement? :) You might want to do the same for the second proposition too.
The proof is btw that templates reside in header files. No code is generated.
Nope, sorry. They don't necessarily need to be placed in headers; they can go in ".cpp" files too. And even so, it doesn't mean no code is generated. For instance, "std::string" is a class made entirely via templates, but being a fully functional string class, it does involve some code generation, you know. :) And I doubt it's something that goes through the C preprocessor...
That's not true (different concepts, that is). E.g. at least GNU C++ calls out to CPP to resolve templates. They are never resolved at runtime. And even: they are macros per definition of the standard up until now.
This is half true. CPP is not called to resolve templates; g++ does it internally. You can try running g++ -E against a source file that has a template definition. However, you're correct that templates are not resolved at runtime: they get resolved at compile time by g++, not by cpp. Perhaps it was done that way in the past; I remember they had problems with multiple template instantiations from the same set of template parameters, and probably moved the processing stage to overcome that.
It's been a year, but retaking this original topic, I'm trying some different compiler options.
I'm compiling 4x4 matrix multiplication code with FreePascal 3.1.1 from the trunk, compiler flags (-a, -O4, -CpCoreI, -CfSSE42, -OpCoreI, -OoFASTMATH). For the following code:

type
  TMatrix = record
    M: array[0..3, 0..3] of Single;
    class operator Multiply(const A, B: TMatrix): TMatrix;
  end;

class operator TMatrix.Multiply(const A, B: TMatrix): TMatrix;
var
  I, J: Integer;
begin
  for J := 0 to 3 do
    for I := 0 to 3 do
      Result.M[J, I] := (A.M[J, 0] * B.M[0, I]) + (A.M[J, 1] * B.M[1, I]) +
        (A.M[J, 2] * B.M[2, I]) + (A.M[J, 3] * B.M[3, I]);
end;

The resulting assembly is:

.Lc1:
.seh_proc MAINFM$_$TMATRIX_$__$$_star$TMATRIX$TMATRIX$$TMATRIX
pushq %rbx
.seh_pushreg %rbx
.seh_endprologue
movl $-1,%ebx
.balign 8,0x90
.Lj5:
addl $1,%ebx
movl $-1,%r10d
.balign 8,0x90
.Lj8:
addl $1,%r10d
movq %r8,%rax
movl %r10d,%r9d
movl %ebx,%r11d
shlq $4,%r11
leaq (%rdx,%r11),%r11
movss (%r11),%xmm1
mulss (%rax,%r9,4),%xmm1
movl %r10d,%r9d
movss 4(%r11),%xmm0
mulss 16(%rax,%r9,4),%xmm0
addss %xmm1,%xmm0
movl %r10d,%r9d
movss 8(%r11),%xmm1
mulss 32(%rax,%r9,4),%xmm1
addss %xmm0,%xmm1
movl %r10d,%r9d
movss 12(%r11),%xmm0
mulss 48(%rax,%r9,4),%xmm0
addss %xmm1,%xmm0
movl %ebx,%eax
shlq $4,%rax
movl %r10d,%r9d
leaq (%rcx,%rax),%rax
movss %xmm0,(%rax,%r9,4)
cmpl $3,%r10d
jnge .Lj8
cmpl $3,%ebx
jnge .Lj5
popq %rbx
ret
.seh_endproc
.Lc2:
It appears that in the aforementioned code the loops were not unrolled, so the assembly has 2 loops and the elements are multiplied one at a time. I'm quite unhappy with that code.
movq %rcx,%rax
movq %rdx,%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss (%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 16(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 32(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 48(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,(%rax)
movq %rdx,%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 4(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 20(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 36(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 52(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,4(%rax)
movq %rdx,%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 8(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 24(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 40(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 56(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,8(%rax)
movq %rdx,%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 12(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 28(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 44(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 60(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,12(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss (%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 16(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 32(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 48(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,16(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 4(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 20(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 36(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 52(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,20(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 8(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 24(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 40(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 56(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,24(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 12(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 28(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 44(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 60(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,28(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss (%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 16(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 32(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 48(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,32(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 4(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 20(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 36(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 52(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,36(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 8(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 24(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 40(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 56(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,40(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 12(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 28(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 44(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 60(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,44(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss (%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 16(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 32(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 48(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,48(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 4(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 20(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 36(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 52(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,52(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 8(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 24(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 40(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 56(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,56(%rax)
leaq 48(%rdx),%rdx
movss (%rdx),%xmm1
mulss 12(%r8),%xmm1
movss 4(%rdx),%xmm0
mulss 28(%r8),%xmm0
addss %xmm1,%xmm0
movss 8(%rdx),%xmm1
mulss 44(%r8),%xmm1
addss %xmm0,%xmm1
movss 12(%rdx),%xmm0
mulss 60(%r8),%xmm0
addss %xmm1,%xmm0
movss %xmm0,60(%rax)
ret
The assembly code above is very long and painful. The question, given the title of this topic, is how to get code optimization comparable to what can be achieved in C++.
movq %rcx,%rax
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,4(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,8(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 12(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 28(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 44(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 60(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,12(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,16(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,20(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,24(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 12(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 28(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 44(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 60(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,28(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,32(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,36(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,40(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 12(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 28(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 44(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 60(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,44(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,48(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,52(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,56(%rax)
leaq 48(%rdx),%rdx
vmovss (%rdx),%xmm0
vmulss 12(%r8),%xmm0,%xmm1
vmovss 4(%rdx),%xmm0
vmulss 28(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rdx),%xmm0
vmulss 44(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rdx),%xmm0
vmulss 60(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,60(%rax)
If you care so much about this specific routine, simply convert the generated C++ routine to inline Pascal?
I used to examine assembler output and optimize based on what the compiler thinks... I am often wrong... More often than not.
That's what it is there for.
It is all about a SPECIFIC C++ compiler versus a SPECIFIC Pascal compiler.
I've asked a concrete question backed up by evidence and on-topic to this thread. Note that the Clang results are consistent with GCC and MSVC in the sense that they are considerably more optimized than the output produced by FreePascal - on the links I've provided you can change compilers/platforms (Clicky for MSVC 2017 (https://godbolt.org/g/cNdK9O)). Blind fanboyism in saying that "this is nothing" doesn't really help anyone. As a relief for you, the aforementioned code compiled to x64 in Delphi 10.1 is actually quite a bit bigger than the one produced by FreePascal, but that's off-topic.
That means nothing in the context of a language. Plz compare like for likes.
If you care so much about this specific routine, simply convert the generated C++ routine to inline Pascal?
I've got a whole cross-platform library full of routines like this for 2D, 3D and 4D vectors, 3x3 and 4x4 matrices and quaternions. Rewriting all the methods in assembly for all platforms and target CPUs is way too much effort, so I was hoping there would be a compiler flag I missed, or maybe re-ordering the code somehow to help the compiler optimize it. Any suggestions are definitely appreciated.
movq %rcx,%rax
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,4(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,8(%rax)
vmovss (%rdx),%xmm0
vmulss 12(%r8),%xmm0,%xmm1
vmovss 4(%rdx),%xmm0
vmulss 28(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rdx),%xmm0
vmulss 44(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rdx),%xmm0
vmulss 60(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,12(%rax)
ret
The aforementioned FreePascal assembly output has 16 multiplication instructions, versus 8 multiplications in the Clang output (https://godbolt.org/g/WiFKwb) or only 4 multiplications in the GCC output (https://godbolt.org/g/n3nGB5). If there is a way to tweak the code to improve the assembly, that would be great.
so I was hoping there would be a compiler flag I missed, or maybe re-ordering the code somehow to help the compiler optimize it. Any suggestions are definitely appreciated.
The equivalent C++ code produces around 53 instructions vs 224 from FPC (more than 4 times smaller and likely similarly faster):
From my experience (using FPC since 2005), FPC is a cross-platform compiler, and it is apparently relatively easy to extend to new platforms due to the way it is designed and written. It is also designed to be easy to maintain. With all that comes a trade-off, and that trade-off is performance. It is clear from many examples that FPC doesn't generate very good performing code, but it does give you consistent cross-platform compilation support. Delphi, Kylix, GCC, Clang etc. all run circles around FPC. This should not come as a surprise considering how small the development team is compared to GCC or Clang, or that Delphi and Kylix were designed with one CPU target in mind.
# [12] Result[0] := V[0] * M.M[0, 0] + V[1] * M.M[1, 0] + V[2] * M.M[2, 0] + V[3] * M.M[3, 0];
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,(%rax)
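For reference, a sketch of the Pascal routine the "# [12]" comment above appears to belong to; only the Result[0] line is quoted by the compiler, the remaining rows follow the same pattern, and the names TVector and Transform are assumptions (the thread only shows TMatrix):

type
  TVector = array[0..3] of Single;

function Transform(const V: TVector; const M: TMatrix): TVector;
begin
  Result[0] := V[0] * M.M[0, 0] + V[1] * M.M[1, 0] + V[2] * M.M[2, 0] + V[3] * M.M[3, 0];
  Result[1] := V[0] * M.M[0, 1] + V[1] * M.M[1, 1] + V[2] * M.M[2, 1] + V[3] * M.M[3, 1];
  Result[2] := V[0] * M.M[0, 2] + V[1] * M.M[1, 2] + V[2] * M.M[2, 2] + V[3] * M.M[3, 2];
  Result[3] := V[0] * M.M[0, 3] + V[1] * M.M[1, 3] + V[2] * M.M[2, 3] + V[3] * M.M[3, 3];
end;

Written like this it needs 16 scalar multiplications, which matches the instruction count mentioned above; Clang and GCC get that down to 8 or 4 multiply instructions, presumably by using packed (SIMD) multiplies.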
Please note that the compiler does not unroll loops by default (except for MIPS at -O3 it seems, though I don't know why); you need to enable it manually with -OoLOOPUNROLL. But even then the contents of the loop you have are considered too complex, as only rather small loop bodies will be unrolled (though one might argue that the compiler currently considers this loop body as more complex than it probably should).
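Given that, one workaround that sidesteps loop unrolling entirely is to unroll the inner loop of the Multiply operator by hand (a sketch based on the TMatrix code quoted earlier; whether it really beats the looped version has to be checked against the -a output):

class operator TMatrix.Multiply(const A, B: TMatrix): TMatrix;
var
  J: Integer;
begin
  { Inner loop over I written out by hand; only the outer loop remains. }
  for J := 0 to 3 do
  begin
    Result.M[J, 0] := A.M[J, 0] * B.M[0, 0] + A.M[J, 1] * B.M[1, 0] + A.M[J, 2] * B.M[2, 0] + A.M[J, 3] * B.M[3, 0];
    Result.M[J, 1] := A.M[J, 0] * B.M[0, 1] + A.M[J, 1] * B.M[1, 1] + A.M[J, 2] * B.M[2, 1] + A.M[J, 3] * B.M[3, 1];
    Result.M[J, 2] := A.M[J, 0] * B.M[0, 2] + A.M[J, 1] * B.M[1, 2] + A.M[J, 2] * B.M[2, 2] + A.M[J, 3] * B.M[3, 2];
    Result.M[J, 3] := A.M[J, 0] * B.M[0, 3] + A.M[J, 1] * B.M[1, 3] + A.M[J, 2] * B.M[2, 3] + A.M[J, 3] * B.M[3, 3];
  end;
end;

This mainly removes the runtime index arithmetic on I and the inner-loop bookkeeping, which is where the looped assembly above spends a good part of its instructions.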
Btw, maybe the code I posted yesterday can be rearranged so that one could use something like:
r1:=v*m.m[0]; r2:=v*m.m[1]; r3:=v*m.m[2]; r4:=v*m.m[3]; r1:=r1+r2; r3:=r3+r4; r1:=r1+r3;
movq %rcx,%rax
movdqa (%rdx),%xmm0
mulps (%r8),%xmm0
movdqa (%rdx),%xmm0
mulps 16(%r8),%xmm0
movdqa (%rdx),%xmm0
mulps 32(%r8),%xmm0
movdqa (%rdx),%xmm0
mulps 48(%r8),%xmm0
movdqa (%rsp),%xmm0
addps 16(%rsp),%xmm0
movdqa 32(%rsp),%xmm0
addps 48(%rsp),%xmm0
movdqa (%rsp),%xmm0
addps 32(%rsp),%xmm0
vmovups (%rsp),%xmm0
vmovups %xmm0,(%rax)
leaq 72(%rsp),%rsp
ret
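The "r1 := v*m.m[0]" style in the suggestion above presupposes a vector type with element-wise operators. A minimal sketch of such helpers (the name TVec4 and the idea of storing the matrix as four TVec4 rows are assumptions, not code from this thread):

{$mode objfpc}
type
  TVec4 = array[0..3] of Single;

operator * (const A, B: TVec4) R: TVec4;
var
  K: Integer;
begin
  { element-wise product }
  for K := 0 to 3 do
    R[K] := A[K] * B[K];
end;

operator + (const A, B: TVec4) R: TVec4;
var
  K: Integer;
begin
  { element-wise sum }
  for K := 0 to 3 do
    R[K] := A[K] + B[K];
end;

With the matrix rows stored as TVec4 values, expressions like r1 := v * m.m[0] become valid Pascal; whether the compiler then emits the packed mulps/addps seen in the listing above is exactly the optimization question this thread is about.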