/* Original benchmark: count the primes between 2 and 1999 (isPrime is defined elsewhere). */
int main() {
    int count = 0;
    for (int i = 2; i < 2000; i++)
        if (isPrime(i)) {
            count++;
        }
    return 0;
}
/* count is never used, so the optimizer may drop it and keep only the calls: */
int main() {
    for (int i = 2; i < 2000; i++) isPrime(i);
    return 0;
}
/* isPrime has no side effects, so the calls can go as well: */
int main() {
    for (int i = 2; i < 2000; i++) ;
    return 0;
}
/* ...and finally the empty loop is removed too: */
int main() {
    return 0;
}
I wrote some code to find the prime numbers between 2 and 1999 in C and in Free Pascal and compiled both; the execution times were the same, but when I compile the C code with -Ofast and the Pascal code with all optimization options enabled, the C version is far faster than the Free Pascal one.
Why?
Yes, I know.
But why doesn't the Free Pascal compiler optimize the code down to
begin end.
?
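As the successive main() bodies above illustrate, GCC with -Ofast is allowed to discard the whole computation because count is never used, which is why the C binary looks "too fast". To compare the two compilers fairly, the benchmark has to consume its result. A minimal Free Pascal sketch of such a benchmark (the isPrime implementation here is only an illustration, it is not code from this thread):

{$mode objfpc}
program CountPrimes;

function isPrime(n: LongInt): Boolean;
var
  d: LongInt;
begin
  Result := n >= 2;
  d := 2;
  while Result and (d * d <= n) do
  begin
    if n mod d = 0 then
      Result := False;
    Inc(d);
  end;
end;

var
  i, count: LongInt;
begin
  count := 0;
  for i := 2 to 1999 do
    if isPrime(i) then
      Inc(count);
  { Printing the count forces every compiler to actually do the work. }
  WriteLn('primes below 2000: ', count);
end.

The same applies to the C version: print count (or return it from main) and the comparison measures real work instead of dead-code elimination.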
IMHO it would be better to improve interop with GCC. So we can call Classes, Templates etc.
FPC already uses some parts from GCC like linker I think. So might as well do the rest.
I wouldn't call binutils a part of GCC...
FPC already uses some parts from GCC like linker I think.
If you want C performance and VCL library use C++ Builder.
IMHO it would be better to improve interop with GCC. So we can call Classes, Templates etc.
Templates are macros in C++. So you can't ever call them. Unless you take the intermediate step of calling a C++ preprocessor. Which would kill the performance of the compiler. Again, you have some silly ideas about how some things work. I suggest you take up my earlier advice and study a bit more. Frankly, you are silly. >:D 8-) O:-)
If you want C performance and VCL library use C++ Builder.
That's bad advice. You mean any C/C++ compiler other than C++Builder when performance is an issue: MS Visual Studio C/C++ (free), the GNU compiler suite's C/C++ parts (free), you name them...
Based on your posts, I think we have two solutions to speed up applications.
1. Add a part to FPC that compiles code to C instead of assembly, to use the power of the GCC compiler.
2. Write the speed-critical parts of the application in C, build them as a static library, and link them with the other parts (a sketch follows below).
We can use these solutions until FPC's optimization catches up.
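A hedged sketch of option 2 (the file names and the isPrime signature are assumptions, not taken from this thread): compile the hot routine with GCC into an object file and link it into the Pascal program.

{ prime.c contains a plain C function "int isPrime(int n)" and is built with
  "gcc -O2 -c prime.c". The Pascal side links the resulting object file and
  declares the routine as external: }
{$mode objfpc}
program UsesCPart;

{$L prime.o}  { pull in the GCC-compiled object file }

function isPrime(n: LongInt): LongInt; cdecl; external name 'isPrime';

var
  i, count: LongInt;
begin
  count := 0;
  for i := 2 to 1999 do
    if isPrime(i) <> 0 then
      Inc(count);
  WriteLn(count);
end.

Depending on what the C code uses internally, linking against the C runtime (for example with {$LINKLIB c}) may also be needed.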
Templates are macros in C++. So you can't ever call them. Unless you take the intermediate step of calling a C++ preprocessor. ...
Huh? Templates and macros are two different concepts and, typically, templates are processed by the compiler itself.
If you want C performance and VCL library use C++ Builder.
That's bad advice. You mean any C/C++ compiler other than C++Builder when performance is an issue: MS Visual Studio C/C++ (free), the GNU compiler suite's C/C++ parts (free), you name them...
C++ Builder now uses Clang for both win32 and 64. 8-)
Templates are macros in C++. So you can't ever call them. Unless you take the intermediate step of calling a C++ preprocessor. ...
Huh? Templates and macros are two different concepts and, typically, templates are processed by the compiler itself.
That's not true. E.g. at least GNU C++ calls out to CPP to resolve templates. They are never resolved at runtime. And even: they are macros per definition of the standard up until now.
What do you mean by CPP? The C preprocessor? So you are saying the C preprocessor handles C++ templates? This sounds like something that would be quite difficult to achieve, so any links to support your statement? :) You might want to do the same for the second proposition too.
The proof is btw that templates reside in header files. No code is generated.
Nope, sorry. They don't necessarily need to be placed in headers; they can go in ".cpp" files too. And even so, it doesn't mean no code is generated. For instance, "std::string" is a class made entirely via templates, but being a fully functional string class, it does involve some code generation, you know. :) And I doubt it's something that goes through the C preprocessor...
That's not true (different concepts, that is). E.g. at least GNU C++ calls out to CPP to resolve templates. They are never resolved at runtime. And even: they are macros per definition of the standard up until now.
This is half true. CPP is not called to resolve templates; g++ does it internally. You can try running g++ -E against a source file that has a template definition. However, you're correct that templates are not resolved at runtime: they get resolved at compile time by g++, not by cpp. Perhaps it was done that way in the past; I remember they had problems with multiple template instantiations from the same set of template parameters, and probably moved the processing stage to overcome that.
It's been a year, but retaking this original topic, I'm trying some different compiler options.
I'm compiling 4x4 matrix multiplication code with FreePascal 3.1.1 from the trunk, compiler flags (-a, -O4, -CpCoreI, -CfSSE42, -OpCoreI, -OoFASTMATH). For the following code:

type
  TMatrix = record
    M: array[0..3, 0..3] of Single;
    class operator Multiply(const A, B: TMatrix): TMatrix;
  end;

class operator TMatrix.Multiply(const A, B: TMatrix): TMatrix;
var
  I, J: Integer;
begin
  for J := 0 to 3 do
    for I := 0 to 3 do
      Result.M[J, I] := (A.M[J, 0] * B.M[0, I]) + (A.M[J, 1] * B.M[1, I]) +
        (A.M[J, 2] * B.M[2, I]) + (A.M[J, 3] * B.M[3, I]);
end;

The resulting assembly is:

.Lc1:
.seh_proc MAINFM$_$TMATRIX_$__$$_star$TMATRIX$TMATRIX$$TMATRIX
pushq %rbx
.seh_pushreg %rbx
.seh_endprologue
movl $-1,%ebx
.balign 8,0x90
.Lj5:
addl $1,%ebx
movl $-1,%r10d
.balign 8,0x90
.Lj8:
addl $1,%r10d
movq %r8,%rax
movl %r10d,%r9d
movl %ebx,%r11d
shlq $4,%r11
leaq (%rdx,%r11),%r11
movss (%r11),%xmm1
mulss (%rax,%r9,4),%xmm1
movl %r10d,%r9d
movss 4(%r11),%xmm0
mulss 16(%rax,%r9,4),%xmm0
addss %xmm1,%xmm0
movl %r10d,%r9d
movss 8(%r11),%xmm1
mulss 32(%rax,%r9,4),%xmm1
addss %xmm0,%xmm1
movl %r10d,%r9d
movss 12(%r11),%xmm0
mulss 48(%rax,%r9,4),%xmm0
addss %xmm1,%xmm0
movl %ebx,%eax
shlq $4,%rax
movl %r10d,%r9d
leaq (%rcx,%rax),%rax
movss %xmm0,(%rax,%r9,4)
cmpl $3,%r10d
jnge .Lj8
cmpl $3,%ebx
jnge .Lj5
popq %rbx
ret
.seh_endproc
.Lc2:
It appears that in the aforementioned code the loops were not unrolled, so the assembly has 2 loops and the elements are multiplied one at a time. I'm quite unhappy with that code.
movq %rcx,%rax
movq %rdx,%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss (%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 16(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 32(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 48(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,(%rax)
movq %rdx,%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 4(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 20(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 36(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 52(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,4(%rax)
movq %rdx,%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 8(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 24(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 40(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 56(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,8(%rax)
movq %rdx,%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 12(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 28(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 44(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 60(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,12(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss (%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 16(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 32(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 48(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,16(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 4(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 20(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 36(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 52(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,20(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 8(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 24(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 40(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 56(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,24(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 12(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 28(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 44(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 60(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,28(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss (%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 16(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 32(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 48(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,32(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 4(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 20(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 36(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 52(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,36(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 8(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 24(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 40(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 56(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,40(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 12(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 28(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 44(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 60(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,44(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss (%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 16(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 32(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 48(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,48(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 4(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 20(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 36(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 52(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,52(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
movss (%rcx),%xmm1
mulss 8(%r9),%xmm1
movss 4(%rcx),%xmm0
mulss 24(%r9),%xmm0
addss %xmm1,%xmm0
movss 8(%rcx),%xmm1
mulss 40(%r9),%xmm1
addss %xmm0,%xmm1
movss 12(%rcx),%xmm0
mulss 56(%r9),%xmm0
addss %xmm1,%xmm0
movss %xmm0,56(%rax)
leaq 48(%rdx),%rdx
movss (%rdx),%xmm1
mulss 12(%r8),%xmm1
movss 4(%rdx),%xmm0
mulss 28(%r8),%xmm0
addss %xmm1,%xmm0
movss 8(%rdx),%xmm1
mulss 44(%r8),%xmm1
addss %xmm0,%xmm1
movss 12(%rdx),%xmm0
mulss 60(%r8),%xmm0
addss %xmm1,%xmm0
movss %xmm0,60(%rax)
ret
The assembly code above is very long and painful. The question, given the title of this topic, is how to get code optimization comparable to what can be achieved in C++.
movq %rcx,%rax
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,4(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,8(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 12(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 28(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 44(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 60(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,12(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,16(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,20(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,24(%rax)
leaq 16(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 12(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 28(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 44(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 60(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,28(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,32(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,36(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,40(%rax)
leaq 32(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 12(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 28(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 44(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 60(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,44(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,48(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,52(%rax)
leaq 48(%rdx),%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,56(%rax)
leaq 48(%rdx),%rdx
vmovss (%rdx),%xmm0
vmulss 12(%r8),%xmm0,%xmm1
vmovss 4(%rdx),%xmm0
vmulss 28(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rdx),%xmm0
vmulss 44(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rdx),%xmm0
vmulss 60(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,60(%rax)
If you care so much about this specific routine, simply convert the generated C++ routine to inline Pascal?
I used to examine assembler output and optimize based on what the compiler thinks... I am often wrong... More often than not.
That's what it is there for.
It is all about a SPECIFIC C++ compiler versus a SPECIFIC Pascal compiler.
I've asked a concrete question backed up by evidence and on-topic to this thread. Note that the Clang results are consistent with GCC and MSVC in the sense that they are considerably more optimized than the output produced by FreePascal - on the links I've provided you can change compilers/platforms (Clicky for MSVC 2017 (https://godbolt.org/g/cNdK9O)). Blind fanboyism in saying that "this is nothing" doesn't really help anyone. As a relief for you, the aforementioned code compiled to x64 in Delphi 10.1 is actually quite a bit bigger than the one produced by FreePascal, but that's off-topic.
That means nothing in the context of a language. Plz compare like for likes.
If you care so much about this specific routine, simply convert the generated C++ routine to inline Pascal?
I've got a whole cross-platform library full of routines like this for 2D, 3D and 4D vectors, 3x3 and 4x4 matrices and quaternions. Rewriting all the methods in assembly for all platforms and target CPUs is way too much effort, so I was hoping there would be a compiler flag I missed, or maybe re-ordering the code somehow to help the compiler optimize it. Any suggestions are definitely appreciated.
movq %rcx,%rax
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 4(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 20(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 36(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 52(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,4(%rax)
movq %rdx,%rcx
movq %r8,%r9
vmovss (%rcx),%xmm0
vmulss 8(%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 24(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 40(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 56(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,8(%rax)
vmovss (%rdx),%xmm0
vmulss 12(%r8),%xmm0,%xmm1
vmovss 4(%rdx),%xmm0
vmulss 28(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rdx),%xmm0
vmulss 44(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rdx),%xmm0
vmulss 60(%r8),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,12(%rax)
ret
The aforementioned FreePascal assembly output has 16 multiplication instructions, versus 8 multiplications in the Clang output (https://godbolt.org/g/WiFKwb) or only 4 multiplications in the GCC output (https://godbolt.org/g/n3nGB5). If there is a way to tweak the code to improve the assembly, that would be great.
so I was hoping there would be a compiler flag I missed, or maybe re-ordering the code somehow to help the compiler optimize it. Any suggestions are definitely appreciated.
The equivalent C++ code produces around 53 instructions vs 224 from FPC (more than 4 times smaller and likely similarly faster):
From my experience (using FPC since 2005), FPC is a cross-platform compiler, and it is apparently relatively easy to extend to new platforms due to the way it is designed and written. It is also designed to be easy to maintain. With all that comes a trade-off, and that trade-off is performance. It is clear from many examples that FPC doesn't generate very good performing code, but it does give you consistent cross-platform compilation support. Delphi, Kylix, GCC, Clang etc. all run circles around FPC. This should not come as a surprise considering how small the development team is compared to GCC or Clang, or that Delphi and Kylix were designed with one CPU target in mind.
# [12] Result[0] := V[0] * M.M[0, 0] + V[1] * M.M[1, 0] + V[2] * M.M[2, 0] + V[3] * M.M[3, 0];
vmovss (%rcx),%xmm0
vmulss (%r9),%xmm0,%xmm1
vmovss 4(%rcx),%xmm0
vmulss 16(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 8(%rcx),%xmm0
vmulss 32(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vmovss 12(%rcx),%xmm0
vmulss 48(%r9),%xmm0,%xmm0
vaddss %xmm1,%xmm0,%xmm0
vmovss %xmm0,(%rax)
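For reference, a sketch of the Pascal routine the "# [12]" comment above appears to belong to; only the Result[0] line is quoted by the compiler, the remaining rows follow the same pattern, and the names TVector and Transform are assumptions (the thread only shows TMatrix):

type
  TVector = array[0..3] of Single;

function Transform(const V: TVector; const M: TMatrix): TVector;
begin
  Result[0] := V[0] * M.M[0, 0] + V[1] * M.M[1, 0] + V[2] * M.M[2, 0] + V[3] * M.M[3, 0];
  Result[1] := V[0] * M.M[0, 1] + V[1] * M.M[1, 1] + V[2] * M.M[2, 1] + V[3] * M.M[3, 1];
  Result[2] := V[0] * M.M[0, 2] + V[1] * M.M[1, 2] + V[2] * M.M[2, 2] + V[3] * M.M[3, 2];
  Result[3] := V[0] * M.M[0, 3] + V[1] * M.M[1, 3] + V[2] * M.M[2, 3] + V[3] * M.M[3, 3];
end;

Written like this it needs 16 scalar multiplications, which matches the instruction count mentioned above; Clang and GCC get that down to 8 or 4 multiply instructions, presumably by using packed (SIMD) multiplies.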
Please note that the compiler does not unroll loops by default (except for MIPS at -O3 it seems, though I don't know why); you need to enable it manually with -OoLOOPUNROLL. But even then the contents of the loop you have are considered too complex, as only rather small loop bodies will be unrolled (though one might argue that the compiler currently considers this loop body as more complex than it probably should).
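Given that, one workaround that sidesteps loop unrolling entirely is to unroll the inner loop of the Multiply operator by hand (a sketch based on the TMatrix code quoted earlier; whether it really beats the looped version has to be checked against the -a output):

class operator TMatrix.Multiply(const A, B: TMatrix): TMatrix;
var
  J: Integer;
begin
  { Inner loop over I written out by hand; only the outer loop remains. }
  for J := 0 to 3 do
  begin
    Result.M[J, 0] := A.M[J, 0] * B.M[0, 0] + A.M[J, 1] * B.M[1, 0] + A.M[J, 2] * B.M[2, 0] + A.M[J, 3] * B.M[3, 0];
    Result.M[J, 1] := A.M[J, 0] * B.M[0, 1] + A.M[J, 1] * B.M[1, 1] + A.M[J, 2] * B.M[2, 1] + A.M[J, 3] * B.M[3, 1];
    Result.M[J, 2] := A.M[J, 0] * B.M[0, 2] + A.M[J, 1] * B.M[1, 2] + A.M[J, 2] * B.M[2, 2] + A.M[J, 3] * B.M[3, 2];
    Result.M[J, 3] := A.M[J, 0] * B.M[0, 3] + A.M[J, 1] * B.M[1, 3] + A.M[J, 2] * B.M[2, 3] + A.M[J, 3] * B.M[3, 3];
  end;
end;

This mainly removes the runtime index arithmetic on I and the inner-loop bookkeeping, which is where the looped assembly above spends a good part of its instructions.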
Btw, maybe the code I posted yesterday can be rearranged so that one could use something like:
r1:=v*m.m[0]; r2:=v*m.m[1]; r3:=v*m.m[2]; r4:=v*m.m[3]; r1:=r1+r2; r3:=r3+r4; r1:=r1+r3;
movq %rcx,%rax
movdqa (%rdx),%xmm0
mulps (%r8),%xmm0
movdqa (%rdx),%xmm0
mulps 16(%r8),%xmm0
movdqa (%rdx),%xmm0
mulps 32(%r8),%xmm0
movdqa (%rdx),%xmm0
mulps 48(%r8),%xmm0
movdqa (%rsp),%xmm0
addps 16(%rsp),%xmm0
movdqa 32(%rsp),%xmm0
addps 48(%rsp),%xmm0
movdqa (%rsp),%xmm0
addps 32(%rsp),%xmm0
vmovups (%rsp),%xmm0
vmovups %xmm0,(%rax)
leaq 72(%rsp),%rsp
ret
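The "r1 := v*m.m[0]" style in the suggestion above presupposes a vector type with element-wise operators. A minimal sketch of such helpers (the name TVec4 and the idea of storing the matrix as four TVec4 rows are assumptions, not code from this thread):

{$mode objfpc}
type
  TVec4 = array[0..3] of Single;

operator * (const A, B: TVec4) R: TVec4;
var
  K: Integer;
begin
  { element-wise product }
  for K := 0 to 3 do
    R[K] := A[K] * B[K];
end;

operator + (const A, B: TVec4) R: TVec4;
var
  K: Integer;
begin
  { element-wise sum }
  for K := 0 to 3 do
    R[K] := A[K] + B[K];
end;

With the matrix rows stored as TVec4 values, expressions like r1 := v * m.m[0] become valid Pascal; whether the compiler then emits the packed mulps/addps seen in the listing above is exactly the optimization question this thread is about.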