Author Topic: Free Pascal vs C++: The First Results Are In  (Read 26720 times)

PascalDragon

  • Hero Member
  • *****
  • Posts: 6315
  • Compiler Developer
Re: Free Pascal vs C++: The First Results Are In
« Reply #45 on: December 31, 2019, 10:59:03 am »
Quote
FillQWord is likely to be the least optimized of the Fill* routines. I'd suggest you use FillChar instead, which is usually the best optimized one.
Haha! I guess I shouldn't believe everything I read! Per the manual for FillByte:
Code: Text
When the size of the memory location to be filled out is a multiple of 2 bytes, it is better
to use Fillword, and if it is a multiple of 4 bytes it is better to use FillDWord, these routines are
optimized for their respective sizes.
I assume that applies to FillQWord as well. However, FillQWord did slow things down, so I'll give FillByte a whirl.
Well, optimized for the size does not necessarily mean optimized overall as well. ;) In fact, for FillQWord usually no assembly implementation exists. Even FillDWord and FillWord don't have an assembly implementation on most targets either. Only FillByte is usually implemented in assembly.

FillQWord is likely to be the least optimized of the Fill* routines. I'd suggest you use FillChar instead, which is usually the best optimized one.
Wow! I just tested it. FillByte/Char is about 1.5x faster!

Who do I call to complain?  :D
Mere complaining probably won't help (unless someone then happens to feel motivated to look at it). Either try to speed up the generic FillQWord yourself or provide an assembly implementation for it. ;)
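For anyone who wants to reproduce the comparison, here is a minimal timing sketch. The buffer size and round count are assumptions chosen only to make the difference measurable, and the result will vary by target and FPC version. Note that FillChar can only stand in for FillQWord when every byte of the fill value is the same, as with zero here.

```pascal
program FillBench;
{$mode objfpc}{$H+}

uses
  SysUtils;

const
  Count  = 1024 * 1024;  // number of 8-byte elements (assumed size)
  Rounds = 200;          // repetitions (assumed)

var
  Buf: array of QWord;
  i: Integer;
  t0, tQWord, tChar: QWord;
begin
  SetLength(Buf, Count);

  // Time the generic 8-byte fill.
  t0 := GetTickCount64;
  for i := 1 to Rounds do
    FillQWord(Buf[0], Count, 0);
  tQWord := GetTickCount64 - t0;

  // Time the byte fill over the same memory; equivalent only because
  // all eight bytes of the fill value are identical (zero).
  t0 := GetTickCount64;
  for i := 1 to Rounds do
    FillChar(Buf[0], Count * SizeOf(QWord), 0);
  tChar := GetTickCount64 - t0;

  WriteLn('FillQWord: ', tQWord, ' ms');
  WriteLn('FillChar:  ', tChar, ' ms');
end.
```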

To be honest, abusing variant records/unions to escape typing in Pascal is at least 40 years old. If not 50.

Keep in mind that to use such tricks the result first has to be stored into memory (into the union record). This is already a bottleneck. A good intrinsic that operates on registers should easily beat it.
In this specific case a typecast would work as well as it already resides in memory:
Code: Pascal
if TDoubleRec(WaveL[i][j]).B[63] then WaveL[i][j] := Sign(WaveL[i][j]) * 2 - WaveL[i][j];
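To show the idea in a self-contained form: a value that already sits in memory can be reinterpreted in place instead of being copied into a variant record. The sketch below does not use the TDoubleRec from the line above; it assumes a hypothetical helper, SignBitSet, that reads bit 63 of an IEEE 754 double through a PQWord pointer cast.

```pascal
program SignBitSketch;
{$mode objfpc}{$H+}

// Hypothetical helper (not from the thread): reads the sign bit (bit 63)
// of a double by reinterpreting its memory as a QWord.
function SignBitSet(d: Double): Boolean;
begin
  Result := (PQWord(@d)^ shr 63) <> 0;
end;

begin
  WriteLn('sign bit of  1.5 set: ', SignBitSet(1.5));
  WriteLn('sign bit of -1.5 set: ', SignBitSet(-1.5));
end.
```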

As an aside: Would this be considered a bug? In my test, Trunc was 2x faster where they should be more or less identical in speed.
The problem is that on all the x86 targets except Win64 the floating point related functions (e.g. Trunc, Frac, etc.; Floor is implemented using Trunc and Frac) are implemented using the x87 FPU instead of SSE, because the functions work with the highest available precision, which is Extended, and that requires the x87 FPU. How to implement this in a satisfying way that also allows the use of SSE for these functions is still open to investigation.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12645
  • FPC developer.
Re: Free Pascal vs C++: The First Results Are In
« Reply #46 on: December 31, 2019, 12:59:54 pm »
FillQWord is likely to be the least optimized of the Fill* routines. I'd suggest you use FillChar instead, which is usually the best optimized one.
Wow! I just tested it. FillByte/Char is about 1.5x faster!

Who do I call to complain?  :D

Keep in mind that qword must store 8 possibly different bytes and the incoming array might not be aligned.

Assuming fillqword is fillbyte with 8 times the size is simply wrong.  You can't easily do the standard trick of fillbyte (do a few alignment bytes with a simple store, and then the bulk with SSE registers, and then some rest bytes in a simple way again), since then you are out of alignment.
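The three-phase trick described above can be sketched in plain Pascal. This is only an illustration under assumptions: SketchFillByte is a hypothetical name, and the bulk phase uses aligned 8-byte stores where an optimized routine would use SSE registers. It works for a byte fill because the repeated pattern looks the same at any offset; with an 8-byte value, the head bytes would shift the pattern relative to the aligned stores.

```pascal
program FillSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: byte fill in three phases (head, bulk, tail).
procedure SketchFillByte(p: PByte; count: SizeInt; value: Byte);
var
  pattern: QWord;
begin
  // Phase 1: single-byte stores until p is 8-byte aligned.
  while (count > 0) and ((PtrUInt(p) and 7) <> 0) do
  begin
    p^ := value;
    Inc(p);
    Dec(count);
  end;
  // Phase 2: bulk of the work with aligned 8-byte stores.
  pattern := QWord(value) * QWord($0101010101010101);
  while count >= 8 do
  begin
    PQWord(p)^ := pattern;
    Inc(p, 8);
    Dec(count, 8);
  end;
  // Phase 3: remaining tail bytes, one at a time.
  while count > 0 do
  begin
    p^ := value;
    Inc(p);
    Dec(count);
  end;
end;

var
  Buf: array[0..63] of Byte;
  i: Integer;
  Ok: Boolean;
begin
  FillChar(Buf, SizeOf(Buf), 0);
  SketchFillByte(@Buf[3], 40, $AB);  // start at an odd offset on purpose
  // Only bytes 3..42 should have been written.
  Ok := True;
  for i := 0 to High(Buf) do
    if (Buf[i] = $AB) <> ((i >= 3) and (i < 43)) then
      Ok := False;
  WriteLn('fill correct: ', Ok);
end.
```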

syntonica

  • Full Member
  • ***
  • Posts: 120
Re: Free Pascal vs C++: The First Results Are In
« Reply #47 on: December 31, 2019, 07:22:56 pm »
Well, optimized for the size does not necessarily mean optimized overall as well. ;) In fact, for FillQWord usually no assembly implementation exists. Even FillDWord and FillWord don't have an assembly implementation on most targets either. Only FillByte is usually implemented in assembly.
Mere complaining probably won't help (unless someone then happens to feel motivated to look at it). Either try to speed up the generic FillQWord yourself or provide an assembly implementation for it. ;)
I don't recall exactly how it all works, but if the memory manager is working properly, all of my data structures should be returned 8-byte aligned, or whichever fraction thereof is required. But if FillByte is already optimized to use xmm registers and vector moves, then that is what I shall use. It appears that Move also gets the same benefits.
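Rather than trusting memory on this, the alignment of a heap block can simply be checked. A small sketch, assuming a hypothetical helper IsAligned and an arbitrary block size; what it prints depends on the memory manager and target.

```pascal
program AlignCheck;
{$mode objfpc}{$H+}

// Hypothetical helper: True if P is a multiple of Boundary.
function IsAligned(P: Pointer; Boundary: PtrUInt): Boolean;
begin
  Result := (PtrUInt(P) mod Boundary) = 0;
end;

var
  P: Pointer;
begin
  GetMem(P, 1024);  // block size is arbitrary
  WriteLn('8-byte aligned:  ', IsAligned(P, 8));
  WriteLn('16-byte aligned: ', IsAligned(P, 16));
  FreeMem(P);
end.
```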

Quote
The problem is that on all the x86 targets except Win64 the floating point related functions (e.g. Trunc, Frac, etc.; Floor is implemented using Trunc and Frac) are implemented using the x87 FPU instead of SSE, because the functions work with the highest available precision, which is Extended, and that requires the x87 FPU. How to implement this in a satisfying way that also allows the use of SSE for these functions is still open to investigation.
I feel left out, being all 64-bit and all. I would think all the Math functions, whether intrinsic in the compiler, or in the library, would be optimized for each processor/coprocessor level. Or, thereabouts. Especially on the basics. I guess I should never assume.

I will have to then test my fast TanH and aTan approximator functions to see how they compare against the library functions. In C++, they are faster and provide "good enough" accuracy.
« Last Edit: December 31, 2019, 07:39:53 pm by syntonica »

syntonica

  • Full Member
  • ***
  • Posts: 120
Re: Free Pascal vs C++: The First Results Are In
« Reply #48 on: December 31, 2019, 07:39:09 pm »
Keep in mind that qword must store 8 possibly different bytes and the incoming array might not be aligned.

Assuming fillqword is fillbyte with 8 times the size is simply wrong.  You can't easily do the standard trick of fillbyte (do a few alignment bytes with a simple store, and then the bulk with SSE registers, and then some rest bytes in a simple way again), since then you are out of alignment.
Bah! Making me think about this stuff!  :D

In my C++ code, all my structs are manually sorted and aligned so any padding comes at the very end, after the Booleans, which are 2 or 4-byte. I spent a day attending to all this and then promptly forgot everything about how it all works. I don't recall the requirements for 8-byte alignment and just trust that the memory manager knows.

Looks like I'm going to be learning far more assembly than I ever wanted... I haven't used it since the Z80 days where I didn't have to take off my shoes to count the number of 16-bit registers.

Thanks for the explanation!

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Free Pascal vs C++: The First Results Are In
« Reply #49 on: December 31, 2019, 07:53:16 pm »

I will have to then test my fast TanH and aTan approximator functions to see how they compare against the library functions. In C++, they are faster and provide "good enough" accuracy.

TanH and ArcTan2 are both located in unit math.

Tanh leads to the function exp, which is an internal FPC procedure and hopefully optimized.
ArcTan2 is written completely in assembler, only 4 lines; the job is done by the processor with the fpatan instruction.

Linux 64 Bit

Winni

syntonica

  • Full Member
  • ***
  • Posts: 120
Re: Free Pascal vs C++: The First Results Are In
« Reply #50 on: January 01, 2020, 01:14:52 am »

I will have to then test my fast TanH and aTan approximator functions to see how they compare against the library functions. In C++, they are faster and provide "good enough" accuracy.

TanH and ArcTan2 are both located in unit math.

Tanh leads to the function exp, which is an internal FPC procedure and hopefully optimized.
ArcTan2 is written completely in assembler, only 4 lines; the job is done by the processor with the fpatan instruction.

Linux 64 Bit

Winni
A quick test reveals that my FastTanH is about 12x faster than the library version.  O:-)
The FastAtan is about 2x slower than the library version.  >:D

Guess which one is being discarded?
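The actual FastTanH is not shown in the thread. For readers who want to try this kind of comparison, here is a sketch with a stand-in: FastTanhApprox is a hypothetical rational approximation, clamped to ±1 beyond |x| = 3, and it is off by a few hundredths at worst. The program measures its error against Tanh from the Math unit.

```pascal
program FastTanhSketch;
{$mode objfpc}{$H+}

uses
  Math;

// Hypothetical stand-in, NOT the poster's function:
// tanh(x) ~ x * (27 + x^2) / (27 + 9 * x^2), clamped to +/-1 for |x| >= 3.
function FastTanhApprox(x: Double): Double;
var
  x2: Double;
begin
  if x >= 3.0 then
    Exit(1.0);
  if x <= -3.0 then
    Exit(-1.0);
  x2 := x * x;
  Result := x * (27.0 + x2) / (27.0 + 9.0 * x2);
end;

var
  i: Integer;
  x, err, maxErr: Double;
begin
  // Sweep [-5, 5] in steps of 0.01 and record the worst deviation
  // from the library function.
  maxErr := 0.0;
  for i := -500 to 500 do
  begin
    x := i / 100.0;
    err := Abs(FastTanhApprox(x) - Tanh(x));
    if err > maxErr then
      maxErr := err;
  end;
  WriteLn('max abs error on [-5, 5]: ', maxErr:0:4);
end.
```

Whether that accuracy is "good enough" depends on the use; timing it against the library version works the same way as any other small benchmark.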

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Free Pascal vs C++: The First Results Are In
« Reply #51 on: January 01, 2020, 02:16:50 am »
Hi!

Arctan2 is just these lines; in modern times it's all up to the processor:

Code: ASM
function arctan2(y,x : float) : float;assembler;
  asm
     fldt y
     fldt x
     fpatan
     fwait
  end;

Cheers
Winni

Shpend

  • Full Member
  • ***
  • Posts: 167
Re: Free Pascal vs C++: The First Results Are In
« Reply #52 on: January 01, 2020, 02:51:58 am »
So have you gotten more reliable code performance by now with all the suggested performance tips? I would be glad to know!

Happy new year, btw!

syntonica

  • Full Member
  • ***
  • Posts: 120
Re: Free Pascal vs C++: The First Results Are In
« Reply #53 on: January 01, 2020, 06:06:12 am »
So have you gotten more reliable code performance by now with all the suggested performance tips? I would be glad to know!

Happy new year, btw!
Most of the tips are helpful, but not all. At least in my case. It's best to make little tests to see what the performance really is for your use case.  However, unless you do a lot of something, you may not see any noticeable difference.  Out of curiosity, I tested Inc(x) against x := x + 1. At about 3,000,000 iterations, Inc(x) had a couple of milliseconds edge!  :o
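A "little test" of this kind might look like the sketch below. The iteration count follows the figure above; the variable names and the use of GetTickCount64 are assumptions, and millisecond ticks are coarse, so a difference of a couple of milliseconds is close to the noise floor.

```pascal
program IncBench;
{$mode objfpc}{$H+}

uses
  SysUtils;

const
  Iterations = 3000000;  // matches the count mentioned above

var
  i: LongInt;
  a, b: Int64;
  t0, tInc, tAdd: QWord;
begin
  // Variant 1: Inc(a)
  a := 0;
  t0 := GetTickCount64;
  for i := 1 to Iterations do
    Inc(a);
  tInc := GetTickCount64 - t0;

  // Variant 2: b := b + 1
  b := 0;
  t0 := GetTickCount64;
  for i := 1 to Iterations do
    b := b + 1;
  tAdd := GetTickCount64 - t0;

  // Print the counters too, so the results of both loops are actually used.
  WriteLn('Inc(x):     ', tInc, ' ms (result ', a, ')');
  WriteLn('x := x + 1: ', tAdd, ' ms (result ', b, ')');
end.
```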

Still 3 more hours of 2019 left, according to my space-time nexus. I'm going to enjoy them lying in bed with a cold.  :'(

syntonica

  • Full Member
  • ***
  • Posts: 120
Re: Free Pascal vs C++: The First Results Are In
« Reply #54 on: January 07, 2020, 03:54:00 am »
Update:

I was finally able to get my GUI up and limping, so I was able to load my two test tracks that use several instances of my plugin (6 for one, 8 for the other).

Test6: C++ Ofast = 11.5%  FPC -O2 = 19.0%  FPC NoOpt = 24%
Test8: C++ Ofast = 12.0%  FPC -O2 = 17.0%  FPC NoOpt = 25%

This is with little human intervention in optimizing the code by hand. For loops have been changed to Fill/Moves where possible, with nothing more.  Not totally horrible performance!  Once I get my GUI rock solid, I'll start in on some hand-optimization.

One issue I found: When MIDI notes come in, a New Note record was created. When the note was played out, the Note record was destroyed.  I was getting random blowups, stutters and CPU races.  I switched out to using a pool of pre-made notes instead and playback is now rock solid.  Any tips on how I can play nicely with the Memory Manager? Or, is using a pool the way to go?
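For reference, a pool of pre-made notes can be as simple as a fixed array threaded into a free list, so nothing is allocated or freed during playback. This is a minimal sketch, not the plugin's actual code: TNote's fields, PoolSize and the routine names are all hypothetical, and it is not thread-safe.

```pascal
program NotePoolSketch;
{$mode objfpc}{$H+}

const
  PoolSize = 64;  // assumed maximum number of simultaneous notes

type
  PNote = ^TNote;
  TNote = record
    Pitch, Velocity: Byte;  // hypothetical payload
    NextFree: PNote;        // links unused slots into a free list
  end;

var
  Pool: array[0..PoolSize - 1] of TNote;
  FreeList: PNote = nil;

// Thread all slots into the free list once, before playback starts.
procedure InitPool;
var
  i: Integer;
begin
  FreeList := nil;
  for i := High(Pool) downto 0 do
  begin
    Pool[i].NextFree := FreeList;
    FreeList := @Pool[i];
  end;
end;

// Take a slot from the pool; returns nil when the pool is exhausted.
function AcquireNote: PNote;
begin
  Result := FreeList;
  if Result <> nil then
    FreeList := Result^.NextFree;
end;

// Hand a slot back; no call into the memory manager is involved.
procedure ReleaseNote(n: PNote);
begin
  n^.NextFree := FreeList;
  FreeList := n;
end;

var
  A, B, C: PNote;
begin
  InitPool;
  A := AcquireNote;
  B := AcquireNote;
  ReleaseNote(A);
  C := AcquireNote;  // gets the slot that was just released
  WriteLn('slot reused: ', C = A);
  WriteLn('live notes distinct: ', B <> C);
end.
```

Acquire and release are constant-time and never touch the heap, which is the usual reason to prefer a pool where timing matters.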

PascalDragon

  • Hero Member
  • *****
  • Posts: 6315
  • Compiler Developer
Re: Free Pascal vs C++: The First Results Are In
« Reply #55 on: January 07, 2020, 09:04:21 am »
Hi!

Arctan2 is just these lines; in modern times it's all up to the processor:

Code: ASM
function arctan2(y,x : float) : float;assembler;
  asm
     fldt y
     fldt x
     fpatan
     fwait
  end;
The problem here is that it uses the x87 coprocessor instead of SSE which results in a slowdown due to conversion from SSE to x87 and back.

Laksen

  • Hero Member
  • *****
  • Posts: 802
    • J-Software
Re: Free Pascal vs C++: The First Results Are In
« Reply #56 on: January 07, 2020, 10:03:46 am »
Have you tried the heap checker to debug your note record allocation problem? It's just a compiler option

Also why are you not using O3?
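For anyone who has not used the heap checker: the -gh compiler option links in the heaptrc unit, which reports unfreed blocks when the program exits. A small sketch with a deliberate leak follows; the TNote record here is hypothetical and only stands in for the real note record.

```pascal
program LeakDemo;
{$mode objfpc}{$H+}

// Build with heap tracing enabled, e.g.:  fpc -gh -gl leakdemo.pas
// (-gh links in the heaptrc unit, -gl adds line info to its report).

type
  PNote = ^TNote;
  TNote = record
    Pitch, Velocity: Byte;  // hypothetical fields
  end;

var
  N: PNote;
begin
  New(N);
  N^.Pitch := 60;
  N^.Velocity := 100;
  WriteLn('note ', N^.Pitch, ' allocated');
  // Dispose(N) is deliberately missing, so heaptrc should report
  // one unfreed block when the program exits.
end.
```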

Xor-el

  • Sr. Member
  • ****
  • Posts: 404
Re: Free Pascal vs C++: The First Results Are In
« Reply #57 on: January 07, 2020, 10:53:36 am »
Also why are you not using O3?
Or even O4? I have found it to be pretty stable.

syntonica

  • Full Member
  • ***
  • Posts: 120
Re: Free Pascal vs C++: The First Results Are In
« Reply #58 on: January 07, 2020, 07:14:54 pm »
Have you tried the heap checker to debug your note record allocation problem? It's just a compiler option

Also why are you not using O3?
I haven’t tried heaptrc yet. It's new to me and I'm happier just getting everything up and running for now. It's not so much a debugging problem as a "how does the memory manager work, and how could I possibly be overloading it" problem. If there was an issue in my note allocation/deallocation, it would definitely show with a finite number of notes available.

O3 and O4 have varying effects on speed, making one test faster, but the other test slower! I'll be looking at the individual components and their effects as clues as to what I can change in the code. I have a similar conundrum in C++. Using Os is on par with using Ofast. Unrolling loops, turning on FastMath, and enabling LTO all tend to cause slower code.

PascalDragon

  • Hero Member
  • *****
  • Posts: 6315
  • Compiler Developer
Re: Free Pascal vs C++: The First Results Are In
« Reply #59 on: January 08, 2020, 09:15:08 am »
Also why are you not using O3?
Or even O4? I have found it to be pretty stable.
Please note that -O4 might cause side effects if one relies on certain, undocumented behaviour (e.g. it activates field reordering on classes and if you rely on "cracker classes" those will fail).

 
