Topic: Free Pascal vs C++: The First Results Are In

syntonica

  • Full Member
  • Posts: 120
Re: Free Pascal vs C++: The First Results Are In
« Reply #15 on: December 30, 2019, 03:30:14 am »
Quote
Hi!

I don't know what your audio plugin is doing, but I have a home-grown audio player (work in progress) that uses BASS: 4.6% to 5.9% CPU while playing.

Even Amarok, a badly coded player full of useless features, needs 12% to 14% while playing.

Values measured with top.

Ryzen, 4 cores

Winni
This isn't a program running an audio codec, but a synthesizer plugin. Synth plugins can eat anywhere from 5% to 50% of my CPU playing just one note, depending on the plugin. Hence my happiness if I can drop from 12% to 6%.

Currently I'm fighting with FPC 3.3.1 (trunk), which I got compiled, but now I need to integrate it without clobbering my 3.0.4 installation too much. Or do I?

syntonica

  • Full Member
  • Posts: 120
« Reply #16 on: December 30, 2019, 04:16:57 am »
Well, that was much ado about nothing. :/  I compiled using 3.3.1 and got the exact same results...

So, I'm open to suggestions on how to shave off the percentages. Thanks!

mr-highball

  • Full Member
  • Posts: 233
« Reply #17 on: December 30, 2019, 04:53:08 am »
I don't have much experience in the audio plugin space, but for future reference, fpcupdeluxe may be a good friend when you're trying to keep multiple versions of FPC/Lazarus side by side, or just keep trunk updated:
https://github.com/LongDirtyAnimAlf/fpcupdeluxe
(Go to the Releases tab and get the binary for your platform.)

Without seeing some of the code (or sections that are the biggest bottleneck) it may be tough to get some useful answers.
Haven't really used it much, but there are a few profiling options
https://wiki.lazarus.freepascal.org/Profiling

syntonica

  • Full Member
  • Posts: 120
« Reply #18 on: December 30, 2019, 05:07:51 am »
Quote
Without seeing some of the code (or sections that are the biggest bottleneck) it may be tough to get some useful answers.
Haven't really used it much, but there are a few profiling options
https://wiki.lazarus.freepascal.org/Profiling
Thanks, but this is C++ code that I've translated to Pascal. It's already been through the grinder to be optimized in C++.

mr-highball

  • Full Member
  • Posts: 233
« Reply #19 on: December 30, 2019, 05:14:09 am »
¯\_(ツ)_/¯ Just sayin': without seeing the slow parts, there's little anyone can offer here that isn't a shot in the dark. Not trying to be rude, but code speaks louder than a forum post.
Also, just because it's optimized for C++ doesn't mean those same optimizations will carry over to FPC.

syntonica

  • Full Member
  • Posts: 120
« Reply #20 on: December 30, 2019, 05:57:09 am »
Here's the code for a wavefolder, one of my tight loops. I originally used it as an example to test against C:
Code: Pascal
begin
  l := Floor(Knob2 * 1023);
  for j := 0 to 1023 do
    begin
      WaveL[i][j] := (WaveL[Wave1][j] * (1 - Knob1) + WaveL[Wave2][j + l] * Knob1) * 2;
      WaveR[i][j] := (WaveR[Wave1][j] * (1 - Knob1) + WaveR[Wave2][j + l] * Knob1) * 2;
      if (Abs(WaveL[i][j]) > 1) then WaveL[i][j] := Sign(WaveL[i][j]) * 2 - WaveL[i][j];
      if (Abs(WaveR[i][j]) > 1) then WaveR[i][j] := Sign(WaveR[i][j]) * 2 - WaveR[i][j];
    end;
end;
Every value except the indices i, j and l is a Double. The knobs range from 0.0 to 1.0: Knob1 handles the mix and Knob2 handles the phase.

In my original test, C at -Ofast came in at about 18 ms, and Pascal at -O4, as coded, at 18 to 20 ms. I can live with the 10% difference. According to clang, the majority of these loops can't be vectorized or won't benefit from it, which is why I'm not worried about bottlenecks. Oddly, my attempts so far at tricking the compiler only seem to result in slower code! That's a big bravo for the compiler writers.

I've just looked at the assembly, and it appears I'm not getting any auto-vectorization, so maybe I'll have to convert some of the loops manually. For some reason I thought the optimizer would handle the easy cases for me. Looks like another rabbit hole to dive down, on top of learning FP and command-line compilation and debugging... %)
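For readers following the C side of the comparison, here is a sketch of how the wavefolder loop above might look in C. It is not the poster's actual C source (which isn't shown); the function name, parameters and TABLE_LEN are hypothetical stand-ins for WaveL[i], WaveL[Wave1], WaveL[Wave2] and so on. Note that the second source table must hold at least 2047 samples, because j + l can reach 1023 + 1023.

```c
#define TABLE_LEN 1024

/* Literal C rendering of the stereo wavefolder loop (hypothetical names).
   outL/outR stand in for WaveL[i]/WaveR[i], aL/aR for WaveL[Wave1]/WaveR[Wave1],
   bL/bR for WaveL[Wave2]/WaveR[Wave2]. bL and bR need at least
   2 * TABLE_LEN - 1 samples because of the j + l phase offset. */
void wavefold_stereo(double *outL, double *outR,
                     const double *aL, const double *aR,
                     const double *bL, const double *bR,
                     double knob1, double knob2)
{
    /* knob2 >= 0, so truncating to int gives the same result as Floor */
    int l = (int)(knob2 * (TABLE_LEN - 1));
    for (int j = 0; j < TABLE_LEN; j++) {
        /* knob1 crossfades between the two tables; the result is doubled */
        outL[j] = (aL[j] * (1.0 - knob1) + bL[j + l] * knob1) * 2.0;
        outR[j] = (aR[j] * (1.0 - knob1) + bR[j + l] * knob1) * 2.0;
        /* same test as Abs(x) > 1, then fold: x := Sign(x) * 2 - x */
        if (outL[j] > 1.0 || outL[j] < -1.0)
            outL[j] = (outL[j] > 0.0 ? 2.0 : -2.0) - outL[j];
        if (outR[j] > 1.0 || outR[j] < -1.0)
            outR[j] = (outR[j] > 0.0 ? 2.0 : -2.0) - outR[j];
    }
}
```

The ternary replaces Sign here; since the fold only runs when |x| > 1, the zero case of Sign never matters.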

syntonica

  • Full Member
  • Posts: 120
« Reply #21 on: December 30, 2019, 07:28:43 am »
Quote
Code: Pascal
for i := 0 to n do
  x := x + foo[i];
may be better written as
Code: Pascal
pfoo := @foo[0];
for i := 0 to n do begin
  x := x + pfoo^;
  Inc(pfoo);
end;
Just for grins, since I'm still learning how to read assembly, I checked both versions. The pointer version looks cleaner but, by my rough measurements, isn't noticeably faster. Still, if I do this to all my loops, it may add up.

Array version:
Code: ASM
Lj614:
        addl    $1,%ebx
# [262] Outputs[0][j] := 0;
        movq    48(%rsp),%rax
        movq    (%rax),%rcx
        movslq  %ebx,%rdx
        movq    _$SEISMICX$_Ld3@GOTPCREL(%rip),%rax
        movl    (%rax),%eax
        movl    %eax,(%rcx,%rdx,4)
# [263] Outputs[1][j] := 0;
        movq    48(%rsp),%rax
        movq    8(%rax),%rcx
        movslq  %ebx,%rdx
        movq    _$SEISMICX$_Ld3@GOTPCREL(%rip),%rax
        movl    (%rax),%eax
        movl    %eax,(%rcx,%rdx,4)
        cmpl    %ebx,%esi
        jg      Lj614

Pointer version:
Code: ASM
Lj618:
        addl    $1,%ebx
# [264] outL^ := 0;
        movq    _$SEISMICX$_Ld5@GOTPCREL(%rip),%rax
        movq    (%rax),%rax
        movq    %rax,(%rdx)
# [265] outR^ := 0;
        movq    _$SEISMICX$_Ld5@GOTPCREL(%rip),%rax
        movq    (%rax),%rax
        movq    %rax,(%rcx)
# [266] Inc(outL);
        addq    $8,%rdx
# [267] Inc(OutR);
        addq    $8,%rcx
        cmpl    %ebx,%esi
        jg      Lj618
It still zeroes %rax twice, though, and the wrong way (by loading a zero constant from memory each time). ;D I tried setting a Double variable to 0 and using that; the result uses 4 movsd instructions with XMM registers vs the 6 movq above. I'm having trouble reading the instruction ops/latency tables to tell whether that's a good thing. Anyway, thank you for your helpful and constructive advice.
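The indexed and pointer loop shapes compared in this post look like this in C (hypothetical function names; note that n here is an element count, whereas Pascal's for i := 0 to n visits n + 1 elements). Compiling such a file with gcc -S or clang -S is one way to compare the generated assembly the same way; GCC and Clang at -O2 often emit the same inner loop for both forms, so a difference seen in FPC's output says more about that compiler than about the technique.

```c
#include <stddef.h>

/* Indexed form: x := x + foo[i] */
double sum_indexed(const double *foo, size_t n)
{
    double x = 0.0;
    for (size_t i = 0; i < n; i++)
        x += foo[i];
    return x;
}

/* Pointer form: walk a pointer instead of indexing (pfoo^ then Inc(pfoo)) */
double sum_pointer(const double *foo, size_t n)
{
    double x = 0.0;
    const double *p = foo;
    const double *end = foo + n;   /* one past the last element */
    while (p != end)
        x += *p++;
    return x;
}
```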


PascalDragon

  • Hero Member
  • Posts: 6284
  • Compiler Developer
« Reply #22 on: December 30, 2019, 09:42:04 am »
Quote
I have converted all the C++ consts to constrefs
It might not help anything, but you might want to use const instead: unlike constref (which is always passed by reference), with const the compiler is free to pass the parameter in a more optimal way. As said, it might not help in your situation, but that's the general rule of thumb: only use constref when you really, really need a reference, not for optimization.

Quote
I got about a 0.25 to 0.5% improvement with Move, which is used heavily, but FillQWord turned out to be probably slower than plain for loops.
FillQWord is likely the least optimized of the Fill* routines. I'd suggest using FillChar instead, which is usually the best optimized one.
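A byte-oriented fill such as FillChar works for zeroing Double arrays because, on IEEE 754 platforms, +0.0 is encoded as all-zero bits. A C sketch of the same idea for a pair of output buffers (the function name is hypothetical), with memset playing the role of FillChar(Buffer, Count * SizeOf(Double), 0):

```c
#include <stddef.h>
#include <string.h>

/* Zero a stereo pair of double buffers with a byte fill.
   Valid because IEEE 754 +0.0 is encoded as all-zero bits. */
void clear_stereo(double *outL, double *outR, size_t frames)
{
    memset(outL, 0, frames * sizeof *outL);
    memset(outR, 0, frames * sizeof *outR);
}
```

Note this only works for zero; filling with any other Double value needs a typed loop or a QWord-sized fill.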

Additionally, you could try the C memory manager by adding the cmem unit as the first unit in your program's uses clause, just to check whether the RTL's heap makes a difference here.

avra

  • Hero Member
  • Posts: 2580
« Reply #23 on: December 30, 2019, 09:53:36 am »
Quote
Code: Pascal
begin
  l := Floor(Knob2 * 1023);
  for j := 0 to 1023 do
    begin
      WaveL[i][j] := (WaveL[Wave1][j] * (1 - Knob1) + WaveL[Wave2][j + l] * Knob1) * 2;
      WaveR[i][j] := (WaveR[Wave1][j] * (1 - Knob1) + WaveR[Wave2][j + l] * Knob1) * 2;
      if (Abs(WaveL[i][j]) > 1) then WaveL[i][j] := Sign(WaveL[i][j]) * 2 - WaveL[i][j];
      if (Abs(WaveR[i][j]) > 1) then WaveR[i][j] := Sign(WaveR[i][j]) * 2 - WaveR[i][j];
    end;
end;

Have you checked what happens if you use helper variables for (1 - Knob1) and (j + l) and use them in those two places instead (computing 1 - Knob1 outside the loop), and also replace each *2 multiplication with an addition? You can also write your own inlined Sign() function that directly returns -2 if the highest bit of qword(double) is set, and +2 otherwise (https://en.wikipedia.org/wiki/Double-precision_floating-point_format).

You can also try compiler switches -OG -Or -Op3 -Ou (if they are not already included with -O4):
https://www.freepascal.org/docs-html/current/prog/progse49.html

Also check out:
https://www.freepascal.org/docs-html/prog/progsu58.html
https://wiki.freepascal.org/Optimization

Maybe you could try benchmarking fixed-point math against floating-point math. FPUs are very advanced nowadays, so to gain anything you would probably need to write your own custom fixed-point math type.

EDIT: "if highest bit in qword(double) is true" can also be tested as int64(double)<0, which might be faster.
« Last Edit: December 30, 2019, 11:10:23 am by avra »
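The suggestions above can be sketched in C as follows. This is an illustration under assumptions, not code from the thread: sign2 and wavefold_tweaked are hypothetical names, and memcpy stands in for the qword(double) reinterpretation because it is the portable way to read a double's bits in C.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical replacement for "Sign(x) * 2": -2.0 if the sign bit is set,
   +2.0 otherwise. memcpy reads the raw bits, like qword(double).
   Note that -0.0 also gives -2.0. */
static inline double sign2(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (bits >> 63) ? -2.0 : 2.0;
}

/* One channel with the suggested tweaks: (1 - knob1) hoisted out of the loop,
   "* 2" replaced by an addition, and the fold using sign2(). */
void wavefold_tweaked(double *out, const double *a, const double *b,
                      int len, int l, double knob1)
{
    const double inv = 1.0 - knob1;          /* computed once, not per sample */
    for (int j = 0; j < len; j++) {
        double x = a[j] * inv + b[j + l] * knob1;
        x = x + x;                           /* instead of x * 2 */
        if (x > 1.0 || x < -1.0)
            x = sign2(x) - x;
        out[j] = x;
    }
}
```

The EDIT's variant, reading the same bits as a signed 64-bit integer and testing < 0, checks exactly the same bit. Whether any of this beats what the compiler already emits has to be measured.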

ccrause

  • Hero Member
  • Posts: 1088
« Reply #24 on: December 30, 2019, 10:53:24 am »
Quote
Code: Pascal
begin
  l := Floor(Knob2 * 1023);
  for j := 0 to 1023 do
    begin
      WaveL[i][j] := (WaveL[Wave1][j] * (1 - Knob1) + WaveL[Wave2][j + l] * Knob1) * 2;
      WaveR[i][j] := (WaveR[Wave1][j] * (1 - Knob1) + WaveR[Wave2][j + l] * Knob1) * 2;
      if (Abs(WaveL[i][j]) > 1) then WaveL[i][j] := Sign(WaveL[i][j]) * 2 - WaveL[i][j];
      if (Abs(WaveR[i][j]) > 1) then WaveR[i][j] := Sign(WaveR[i][j]) * 2 - WaveR[i][j];
    end;
end;

Have you checked what happens if you use a helper variable for (1 - Knob1) and use that in those two places instead (computing it outside the loop), and also replace each *2 multiplication with an addition?

A random thought: split it into two loops so that one block of data is manipulated at a time; that could give better cache performance. I'm not even sure this is relevant on a modern CPU.
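A sketch of the split, assuming each channel's tables are separate arrays (all names hypothetical): one pass finishes the left channel before the right one starts, so each pass streams through three arrays instead of six. With 1024-sample Double tables the whole working set is only some tens of kilobytes, so on a modern CPU it may already sit in cache and the split may make no measurable difference.

```c
/* Hypothetical per-channel loop: crossfade, double, fold beyond +/-1. */
static void fold_channel(double *out, const double *a, const double *b,
                         int len, int l, double knob1)
{
    for (int j = 0; j < len; j++) {
        double x = (a[j] * (1.0 - knob1) + b[j + l] * knob1) * 2.0;
        if (x > 1.0 || x < -1.0)
            x = (x > 0.0 ? 2.0 : -2.0) - x;
        out[j] = x;
    }
}

/* Split version: finish the left channel completely, then the right one. */
void wavefold_split(double *outL, double *outR,
                    const double *aL, const double *aR,
                    const double *bL, const double *bR,
                    int len, int l, double knob1)
{
    fold_channel(outL, aL, bL, len, l, knob1);   /* pass 1: left */
    fold_channel(outR, aR, bR, len, l, knob1);   /* pass 2: right */
}
```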

mischi

  • Full Member
  • Posts: 189
« Reply #25 on: December 30, 2019, 11:00:27 am »
Some ideas:

Explicitly avoiding the repeated references to WaveL[i][j] and WaveR[i][j] might help: use a simple temporary variable and assign it at the end, although a good optimizer might do this for you.

Code: Pascal
WaveL_tmp := (WaveL[Wave1][j] * (1 - Knob1) + WaveL[Wave2][j + l] * Knob1) * 2;
WaveR_tmp := (WaveR[Wave1][j] * (1 - Knob1) + WaveR[Wave2][j + l] * Knob1) * 2;
if (Abs(WaveL_tmp) > 1) then WaveL_tmp := Sign(WaveL_tmp) * 2 - WaveL_tmp;
if (Abs(WaveR_tmp) > 1) then WaveR_tmp := Sign(WaveR_tmp) * 2 - WaveR_tmp;
WaveL[i][j] := WaveL_tmp;
WaveR[i][j] := WaveR_tmp;

Also, avoiding the calls to abs and sign might help:
Code: Pascal
if (WaveL_tmp > 1) then
  WaveL_tmp := 2 - WaveL_tmp
else if (WaveL_tmp < -1) then
  WaveL_tmp := -2 - WaveL_tmp;
Wondering whether they are inlined.

As ccrause suggested, if the two arrays WaveL and WaveR are large, there might be cache clobbering and two separate loops might be faster.
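mischi's branch-only fold is equivalent to the original Sign(x) * 2 - x whenever the fold applies, because Sign(x) is +1 for x > 1 and -1 for x < -1. The C sketch below (hypothetical function names) writes both forms so they can be checked against each other and timed separately.

```c
/* The original fold: if Abs(x) > 1 then x := Sign(x) * 2 - x */
double fold_sign(double x)
{
    if (x > 1.0 || x < -1.0) {
        double s = (x > 0.0) - (x < 0.0);    /* -1, 0 or +1, like Sign */
        x = s * 2.0 - x;
    }
    return x;
}

/* mischi's form: two comparisons, no Abs and no Sign */
double fold_branch(double x)
{
    if (x > 1.0)
        x = 2.0 - x;
    else if (x < -1.0)
        x = -2.0 - x;
    return x;
}
```

Values of exactly 1.0 or -1.0 pass through both versions unchanged, matching the strict > 1 test in the Pascal code.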

PascalDragon

  • Hero Member
  • Posts: 6284
  • Compiler Developer
« Reply #26 on: December 30, 2019, 11:20:09 am »
Quote
Also, avoiding the calls to abs and sign might help:
Code: Pascal
if (WaveL_tmp > 1) then
  WaveL_tmp := 2 - WaveL_tmp
else if (WaveL_tmp < -1) then
  WaveL_tmp := -2 - WaveL_tmp;
Wondering whether they are inlined.
Sign is inlined, Abs is a compiler intrinsic.

syntonica

  • Full Member
  • Posts: 120
« Reply #27 on: December 30, 2019, 12:04:20 pm »
Insomnia strikes!


Quote
Have you checked what happens if you use helper variables for (1-Knob1) and (j+l) and use them instead in those 2 places (and put calculation outside of the loop for 1-Knob1)
Yes. It made things slower for both Pascal and C. In the few places where I need "(j + l) and $07ff", breaking it out is faster. I never realized "and" was so slow!
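For reference, "(j + l) and $07ff" is the usual power-of-two wraparound: $07FF is 2047, so the mask keeps the read index inside a 2048-entry table without a branch or a modulo. A minimal C sketch, assuming a hypothetical 2048-sample table and helper name:

```c
#define TABLE_SIZE 2048u                 /* must be a power of two for the mask to work */
#define TABLE_MASK (TABLE_SIZE - 1u)     /* 0x7FF, i.e. the $07FF in the Pascal code */

/* Read table[(j + l) mod TABLE_SIZE] with a bitwise AND instead of a modulo. */
static inline double read_wrapped(const double *table, unsigned j, unsigned l)
{
    return table[(j + l) & TABLE_MASK];
}
```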

Quote
You can also try compiler switches -OG -Or -Op3 -Ou (if they are not already included with -O4):
All four are obsolete and, sadly, did no good.

Quote
Maybe you could try to benchmark fixed point math vs floating point math. CPU coprocessors are now very advanced so to gain from it you would probably need to make your own custom fixed point math type.

EDIT: "if highest bit in qword(double) is true" can also be tested as int64(double)<0, which might be faster.
Fixed point was common in the old days, when you were feeding a digital-to-analog converter directly or when an FPU was a luxury. Switching everything over to int64 would be a big undertaking, and then there's the conversion from fixed point to floating point (single or double); I'm not sure how fast that could be. For some operations, like FFT and convolution, the range I need goes well beyond -1 to 1. I also need the long mantissa that floating point provides, so tiny additions don't get lost. I'm not sure a fixed-point format would leave enough bits for the integer part.
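To put rough numbers on that trade-off, here is a minimal Q16.16 fixed-point sketch. The format (16 integer and 16 fractional bits in an int32_t) and the function names are assumptions chosen for illustration. It shows both limits mentioned above: the integer part only reaches about ±32768, and an increment of 1e-6 vanishes entirely, whereas 1.0 + 1e-6 is still distinguishable from 1.0 in a Double.

```c
#include <stdint.h>

typedef int32_t q16_16;                  /* 16 integer bits, 16 fractional bits */
#define Q_ONE 65536                      /* 1.0 in Q16.16 */

/* double -> Q16.16, rounded to nearest; no overflow check (range is about +/-32768) */
static inline q16_16 to_fix(double x)
{
    return (q16_16)(x >= 0.0 ? x * Q_ONE + 0.5 : x * Q_ONE - 0.5);
}

/* Q16.16 -> double */
static inline double to_double(q16_16 a)
{
    return (double)a / Q_ONE;
}

/* Multiply through a 64-bit intermediate; the division truncates toward zero */
static inline q16_16 fix_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) / Q_ONE);
}
```

A wider Q32.32 format in an int64 would improve both range and resolution, but its multiply then needs a 128-bit intermediate, which adds its own cost.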




syntonica

  • Full Member
  • Posts: 120
« Reply #28 on: December 30, 2019, 12:13:35 pm »
Quote
It might not help anything, but you might want to use const instead: unlike constref (which is always passed by reference), with const the compiler is free to pass the parameter in a more optimal way. As said, it might not help in your situation, but that's the general rule of thumb: only use constref when you really, really need a reference, not for optimization.
I read on this board that somebody had better luck with constref, but that may have been in a slightly different context.

Quote
FillQWord is likely the least optimized of the Fill* routines. I'd suggest using FillChar instead, which is usually the best optimized one.
Haha! I guess I shouldn't believe everything I read! Per the manual for FillByte:
Code: Text
When the size of the memory location to be filled out is a multiple of 2 bytes, it is better
to use Fillword, and if it is a multiple of 4 bytes it is better to use FillDWord, these routines are
optimized for their respective sizes.
I assume that applies to FillQWord as well. Still, FillQWord did slow things down, so I'll give FillByte a whirl.

Quote
Additionally you could try to use the C memory manager by using the cmem unit as the first unit in your program file, just to check whether the RTL's heap is making a difference here.
That's not really much of a concern. By the time the plugin is up and running, it has grabbed all the memory it needs. I've purposely avoided allocating and freeing memory during playback in both C and Pascal, since the allocator can cause a stutter if it decides to reorganize.
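The pattern described here (grab all memory up front, never allocate in the audio path) can be sketched in C as below. The Synth struct and function names are hypothetical; a real plugin would hang these calls on its host API's create, process and destroy callbacks.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    double *outL, *outR;     /* output buffers, sized once at startup */
    size_t  max_frames;
} Synth;

/* Called once, outside the real-time thread: all allocation happens here.
   Returns 1 on success; on failure, synth_free() still cleans up safely. */
int synth_init(Synth *s, size_t max_frames)
{
    s->outL = calloc(max_frames, sizeof *s->outL);
    s->outR = calloc(max_frames, sizeof *s->outR);
    s->max_frames = max_frames;
    return s->outL != NULL && s->outR != NULL;
}

/* Called from the audio callback: only touches preallocated memory. */
void synth_process(Synth *s, size_t frames)
{
    if (frames > s->max_frames)
        frames = s->max_frames;              /* never grow buffers here */
    memset(s->outL, 0, frames * sizeof *s->outL);
    memset(s->outR, 0, frames * sizeof *s->outR);
    /* ... render into outL/outR ... */
}

/* Called once at shutdown. free(NULL) is a no-op, so partial init is fine. */
void synth_free(Synth *s)
{
    free(s->outL);
    free(s->outR);
    s->outL = s->outR = NULL;
}
```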

syntonica

  • Full Member
  • Posts: 120
« Reply #29 on: December 30, 2019, 12:21:26 pm »
Quote
Also, avoiding the calls to abs and sign might help:
Interestingly, C (and C++11) doesn't have a native sign function, so I use a branchless inline version:
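Code: C
/* static inline: a plain "inline" definition in C99 can fail to link
   when the compiler decides not to inline it */
static inline double sign(double x)
{
  /* branchless: each comparison yields 0 or 1, so the result is -1, 0 or +1 */
  return (x > 0.0) - (x < 0.0);
}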
It seems quite speedy in C compared to an if/else if version. I haven't looked too hard at the Pascal equivalents to see how fast they really are, since they seem well optimized.