Print Page - For Loops and Registers

Free Pascal => Beginners => Topic started by: syntonica on December 09, 2019, 06:08:18 am

Title: For Loops and Registers
Post by: syntonica on December 09, 2019, 06:08:18 am

I've been playing with small snippets and compiler optimizations. Nothing I specified seemed to do any good, so I finally looked at the assembly. The loop variables are being stored in memory slots and not registers! >:(

Here is the code:

Code: Pascal [Select][+]

program p;
 
 
uses SysUtils, DateUtils, Math;
 
 
var
  i, j: Int64;
  r: Single;
  t1, t2: TDateTime;
begin
  r := 0;
  t1 := Now();
  for j:=1 to 10000 do begin
    for i:=1 to 100 do begin
      r := r + (power((1+SQRT(5)) * 0.5,i)-power((1-SQRT(5)) * 0.5,i))/SQRT(5)
    end;
  end;
  t2 := Now();
  WriteLn(r,' ',MilliSecondsBetween(t2, t1));
end.
 

Here are my compiler switches:

Code: Pascal [Select][+]

fpc -O4 -a -CpCOREI -CfSSSE3 -Sv "%f"
 

What am I doing wrong? :'(

Title: Re: For Loops and Registers
Post by: PascalDragon on December 09, 2019, 09:30:23 am

You need to put your code into a function/procedure and call that. If the code is part of the main program block then the variables are global variables which are considered volatile and thus are not optimized into registers by design.

Title: Re: For Loops and Registers
Post by: syntonica on December 09, 2019, 10:34:36 am

Thanks for the tip! I was wondering about that. Here's Mark II:

Code: Pascal [Select][+]

 
program p;
 
 
uses SysUtils, DateUtils, Math;
 
 
var
  t1, t2: TDateTime;
  r: Single;
 
 
function calc() : Single;
  var
    i, j: Integer;
  begin
    calc := 0;
    for j:=1 to 10000 do begin
      for i:=1 to 100 do begin
        calc := calc + (power((1+SQRT(5))/2,i)-power((1-SQRT(5))/2,i))/SQRT(5)
      end;
    end;
  end;
 
 
begin
  t1 := Now();
  r := calc;
  t2 := Now();
        WriteLn(r,' ',MilliSecondsBetween(t2, t1));
end.
 

I gained about 8ms on my machine. As it stands, Pascal runs at about 140ms avg and my C version compiled with clang is at about 90ms avg per run, both optimized with -O4.

I have to say, the disparity is very disheartening. The DSP code I wish to port over relies on tight loops that cannot be vectorized due to the complex calculations within. If the results I have been seeing in my comparison tests were in the neighborhood, I'd take the plunge, but it looks like FP is not for my use case. And I really like the idea of being able to cross-compile the Windows version on my Mac! If I was making a command-line or desktop application, Pascal would be a no-brainer, but right now, I have a need for speed.

Unless somebody has any suggestions (other than raw assembly), I will probably have to move on.

Title: Re: For Loops and Registers
Post by: Thaddy on December 09, 2019, 11:38:14 am

Why do you do not pre-calculated consts?

Code: Pascal [Select][+]

program p;
{$mode delphi}{$I-}
uses 
  SysUtils, DateUtils, Math;
 
var
  t1, t2: TDateTime;
  r: Single; 
  
function calc: Single;inline;
const
  sqrt5 = SQRT(5); 
  delta = (1+sqrt5)/2; 
var
  i, j: Integer;
begin
  result:=0;
  for j:=1 to 10000 do 
    for i:=1 to 100 do 
      result := result + (power(delta,i)-power(delta,i))/sqrt5
end;
 
begin
  t1 := Now;
  r := calc;
  t2 := Now;
  WriteLn(r,' ',MilliSecondsBetween(t2, t1));
end.

(And why do you count from 1, but that's another matter)

Title: Re: For Loops and Registers
Post by: marcov on December 09, 2019, 11:43:52 am

I checked, and power() is not inline, and double only on win64. I think the constants are inline.

The generated code itself doesn't look really shabby, nor really neat. Well, that is the state of FPC.

Title: Re: For Loops and Registers
Post by: Thaddy on December 09, 2019, 11:58:30 am

I have a slight improvement on ARMHF with my code.
Usually for real-time DSP you would use an approximation function instead of Power().

Title: Re: For Loops and Registers
Post by: syntonica on December 09, 2019, 03:02:57 pm

Quote from: Thaddy on December 09, 2019, 11:38:14 am

Why do you do not pre-calculated consts?

(And why do you count from 1, but that's another matter)

The compiler should be calculating the constants and substituting the result.

(And counting from 1 because I was using the code for something else with divides, I think.)

Title: Re: For Loops and Registers
Post by: syntonica on December 09, 2019, 03:23:52 pm

Also, even with you zeroing out the function, there's no speed up in the code. :D

Title: Re: For Loops and Registers
Post by: winni on December 09, 2019, 04:46:13 pm

HI!

Dont call the function calc while the rest of the programm is still busy with initialization.
It makes a huge difference.

I called the function calc in an OnClick event and here are the values from my maschine:

Direct started by program: 457 ms
Started by OnClick: 212 ms

Winni

PS Linux 64 gtk2

Title: Re: For Loops and Registers
Post by: marcov on December 09, 2019, 04:49:13 pm

Quote from: Thaddy on December 09, 2019, 11:38:14 am

(And why do you count from 1, but that's another matter)

That's probably because x^0 is fairly predictable.

Title: Re: For Loops and Registers
Post by: Thaddy on December 09, 2019, 05:45:31 pm

Quote from: syntonica on December 09, 2019, 03:02:57 pm

The compiler should be calculating the constants and substituting the result.

Although the compiler can often do such things - nowadays - it also shows the capability of the programmer...
A real programmer would have analyzed the issue and - using whatever language at hand - optimized it by improving the algorithm.
That does not mean your code becomes faster, it means that you understand your code.

Title: Re: For Loops and Registers
Post by: syntonica on December 09, 2019, 06:37:08 pm

Quote from: Thaddy on December 09, 2019, 05:45:31 pm

Although the compiler can often do such things - nowadays - it also shows the capability of the programmer...
A real programmer would have analyzed the issue and - using whatever language at hand - optimized it by improving the algorithm.
That does not mean your code becomes faster, it means that you understand your code.

A "real" programmer wouldn't have screwed up refactoring a function either, so let's leave fallacies and ad hominems out of it and stick to the matter at hand.

The whole formula is just giving the CPU something to do. Like I said, The compiler should be calculating the constants and substituting the result, which it was doing. I'm looking at the optimizer to see what it can (and cannot) do.

At this point, I'm going to have to test out a much larger, more complicated example. My little one-offs are not capable of much optimization, so I'm going to hit it with a low-pass filter and run some samples through it.

Title: Re: For Loops and Registers
Post by: syntonica on December 09, 2019, 09:40:29 pm

Hooray! It looks like Pascal is going to work!

Here is a small snippet of the type of processing (wavefolding) I need to do:

Code: Pascal [Select][+]

 
 
 
program test5;
 
 
Uses Math, SysUtils, DateUtils;
 
 
var
  w1, w2, w3: array [0 .. 1023] of Double;
  k1: Double;
  j, k, l: Integer;
  t1, t2: TDateTime;
  x: Int64;
  
begin
  { stuff some random values to look like audio samples }
  for j := 0 to 1023 do
    begin
      w2[j] := Random * 2.0 - 1.0;
      w3[j] := Random * 2.0 - 1.0;
    end;
  l  := Round(Random * 1023.0);
  k1 := Random * 1023.0;
  { time the actual work }
  t1 := Now();
  for x:=1 to 1000 do
    for j := 0 to 1023 do
      begin
        k := (j + l) and 1023;
        w1[j] := (w2[j] * (1.0 - k1) + w3[k] * k1) * 2.0;
        if Abs(w1[j]) > 1.0 then w1[j] := Sign(w1[j]) * 2.0 - w1[j];
        w1[j] := (w3[j] * (1.0 - k1) + w2[k] * k1) * 2.0;
        if Abs(w1[j]) > 1.0 then w1[j] := Sign(w1[j]) * 2.0 - w1[j];
      end;
  t2 := Now();
  WriteLn(MilliSecondsBetween(t2, t1));
end.
 

FPC ranges from 17 to 20ms. C-code with GCC averages between 18 and 19ms.
Looks like y'all are stuck with me. :P

Title: Re: For Loops and Registers
Post by: marcov on December 09, 2019, 09:47:58 pm

I've been translating a FFT routine to assembler in lost moments the last few weeks. Looks a bit like this code. Though this code makes no sense since it assigns values to w1[j] that are provenly never used?

Title: Re: For Loops and Registers
Post by: syntonica on December 09, 2019, 10:15:26 pm

Quote from: marcov on December 09, 2019, 09:47:58 pm

I've been translating a FFT routine to assembler in lost moments the last few weeks. Looks a bit like this code. Though this code makes no sense since it assigns values to w1[j] that are provenly never used?

It's just heavily stripped down from my codebase, so none of it will probably make sense to anybody. I chose at random one of ~75 procedures that stuff an array to be used to generate the final audio. The C-code test is uglier since there's no native signum function. C has a couple of really weird insufficiencies.

Does your FFT routine use complex numbers? I have a routine that does it in place in a 1024 sample window, no complex numbers needed.

Edit: And I just realized I did that one Random wrong. Still getting the hang of things...

Title: Re: For Loops and Registers
Post by: winni on December 09, 2019, 10:32:59 pm

Quote from: syntonica on December 09, 2019, 10:15:26 pm

Still getting the hang of things...

One tip for reducing the overhead of the milliseconds:

Code: Pascal [Select][+]

T1: QWord;
 
…...
T1 := GetTickCount64;
 
......
 
writeln (GetTickCount64 - T1);

Winni

Title: Re: For Loops and Registers
Post by: syntonica on December 09, 2019, 10:41:21 pm

Quote from: winni on December 09, 2019, 10:32:59 pm

Quote from: syntonica on December 09, 2019, 10:15:26 pm
Still getting the hang of things...

One tip for reducing the overhead of the milliseconds:
Code: Pascal [Select][+][-]
T1: QWord;

…...
T1 := GetTickCount64;

......

writeln (GetTickCount64 - T1);

Winni

Thanks! But it looks like this is ticks of indeterminate length. I need real time since I'm comparing FP with C. It'll be useful, though, when I need to start hand-optimizing.

Title: Re: For Loops and Registers
Post by: winni on December 09, 2019, 10:54:42 pm

Hi!

GetTickCount64 is defined as milliseconds since systemstart.

It relies on the OS ticks.

It is the same 'exact' as now is.

Winni

PS: After 5.8E14 years it will wrap to 0!! Take care!

Title: Re: For Loops and Registers
Post by: syntonica on December 09, 2019, 11:25:21 pm

Quote from: winni on December 09, 2019, 10:54:42 pm

Hi!

GetTickCount64 is defined as milliseconds since systemstart.

It relies on the OS ticks.

It is the same 'exact' as now is.

Winni

PS: After 5.8E14 years it will wrap to 0!! Take care!

Oh, cool! The docs said: "It is useful for time measurements, but no assumptions should be made as to the interval between the ticks."

How trustworthy are the docs? Apple's are pretty awful, even to the point of just being flat out wrong, and not because they're outdated.

The sun only has 5.0E12 years before it dies... Really put's things in perspective. :-\

Title: Re: For Loops and Registers
Post by: marcov on December 09, 2019, 11:34:49 pm

Quote from: syntonica on December 09, 2019, 10:15:26 pm

Does your FFT routine use complex numbers? I have a routine that does it in place in a 1024 sample window, no complex numbers needed.

Yes. But I need mixed radix because I have a 400 sample window. But it would always be good to see a different solution.

The assembler now parallelizes most re:im operations though, and even does two complex per instruction about half the time. It is not production ready yet though, just preparation for a planned move to 64-bit of our only remaining 32-bit application.

This application uses a lot of floating point calculations and is 140-200% slower than on 64-bit (using Delphi btw). Worse, the number of calculations is only going to increase, so buying yourself out of trouble with newer hardware is costly.

That said I benchmark with Delphi mostly and Delphi converts every single load to double and back and does all main operations in double.

Most of the calculations are not repeated a lot or parallelize, so I mostly started analysing (creating a good benchmark) and tackling a few simple but common primitives, FFT included.

Title: Re: For Loops and Registers
Post by: winni on December 10, 2019, 12:03:12 am

@syntonica

I just hat a look into the internals: Linux takes the ticks from the libc, Windows uses kerne32.dll to get the ticks.

And the joke with the 5.8E14 years has this Background: The predecessor of GetTickCount64 is GetTickCount - 32 bit.

In old days a lot of companies were happy that they had Windows NT 3.5 - a "stable" Windows. But from time to time NT crashed. Nobody knew why. Then the logic appeared: MaxDWord Milliseconds are 49.xx days - and then the server crashed. So you had to reboot the system after 48 days. And the GetTickCount64 came around and everybody was computing, when the overfow will happen ....

Winni

Title: Re: For Loops and Registers
Post by: syntonica on December 10, 2019, 12:04:28 am

Here's my code in C/C++. I just use it as static methods. It's been specifically stripped down to only work on 1024 sample windows. Unfortunately, the original has no attribution.

Title: Re: For Loops and Registers
Post by: syntonica on December 10, 2019, 12:10:44 am

Quote from: winni on December 10, 2019, 12:03:12 am

In old days a lot of companies were happy that they had Windows NT 3.5 - a "stable" Windows. But from time to time NT crashed. Nobody knew why. Then the logic appeared: MaxDWord Milliseconds are 49.xx days - and then the server crashed. So you had to reboot the system after 48 days. And the GetTickCount64 came around and everybody was computing, when the overfow will happen ....

Winni

I completely forgot about that! When that was happening, the company I was working for was running Novell. Rock solid, but don't accidentally print 500 pages as my sysadmin couldn't find a way to kill a print job.

Title: Re: For Loops and Registers
Post by: jamie on December 10, 2019, 12:55:45 am

Well, it seems I can't update a PAS file...

I was going to offer you a "fourier.pas" file
oh well,...

http://jean-pierre.moreau.pagesperso-orange.fr/p_signal.html

Title: Re: For Loops and Registers
Post by: PascalDragon on December 10, 2019, 09:44:50 am

Quote from: syntonica on December 09, 2019, 11:25:21 pm

Oh, cool! The docs said: "It is useful for time measurements, but no assumptions should be made as to the interval between the ticks."

For the upcoming 3.2 release the docs have been updated some months ago (not yet published). It now reads like this:

Quote

GetTickCount64 returns an increasing clock tick count in milliseconds.
It is useful for time measurements, but no assumptions should be made as to the interval between the ticks.

This means that the unit is always considered to be milliseconds, but the accuracy depends on the platform. E.g. if it would be used on a hypothetical platform that only supports a timer with a frequency of a second than you'd only get multiples of 1000 as a result.

Quote from: syntonica on December 09, 2019, 11:25:21 pm

How trustworthy are the docs? Apple's are pretty awful, even to the point of just being flat out wrong, and not because they're outdated.

As everything in FPC and Lazarus they are done by volunteers. So they can either be incomplete or could be improved, but they shouldn't be wrong. However unlike with Apple you can post a bug report and more often than not they are fixed rather quickly (though the changes will only be visible in the next release).

Title: Re: For Loops and Registers
Post by: marcov on December 10, 2019, 09:51:09 am

Quote from: syntonica on December 10, 2019, 12:04:28 am

Here's my code in C/C++. I just use it as static methods. It's been specifically stripped down to only work on 1024 sample windows. Unfortunately, the original has no attribution.

Thanks. I filed it for future reference. But yeah, it is really power of 2 only, not mixed radix.

Currently the 400 samples are related to camera and lighting, (400/10s = 40/s), but while it would be a whale to change, you never know long term. But in general the various physical factors get priority over the samplesize.

I got my mixed radix from this page https://www.simdesign.nl/components.html but originally I did only 10/frame, so performance was not THAT important.

Meanwhile I tripled the number of cameras and it is going in the direction of hundreds/frame.

Title: Re: For Loops and Registers
Post by: syntonica on December 10, 2019, 05:08:06 pm

Quote from: marcov on December 10, 2019, 09:51:09 am

Quote from: syntonica on December 10, 2019, 12:04:28 am
Here's my code in C/C++. I just use it as static methods. It's been specifically stripped down to only work on 1024 sample windows. Unfortunately, the original has no attribution.

Thanks. I filed it for future reference. But yeah, it is really power of 2 only, not mixed radix.

Currently the 400 samples are related to camera and lighting, (400/10s = 40/s), but while it would be a whale to change, you never know long term. But in general the various physical factors get priority over the samplesize.

I got my mixed radix from this page https://www.simdesign.nl/components.html (https://www.simdesign.nl/components.html) but originally I did only 10/frame, so performance was not THAT important.

Meanwhile I tripled the number of cameras and it is going in the direction of hundreds/frame.

40 seems an odd number for video, though. I'm used to the usual suspects when it comes to frame rates. Also, I'm assuming that FFT is used to transform luminance and color channels to the frequency domain? So much interesting stuff out there, so little time to look at it all!

Title: Re: For Loops and Registers
Post by: marcov on December 10, 2019, 05:55:20 pm

Quote from: syntonica on December 10, 2019, 05:08:06 pm

40 seems an odd number for video, though.

These are industrial cameras doing real frames, not compressed streams.

Quote

I'm used to the usual suspects when it comes to frame rates. Also, I'm assuming that FFT is used to transform luminance and color channels to the frequency domain? So much interesting stuff out there, so little time to look at it all!

No. It is to filter image primitives like edges. It works because the camera is looking at an object on a turntable so there is some sinusoid tendency in the data, so discarding higher order terms is a noise filter that doesn't change amplitude too much. (which most forms of averaging filters do). The exact amplitude is important to correct for variable perspective of offcenter objects.