Lazarus

Free Pascal => General => Topic started by: Muso on November 04, 2014, 02:23:57 pm

Title: performance problem with Free Pascal 2.6.x
Post by: Muso on November 04, 2014, 02:23:57 pm
As my old Delphi 7 does no longer work correctly with Win 7 profession 64bit, I switched to Free pascal with Lazarus. I encounter differences in the speed of the compiled programs. Programs using the same code are slower than with Delphi. Here is a simple routine:

Code: [Select]
procedure Filter(const A: double);
var ZHelp,Trapezoid : double;
    k,n, Number : integer : integer;
    ZArray,ZArrayFilter : array of double;
 
begin
 For n:= 1 to Number do
  begin
   ZHelp:= 0;
   For k:= 1 to Number do
    ZArray[k-1]:= ZValues[k-1] * A * exp(-Pi*power(A/1000*(XValues[k-1]-XValues[n-1]), 2));
   // integrate (trapezoidal rule)
   For k:= 1 to Number-1 do
   begin
    Trapezoid:= (abs(ZArray[k-1]-ZArray[k])/2 + Min(ZArray[k-1], ZArray[k]))
                * (XValues[k]-XValues[k-1]); //trapezoid  = triangle + rectangle
    ZHelp:= ZHelp + Trapezoid;
   end; //end for k
   ZArrayFilter[n-1]:= ZHelp / 1000;
 end; //end for n (Filter)
end;

My arrays have up to 30.000 values and Number is in the same region. For Number:=10800 Delphi needs 17 s while Free Pascal 2.6 needs 20 s (18% slower).  (I cannot compile using Delphi 7 anymore, so my times are hand-stopped.)

I read in this thread: http://free-pascal-general.1045716.n5.nabble.com/code-optimization-td2848157.html (http://free-pascal-general.1045716.n5.nabble.com/code-optimization-td2848157.html)
that FreePascal is much slower than Delphi because it has problems if the array size is not a power of 2. But this info is from 2010 and Free pascal 2.2.

So my question is what I can do to make the code at least as fast as on Delphi 7? I already use the compiler options
-Mdelphi -O3
and also tried
-Cfsse2 and -Cfsse3
without any gain in speed. I also read that there is an optimization called "fastmath" but Lazarus doesn't provide such a compiler option.

Any ideas?
Title: Re: performance problem with Free Pascal 2.6.x
Post by: marcov on November 04, 2014, 02:32:42 pm
Not the array size, but the ELEMENT size a power of two. But you seem to use doubles which have size 8, which is 2**3, so a power of 2.

The  posts you seem to indicate that Delphi does some form of strength reduction in such case to avoid repeated multiplications (I can't remember seeing D7 do that, but that is what it seems to say).

I don't really see anything that can be improved quickly, so you would probably have to compare the generated assembler.

Or just try the development version.

Title: Re: performance problem with Free Pascal 2.6.x
Post by: Martin_fr on November 04, 2014, 03:43:15 pm
Look at:
http://bugs.freepascal.org/view.php?id=10275

Afaik fpc trunk optimizes more, but I do not know how much of this issue is covered.

One think you can try (if not solved otherwise):

ZArray[k] / ZArray[k-1]

before the loop:
ZArrayElementPointer = @ZArray[0]

in the loop
inc(ZArrayElementPointer);

Title: Re: performance problem with Free Pascal 2.6.x
Post by: marcov on November 04, 2014, 03:49:36 pm
Afaik SSA is still a private branch Florian works on from time to time.
Title: Re: performance problem with Free Pascal 2.6.x
Post by: Never on November 04, 2014, 04:12:16 pm
replace all var / some_num with coresponting var * and see if this makes a dif
ex: A/1000=A*0,001 
Edit***: also [ A/1000 ] is a const for its place so calculate once before the for
Title: Re: performance problem with Free Pascal 2.6.x
Post by: wp on November 04, 2014, 04:43:54 pm
Quote
As my old Delphi 7 does no longer work correctly with Win 7 profession 64bi
I don't want to advertise Delphi here, but I have D7 running on Win 7-64bit, and it is running fine. The only thing to take care of is not to install it to Program Files, but to any other folder to which you have write access - I have it in C:\Delphi7.
Title: Re: performance problem with Free Pascal 2.6.x
Post by: marcov on November 04, 2014, 04:49:20 pm
Quote
As my old Delphi 7 does no longer work correctly with Win 7 profession 64bi
I don't want to advertise Delphi here, but I have D7 running on Win 7-64bit, and it is running fine. The only thing to take care of is not to install it to Program Files, but to any other folder to which you have write access - I have it in C:\Delphi7.

And install the help supplement for the .hlp help?
Title: Re: performance problem with Free Pascal 2.6.x
Post by: engkin on November 04, 2014, 05:07:37 pm
A side note: If I'm right that:
abs(A-B)/2 + Min(A, B)

equals:
(A+B)/2

This part from the second inner loop:
Code: [Select]
    Trapezoid:= (abs(ZArray[k-1]-ZArray[k])/2 + Min(ZArray[k-1], ZArray[k]))
                * (XValues[k]-XValues[k-1]); //trapezoid  = triangle + rectangle

Could be changed to:
Code: [Select]
    Trapezoid:= (ZArray[k-1]+ZArray[k])/2
                * (XValues[k]-XValues[k-1]); //trapezoid  = triangle + rectangle
Title: Re: performance problem with Free Pascal 2.6.x
Post by: wp on November 04, 2014, 05:09:01 pm
Quote
help supplement for the .hlp help
Yes, if I remember correctly hlp files don't work in Win7 out of the box, but MS has WinHlp32 as a separate download (http://www.microsoft.com/en-us/download/details.aspx?id=91), I guess that's what you mean by "help supplement". Working perfectly.
Title: Re: performance problem with Free Pascal 2.6.x
Post by: Never on November 04, 2014, 06:02:16 pm
@engkin ***thumbs up***
Title: Re: performance problem with Free Pascal 2.6.x
Post by: Muso on November 04, 2014, 06:28:13 pm
Quote
abs(A-B)/2 + Min(A, B)
equals:
(A+B)/2

Oh, stupid me. Many thanks!!! Now the whole routine is more than 10 % faster. (is now as fast as Delphi with the old code (hand stopped))

Quote
[ A/1000 ] is a const

Thanks! I have overseen this.

Quote
running on Win 7-64bit, and it is running fine. The only thing to take care of is not to install it to Program Files
MS has WinHlp32 as a separate download

many thanks!!! Now I can use Delphi under Win 7.

Summary: I have learned how to improve the logic of the coding and how to run Delphi under Win 7 but Free pascal is still slower than Delphi. Maybe there is something one can do?

I will nevertheless stay with Lazarus/Free pascal. the IDE is more comfortable in my opinion and the killer feature for me is the platform independence.

Many thanks again to all people who helped.
Title: Re: performance problem with Free Pascal 2.6.x
Post by: Muso on November 04, 2014, 06:31:14 pm
One think you can try (if not solved otherwise):

ZArray[k] / ZArray[k-1]

before the loop:
ZArrayElementPointer = @ZArray[0]

in the loop
inc(ZArrayElementPointer);

Thanks. I must admit that I don't understand what you mean. (I never worked yet with pointers in Pascal) How would the resulting code look and why might a pointer be a better solution?
Title: Re: performance problem with Free Pascal 2.6.x
Post by: Martin_fr on November 04, 2014, 07:03:27 pm
instead of
Code: [Select]
  for i = 0 to 100 do begin
    x := x + someArrayOfDouble[i];
  end;

do
Code: [Select]
  var p: ^Double; // pointer to array element type

  p := @someArrayOfDouble[0]; // point to first element
  for i = 0 to 100 do begin
    x := x + p^;
    inc(p); // point to next element
  end;

I am not sure, if and/or when fpc does optimize this, but it may not, on not always.:

In the first loop, the byte address of each array element has to be calculated as
  byteAddressOfDouble = index *sizeof(double)

In the 2nd loop, this multiplication is no longer needed.
Title: Re: performance problem with Free Pascal 2.6.x
Post by: engkin on November 04, 2014, 07:19:09 pm
@Never, thanks!

@Musa, your code is fine and easy to understand. That's how I write mine. I know am not answering your question directly, but still any speed gained here might help in general, so I hope you don't mind.

The first inner loop can accept some change, as well. For instance, instead of Power in:
Code: [Select]
    ZArray[k-1]:= ZValues[k-1] * A * exp(-Pi*power(A/1000*(XValues[k-1]-XValues[n-1]), 2));

you can use IntPower for integer exponent values. Power and IntPower have the same accuracy:
Code: [Select]
    ZArray[k-1]:= ZValues[k-1] * A * exp(-Pi*intpower(A/1000*(XValues[k-1]-XValues[n-1]), 2));

In case of 2, I think a simple multiplication should be faster with (or without?) optimization.
Code: [Select]
  Ad1000 := A/1000;
..
    deltaX := Ad1000*(XValues[k-1]-XValues[n-1]);
    ZArray[k-1]:= ZValues[k-1] * A * exp(-Pi*deltaX*deltaX);
But it does not have the same accuracy.
Title: Re: performance problem with Free Pascal 2.6.x
Post by: lagprogramming on November 04, 2014, 08:05:34 pm
instead of
Code: [Select]
  for i = 0 to 100 do begin
    x := x + someArrayOfDouble[i];
  end;

do
Code: [Select]
  var p: ^Double; // pointer to array element type

  p := @someArrayOfDouble[0]; // point to first element
  for i = 0 to 100 do begin
    x := x + p^;
    inc(p); // point to next element
  end;

I am not sure, if and/or when fpc does optimize this, but it may not, on not always.:

In the first loop, the byte address of each array element has to be calculated as
  byteAddressOfDouble = index *sizeof(double)

In the 2nd loop, this multiplication is no longer needed.


   I understand your approach but you might have a surprise regarding modern processors. The second code proposed by you will run slower than the first one on many new processors.  I have this problem when trying to optimize loops for zero comparisons. It's very tricky. :)
Title: Re: performance problem with Free Pascal 2.6.x
Post by: Martin_fr on November 04, 2014, 09:02:11 pm
I know modern CPU do a lot of pre-fetch and prediction stuff (I don't know the details)

The above code, has for many people been faster. Even with on a modern CPU, I would expect it at worst to be of equal speed.

But then I am no expect on this topic.

Out of interest which modern cpu feature did you think of?
- "branch prediction" should not be affected (the branch at the end of the loop stays the same)
- data-fetching/caching/cache lines should also be the same, as the same data, as accessed in the same order.



Title: Re: performance problem with Free Pascal 2.6.x
Post by: lagprogramming on November 04, 2014, 10:58:46 pm
@Martin_fr
   I'm not an expert either but regarding the presented code I've noticed:
1)If I change "For n:= 1 to Number do" to "For n:=0 to Number-1 do" and change "[n-1]" with "[n]" inside the loop
and
If I change "For k:= 1 to Number do" to "For k:=0 to Number-1 do" and change "[k-1]" with "[k]" inside the loop
then the code will be executed slower, although to me, it should run faster due to the removal of "-1"s. Probably it's due to the initial assignment at the beginning of the "for" loops, I didn't looked at the code produced.

2)Situations like the following one are known. If I completely remove variable Trapezoid and replace
Code: [Select]
ZHelp:= ZHelp + Trapezoid;with
Code: [Select]
ZHelp:= ZHelp + ((abs(ZArray[k-1]-ZArray[k])/2 + Min(ZArray[k-1], ZArray[k])) * (XValues[k]-XValues[k-1]));again, the code is executed slower.

3)As far as I remember I've been surprised to notice that declaring a new variable "XValuesdouble:double;"
and changing
Code: [Select]
For k:=0 to Number-1 do
    ZArray[k]:= ZValues[k] * A * exp(-Pi*power(A/1000*(XValues[k]-XValues[n]), 2));

with

Code: [Select]
XValuesdouble:=XValues[n];
For k:=0 to Number-1 do
    ZArray[k]:= ZValues[k] * A * exp(-Pi*power(A/1000*(XValues[k]-XValuesdouble), 2));

again lead to a decrease of code execution speed.

Using level 3 optimization level.
However, It's a very strange CPU I have here, many things don't work as expected.
Title: Re: performance problem with Free Pascal 2.6.x
Post by: Martin_fr on November 05, 2014, 12:22:22 am
It may not be your CPU, if cou change variables, you may change, how fpc allocates registers.

And if an important value, no longer stays in a register, then execution speed drops.

As far the for loop, I would have to look at the generated code.

As for my example with the pointer, see the bug I linked, it was used as a fixed there, and worked well.
Title: Re: performance problem with Free Pascal 2.6.x
Post by: Leledumbo on November 05, 2014, 10:08:41 am
instead of
Code: [Select]
  for i = 0 to 100 do begin
    x := x + someArrayOfDouble[i];
  end;

do
Code: [Select]
  var p: ^Double; // pointer to array element type

  p := @someArrayOfDouble[0]; // point to first element
  for i = 0 to 100 do begin
    x := x + p^;
    inc(p); // point to next element
  end;

I am not sure, if and/or when fpc does optimize this, but it may not, on not always.:

In the first loop, the byte address of each array element has to be calculated as
  byteAddressOfDouble = index *sizeof(double)

In the 2nd loop, this multiplication is no longer needed.


   I understand your approach but you might have a surprise regarding modern processors. The second code proposed by you will run slower than the first one on many new processors.  I have this problem when trying to optimize loops for zero comparisons. It's very tricky. :)
No, the second one will always run faster ATM. FPC simply doesn't optimize such access yet (not even in trunk AFAIR). Moreover, for loops are complex. It's much slower than while / repeat - until. This should help more:
Code: [Select]
  var p,q: ^Double; // pointer to array element type

  p := @someArrayOfDouble[0]; // point to first element
  q := @someArrayOfDouble[100]; // point to last element
  while p <= q do begin
    x := x + p^;
    inc(p); // point to next element
  end;

Title: Re: performance problem with Free Pascal 2.6.x
Post by: lagprogramming on November 05, 2014, 05:41:30 pm
   Because it might be related to the subject, the project attached shows me the following values:
qq1 31251 ms.
qq2 21762 ms.
qq1_nested 22134 ms.
qq2_nested 31257 ms.
   Something has to be wrong because qq1 is the same as qq1_nested and qq2 is the same as qq2_nested.
Am I missing something regarding the code? If not, do you have the similar results?
   I might be too tired now. %)
Title: Re: performance problem with Free Pascal 2.6.x
Post by: prino on November 09, 2014, 05:19:39 pm
A side note: If I'm right that:
abs(A-B)/2 + Min(A, B)

equals:
(A+B)/2
Sadly, that's not always true. The first equation will not overflow for very large integers, the second will!
Title: Re: performance problem with Free Pascal 2.6.x
Post by: Martin_fr on November 09, 2014, 05:42:01 pm
   Something has to be wrong because qq1 is the same as qq1_nested and qq2 is the same as qq2_nested.
Am I missing something regarding the code? If not, do you have the similar results?
   I might be too tired now. %)

Use -al as compiler option, and compare generated assembler

Btw, they are not the same: Nested functions have a hidden parameter containing a pointer to the stackframe of the outer func/proc. Even so this is not used, it is present, and therefore it may affect how registers are allocated.
Plus your measuring includes the time to call the procedure, and that also includes passing the hidden param.
Title: Re: performance problem with Free Pascal 2.6.x
Post by: engkin on November 10, 2014, 01:08:51 am
A side note: If I'm right that:
abs(A-B)/2 + Min(A, B)

equals:
(A+B)/2
Sadly, that's not always true. The first equation will not overflow for very large integers, the second will!
Prino, welcome to Lazarus/FPC forum.  :)

Although A and B in this case are of type double, you still have a valid point. With integer numbers in general, would it be different if B were a negative number? Don't you think that A-B would overflow, if A and B were of different signs?
Title: Re: performance problem with Free Pascal 2.6.x
Post by: lagprogramming on November 10, 2014, 05:17:54 pm
   Something has to be wrong because qq1 is the same as qq1_nested and qq2 is the same as qq2_nested.
Am I missing something regarding the code? If not, do you have the similar results?
   I might be too tired now. %)

Use -al as compiler option, and compare generated assembler

Btw, they are not the same: Nested functions have a hidden parameter containing a pointer to the stackframe of the outer func/proc. Even so this is not used, it is present, and therefore it may affect how registers are allocated.
Plus your measuring includes the time to call the procedure, and that also includes passing the hidden param.

qq1 31251 ms.
qq2 21762 ms.
qq1_nested 22134 ms.
qq2_nested 31257 ms.

Thank you. It appears it's not related to the subject so I won't continue, however:
1) The assembler code appears fine.
2) The idea wasn't why the nested procedures were slower, but why one of them was much faster(qq1_nested).
3) Most probably I've hit the "slow if then else" anomaly again. I think it may be related to the alignment(memory addresses) of the functions/procedures, conditional jumps...  :-\ I don't know exactly...
TinyPortal © 2005-2018