
Author Topic: efficiency problem  (Read 56638 times)

photor

  • Jr. Member
  • **
  • Posts: 80
Re: efficiency problem
« Reply #15 on: March 16, 2015, 06:14:52 am »
Following your and Marcov's suggestions, I have rewritten the function Multiply (as the function Mul in the attached code) using pointers.
Now the same matrix multiplication takes less than 2 seconds on my machine with -O3 in Free Pascal 2.6.4. That's rather significant progress! Though it is still 20 times slower than Mathematica, which uses multi-core processing and LAPACK or something like that.

Code: [Select]
Function Mul(Const mat1,mat2:Matrix):Matrix;
Var
 i,j,k:Word;
 sum:Extended;
 u,v:PExtended;
 mat:Matrix;
Begin
 // Transpose mat2 first, so that its columns become contiguous rows.
 For i:=0 to n-1 do
  For j:=0 to n-1 do mat[j,i]:=mat2[i,j];
 For i:=0 to n-1 do
  For j:=0 to n-1 do
  Begin
   sum:=0;u:=@mat1[i];v:=@mat[j];
   For k:=0 to n-1 do
   Begin
    sum+=u^*v^;
    Inc(u);Inc(v);
   End;
   Mul[i,j]:=sum;
  End;
End;

This extreme difference in efficiency is not credible.

I reckon Mathematica uses the Strassen algorithm, which reduces the runtime of square matrix multiplication from N^3 to N^2.807. With N=1000 this alone would cut the runtime to about 26%.
There are even better algorithms (Coppersmith–Winograd, N^2.37), but Wikipedia says these are not practical yet.
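For the 2x2 base case, Strassen's scheme replaces 8 multiplications with 7. A minimal sketch with scalar elements (the real algorithm applies these same formulas recursively to n/2 x n/2 blocks; the variable names are illustrative):

```pascal
program StrassenSketch;
// Strassen's seven products for a 2x2 multiplication C = A*B.
// Applied recursively to matrix blocks, this yields O(n^2.807).
var
  a11,a12,a21,a22, b11,b12,b21,b22: Extended;
  p1,p2,p3,p4,p5,p6,p7: Extended;
  c11,c12,c21,c22: Extended;
begin
  a11:=1; a12:=2; a21:=3; a22:=4;      // A = [[1,2],[3,4]]
  b11:=5; b12:=6; b21:=7; b22:=8;      // B = [[5,6],[7,8]]
  p1:=(a11+a22)*(b11+b22);
  p2:=(a21+a22)*b11;
  p3:=a11*(b12-b22);
  p4:=a22*(b21-b11);
  p5:=(a11+a12)*b22;
  p6:=(a21-a11)*(b11+b12);
  p7:=(a12-a22)*(b21+b22);
  c11:=p1+p4-p5+p7;
  c12:=p3+p5;
  c21:=p2+p4;
  c22:=p1-p2+p3+p6;
  writeln(c11:0:0,' ',c12:0:0,' ',c21:0:0,' ',c22:0:0); // 19 22 43 50
end.
```

Note that the constant factor is larger than for the naive loop, so Strassen only pays off above some block size.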

The major efficiency problem in your code is the inner loop and the access to mat2[k,j], where k is the loop variable. Array access is significantly faster if you just increment the index. Maybe you should consider transposing one of the matrices before the multiplication.

And, by the way, I am getting a stack overflow here. You should consider using dynamic arrays instead of static ones.
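A minimal sketch of the dynamic-array variant (names and the constant n are illustrative, not from the original code):

```pascal
program DynMatrix;
// Dynamic arrays are allocated on the heap, so a 1000x1000
// matrix of Extended no longer risks a stack overflow.
type
  TMatrix = array of array of Extended;
const
  n = 1000;
var
  m: TMatrix;
  i: Integer;
begin
  SetLength(m, n, n);       // heap-allocated and zero-initialized
  for i := 0 to n-1 do
    m[i,i] := 1.0;          // identity matrix, for example
  writeln(m[n-1,n-1]:0:1);  // 1.0
end.
```

SetLength also zero-initializes the elements, which is convenient for accumulator matrices.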

On the coprocessor: FPC normally uses the x87, but you may try SSE via the switch {$FPUTYPE SSE3}. Sometimes this produces faster code, but not always.
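The directive goes at the top of the program or unit, before any code. A sketch (whether it actually helps depends on the CPU and FPC version; note that Extended still goes through the x87 on i386, so Double is used here):

```pascal
program SseTest;
{$FPUTYPE SSE3}  // generate SSE3 floating-point code instead of x87
var
  x, y: Double;
begin
  x := 1.5;
  y := 2.5;
  writeln(x*y:0:2);  // 3.75
end.
```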

Also, try to make sure that Mathematica does not just cache previously calculated results.
http://de.mathworks.com/company/newsletters/articles/matlab-incorporates-lapack.html

Leledumbo

  • Hero Member
  • *****
  • Posts: 8835
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: efficiency problem
« Reply #16 on: March 16, 2015, 07:34:42 am »
That's a case where we need array indexing optimization: the index is currently recomputed on every access, even though the access in the loop is purely incremental. In theory, what you did by hand should be something the compiler can do automatically.

Eugene Loza

  • Hero Member
  • *****
  • Posts: 729
    • My games in Pascal
Re: efficiency problem
« Reply #17 on: March 16, 2015, 10:15:27 am »
I think it's a case of 'honest' calculation.

E.g. at my university our department had a very hard time with some math software... Maybe it was MathCad, I don't remember... When they solved a third-order polynomial equation by approximation (correct answer, but slow) and by Euler's method (fast, but wrong answer), they got different results... Finally we found out that the math software computed a wrong cube root... My Pascal program did the job.

Another example was a double infinite integral for the Quantum Field Department. No math software could handle it in days on rather modern computers (at the time). Only my Pascal program finally yielded the answer.

I bet the math software is more stable and optimized these days. But sometimes you have to be able to trust the result. That's when you dive into direct calculations and do everything honestly, without ambiguous optimizations. Often it's better to calculate the result for a week and know you can trust it than to get the result in seconds and then spend days tracking down the source of an error.

I ran a similar test on my 1.2 GHz single-core Celeron.
Code: [Select]
{$R-}{$Q-}

type valuetype = single;

var M1,M2,M3: array[1..1000,1..1000] of valuetype;

procedure TForm1.Button1Click(Sender: TObject);
var i,j,k:integer;
    T:TDateTime;
    sum:valuetype;
begin
 for i:=1 to 1000 do
  for j:=1 to 1000 do begin
    M1[i,j]:={round(}random*50{)};
    M2[i,j]:={round(}random*50{)};
  end;
 T:=now;
 for i:=1 to 1000 do
  for j:=1 to 1000 do begin
    sum:=0;
    for k:=1 to 1000 do
      sum:=sum + M1[i,k]*M2[k,j];
    M3[i,j]:=sum;
  end;
  Button1.caption:=inttostr(round((now-T)*24*60*60*1000))+' ms';
end;

With the results:
Single: 24 827 ms
Double: 37 832 ms
Extended: 116 832 ms
Integer: 51 894 ms
Word: 38 047 ms

I was really astonished to see that Double performed better than Word...
« Last Edit: March 16, 2015, 10:17:35 am by Eugene Loza »
My FOSS games in FreePascal&CastleGameEngine: https://decoherence.itch.io/ (Sources: https://gitlab.com/EugeneLoza)

rtusrghsdfhsfdhsdfhsfdhs

  • Full Member
  • ***
  • Posts: 162
Re: efficiency problem
« Reply #18 on: March 16, 2015, 12:26:00 pm »
Just write it with C++ either with ICC or GCC and then statically link it. Simple solution.  8-)

photor

  • Jr. Member
  • **
  • Posts: 80
Re: efficiency problem
« Reply #19 on: March 16, 2015, 01:33:59 pm »
I don't think C++ would do that job better than Pascal.

Just write it with C++ either with ICC or GCC and then statically link it. Simple solution.  8-)

lagprogramming

  • Sr. Member
  • ****
  • Posts: 409
Re: efficiency problem
« Reply #20 on: March 16, 2015, 02:45:05 pm »
Quote
I was really astonished to see that Double performed better than Word...
   I think that's because you apply the "round" function to the integer types. This conversion adds computation compared to plain floating-point multiplications.
   On my CPU, using Double takes only 5% more time than Single. Using Extended is almost 5 times slower.
   What surprises me is that native Integer on a 32-bit CPU appears to be a lot slower than Word.

Eugene Loza

  • Hero Member
  • *****
  • Posts: 729
    • My games in Pascal
Re: efficiency problem
« Reply #21 on: March 16, 2015, 03:00:34 pm »
I think that's due to the fact that you add the "round" function to integer types. This conversion adds additional computation compared to plain floating point multiplications.
I start counting time only after the matrices have been prepared (where T:=now), so round doesn't affect it.
It's just a moment of revelation for me :o I always believed that integer operations were much faster than float...
My FOSS games in FreePascal&CastleGameEngine: https://decoherence.itch.io/ (Sources: https://gitlab.com/EugeneLoza)

Leledumbo

  • Hero Member
  • *****
  • Posts: 8835
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: efficiency problem
« Reply #22 on: March 16, 2015, 03:36:20 pm »
I don't think C++ does that work better than pascal.

Just write it with C++ either with ICC or GCC and then statically link it. Simple solution.  8-)
Language-wise: no. Compiler-wise: yes. GCC does have this indexing optimization, so does ICC, and AFAIK Clang too. Perhaps other big compilers as well.

Nitorami

  • Hero Member
  • *****
  • Posts: 605
Re: efficiency problem
« Reply #23 on: March 16, 2015, 07:15:29 pm »
Quote
Now the same matrix multiplication only takes less than 2 seconds on my machine using -O3 in freepascal 2.6.4. That's a rather significant progress!

I'm surprised. I rather thought that, combined with an expensive operation such as multiplication of floats, array indexing would hardly matter. It is not quite a factor of 5, only a factor of 2 on my old machine, but that is still quite considerable, and the code seems to calculate the correct results.

What bugs me even more is that the initial matrix transpose does not seem to contribute at all. Replacing the element-wise assignment

Code: [Select]
  For i:=0 to n-1 do for j:=0 to n-1 do mat[j,i]:=mat2[i,j];

by a unity assignment
Code: [Select]
  For i:=0 to n-1 do for j:=0 to n-1 do mat[j,i]:= 1.0; //wrong of course

or even a fast floodfill with zeroes
Code: [Select]
  filldword (mat, sizeof (mat) div 4, 0); //wrong of course

even makes the execution time larger, not smaller.

Edit: Replacing the pointer arithmetic in the inner loop with a normal array access makes your code only 20% slower. With FPC 3.1.1 it is the other way round: the normal array access is 20% faster. May just be another case of black magic...
« Last Edit: March 16, 2015, 07:28:04 pm by Nitorami »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12645
  • FPC developer.
Re: efficiency problem
« Reply #24 on: March 16, 2015, 07:29:07 pm »
Try this (from memory, untested):

Code: [Select]
Function Mul(Const mat1,mat2:Matrix):Matrix;
Var
 i,j,k:Word;
 sum:Extended;
 u,v:PExtended;
 mat:Matrix;
 rowstride:integer;
Begin
 // check for smaller matrix here, must have at least 2 rows.
 rowstride:=ptrint(@mat[1,0])-ptrint(@mat[0,0]);
 For i:=0 to n-1 do
  For j:=0 to n-1 do
  Begin
   sum:=0;u:=@mat1[i];v:=@mat[i,j];
   For k:=0 to n-1 do
   Begin
    sum+=u^*v^;
    Inc(u);Inc(pbyte(v),rowstride);
   End;
   Mul[i,j]:=sum;
  End;
End;

Nitorami

  • Hero Member
  • *****
  • Posts: 605
Re: efficiency problem
« Reply #25 on: March 16, 2015, 07:45:49 pm »
It throws a runtime error 217 even when initialising mat with mat2.

DelphiFreak

  • Sr. Member
  • ****
  • Posts: 259
    • Fresh sound.
Re: efficiency problem
« Reply #26 on: March 16, 2015, 08:27:32 pm »

Maybe you could also test with FPC 2.7.1?
And think about optimizing some of the code? I always try to avoid multiplying by 0.  :D Because we know the result already.

Here is a link to my little collection of different executables for scimark:
http://sourceforge.net/projects/scimarkfordelphi/

Have fun.
Linux Mint 22, Lazarus 4.0, Windows 11, Delphi 12 Athens, Delphi 13 Florence

argb32

  • Jr. Member
  • **
  • Posts: 89
    • Pascal IDE based on IntelliJ platform
Re: efficiency problem
« Reply #27 on: March 16, 2015, 11:17:43 pm »
This one:
Code: [Select]
Type
  Float = Double;
  PFloat = ^Float;
  Vector=Array[0..n-1] of Float;
  Matrix=Array[0..n-1] of Vector;
  PMatrix = ^Matrix;

procedure Multiply(Const mat1, mat2: PMatrix; var Result: PMatrix);
Var
  i, j, k: Word;
  u, v, x: PFloat;
Begin
  // Note: Result^ is assumed to be zero-initialized before the call,
  // since the inner loop accumulates with +=.
  For i:=0 to n-1 do Begin
    u := @mat1^[i];
    For k:=0 to n-1 do begin
      v := @mat2^[k];
      x := @Result^[i];
      For j:=0 to n-1 do begin
        x^ += u^*v^;
        inc(v);
        inc(x);
      end;
      inc(u);
    End;
  end;
End;
is slightly faster than the latest version when Float = Extended, and ~3 times faster when Float = Double. Of course this compares versions with the same element type (Extended vs. Extended, Double vs. Double).
I had to use PMatrix to avoid an EStackOverflow on Linux.
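The PMatrix workaround can be sketched like this (illustrative names; New/Dispose put the data on the heap instead of the stack, avoiding the EStackOverflow):

```pascal
program HeapMatrix;
const
  n = 1000;
type
  Float   = Double;
  Vector  = array[0..n-1] of Float;
  Matrix  = array[0..n-1] of Vector;
  PMatrix = ^Matrix;
var
  a: PMatrix;
begin
  New(a);              // ~8 MB allocated on the heap, not the stack
  a^[0,0] := 42.0;
  writeln(a^[0,0]:0:1); // 42.0
  Dispose(a);          // free the heap block when done
end.
```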

wp

  • Hero Member
  • *****
  • Posts: 13353
Re: efficiency problem
« Reply #28 on: March 16, 2015, 11:44:00 pm »
This is not correct: in matrix multiplication you have to multiply the row values of the first matrix with the column values of the second matrix. In this method, however, you always multiply rows (or columns, depending on how you look at the matrix).

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: efficiency problem
« Reply #29 on: March 17, 2015, 02:16:14 am »
This is not correct: in matrix multiplication you have to multiply the row values of the first matrix with the column values of the second matrix. In this method, however, you always multiply rows (or columns, depending on how you look at the matrix).

I think argb32's idea works if the second matrix is actually the transpose of the original matrix. That also makes it possible to use SSE2 multiplication on these matrices, for instance.

 
