Lazarus

Programming => General => Topic started by: photor on March 15, 2015, 11:57:18 am

Title: efficiency problem
Post by: photor on March 15, 2015, 11:57:18 am
In my work I frequently use big matrix multiplications, so their efficiency is vital. In the following simple code, I multiply two 1000*1000 square matrices, which takes about 11 seconds on my MacBook Pro with a 2.7 GHz dual-core i7 CPU. In comparison, the same matrix multiplication takes only 0.07 seconds with a simple Mathematica code (I use version 8.0.4) on the same machine. What's wrong with FPC? Any suggestions? BTW, in both cases the platform I use is Windows 8.1. Sorry for my poor English.


Code: [Select]
program matmultiply;

Uses SysUtils;

Const
 n=1000;

Type
 Vector=Array[0..n-1] of Extended;
 Matrix=Array[0..n-1] of Vector;

Var
 StartMS:double;
 d,dd:Matrix;

Function Multiply(Const mat1,mat2:Matrix):Matrix;
Var
 i,j,k:Word;
 sum:Extended;
 v:Vector;
Begin
 For i:=0 to n-1 do
  For j:=0 to n-1 do
  Begin
   sum:=0;v:=mat1[i];
   For k:=0 to n-1 do sum+=v[k]*mat2[k,j];
   Multiply[i,j]:=sum;
  End;
End;

Function ChebyPts:Vector;
Var
 i:Word;
Begin
 For i:=0 to n-1 do ChebyPts[i]:=Cos(i*Pi/(n-1));
End;

Function ChebyDiff:Matrix;
Var
 i,j:Word;
 c,CP:Vector;
Begin
 For i:=1 to n-2 do c[i]:=1;
 c[0]:=2;c[n-1]:=2;
 CP:=ChebyPts;
 For i:=0 to n-1 do
  For j:=0 to n-1 do
   if i<>j then ChebyDiff[i,j]:=(c[i]/c[j])*Cos((i+j)*Pi)/(CP[i]-CP[j])
   else if i=0 then ChebyDiff[i,j]:=(2*(n-1)*(n-1)+1)/6
   else if i=n-1 then ChebyDiff[i,j]:=-(2*(n-1)*(n-1)+1)/6
   else ChebyDiff[i,j]:=-0.5*CP[j]/(1-CP[j]*CP[j]);
End;

begin
  d:=ChebyDiff;
  StartMS:=timestamptomsecs(datetimetotimestamp(now));
  dd:=Multiply(d,d);
  Writeln('Serially processed in '+floattostr(timestamptomsecs(datetimetotimestamp(now))-StartMS)+'ms.');
end.

Title: Re: efficiency problem
Post by: marcov on March 15, 2015, 12:26:28 pm
Are you using the exact same looping code in Mathematica with the most precise but slow Extended type? Or do you use its built-in optimized vector and matrix routines?

Anyway, a simple thing to try: store mat2[k,j] in a variable before the k loop. A pointer construct for the innermost loop might also help.

Code: [Select]
Function Multiply(Const mat1,mat2:Matrix):Matrix;
Var
 i,j,k:Word;
 sum:Extended;
 v:Vector;
 mat2value:extended;
 pv : PExtended;  // define as ^extended if not known
 pvend:pextended;
Begin
 For i:=0 to n-1 do
   begin
     v:=mat1[i];   // invariant in j loop.
  For j:=0 to n-1 do
  Begin
   sum:=0; mat2value:=mat2[k,j];
   pv:=@v[0]; pvend:=@v[n-1];
   while (pv<=pvend) do
     begin
       sum:=sum+pv^*mat2value;
       inc(pv);
     end;
   Multiply[i,j]:=sum;
  end;
  End;
End;
Title: Re: efficiency problem
Post by: Basile B. on March 15, 2015, 12:36:03 pm
Are you sure that the Vector elements must be of type Extended? If you use Single, the FP coprocessor will still process at 80 bit, but the data (heap) will be physically stored as 32 bit, which will reduce the processing time by 50%.

The time delta could be explained by the way Mathematica processes (e.g. SSE-optimized matrix multiplication, maybe even open compute?). You won't be able to get such results without an optimized assembly routine working with the SSE instruction set (or a special math library). FP is slow.

Pascal is not a domain-specific language, and writing such code is not necessarily easy for a scientist, while with Mathematica you don't have to mess with these technical details since everything is under the hood.
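For illustration, the change in the OP's declarations would be a one-liner (a hypothetical tweak, not from the post; precision drops to ~7 significant digits):

Code: [Select]
Type
 Vector=Array[0..n-1] of Single;  { was Extended: 4 bytes per element instead of 10 }
 Matrix=Array[0..n-1] of Vector;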
Title: Re: efficiency problem
Post by: Leledumbo on March 15, 2015, 12:37:18 pm
I believe Mathematica (like MATLAB and Octave) uses parallelization whenever the engine sees a multicore environment. FPC doesn't do that yet. Btw, the FPC version and optimization options play a big role in floating-point intensive programs like this. Your program runs 20 seconds on my Core i5-4200U when compiled plain, but only 11 seconds when I compile with full optimizations (on FPC 3.1.1; 2.6.4 doesn't have most of them) such as: -CX -XXs -O4 -CpCOREAVX2 -OpCOREAVX2 -CfAVX2. Depending on your needs, -OoFASTMATH could help too, by reducing precision.
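For reference, a command line with exactly those switches might look like this (the program name is just an example; add -OoFASTMATH if reduced precision is acceptable):

Code: [Select]
fpc -CX -XXs -O4 -CpCOREAVX2 -OpCOREAVX2 -CfAVX2 matmultiply.pas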
Title: Re: efficiency problem
Post by: Blaazen on March 15, 2015, 12:52:08 pm
Also, division by 6 may be slower than multiplying by 0.166666667. I don't know if FPC already optimizes this.
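For example, the divisions in the OP's ChebyDiff could be written as multiplications by a compile-time constant (a sketch of the idea only, not benchmarked):

Code: [Select]
Const
 OneSixth = 1.0/6.0;  { evaluated once at compile time }
...
 else if i=0 then ChebyDiff[i,j]:=(2*(n-1)*(n-1)+1)*OneSixth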
Title: Re: efficiency problem
Post by: photor on March 15, 2015, 01:21:48 pm
The documentation of Mathematica says it works at extended precision by default, which I understand as using the FPU by default. I am not sure if my understanding is exactly correct.
And your code may help a little, but ... Mathematica still works incredibly much faster than FPC ... I don't know if the situation is similar for C, C++ or Fortran.

Title: Re: efficiency problem
Post by: Nitorami on March 15, 2015, 01:51:21 pm
This extreme difference in efficiency is not credible.

I reckon Mathematica uses the Strassen algorithm, which reduces the runtime of square matrix multiplication from N^3 to N^2.807. With N=1000 this alone would reduce the runtime to 26%.
There are even better algorithms (Coppersmith–Winograd) with N^2.37, but Wikipedia says these are not practicable yet.

The major efficiency problem in your code is the inner loop and the access to mat2[k,j], where k is the loop variable. Array access is significantly faster if you just increment the index. Maybe you should consider transposing one of the matrices before the multiplication.

And, by the way, I am getting a stack overflow here. You should consider using dynamic arrays instead of static ones.

On the coprocessor: FPC normally uses the x87, but you may try SSE using the switch {$FPUTYPE SSE3}. Sometimes this produces faster code, but not always.

Also, try to make sure that Mathematica does not just cache previously calculated results. This article describes how MATLAB speeds things up with LAPACK:
http://de.mathworks.com/company/newsletters/articles/matlab-incorporates-lapack.html
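On the stack overflow: a minimal sketch (my own illustration, untested) of the dynamic-array variant of the OP's declarations; n is the same constant as in the original program:

Code: [Select]
Type
 Vector = Array of Extended;   { dynamic: data lives on the heap }
 Matrix = Array of Vector;

Var
 d, dd: Matrix;

Begin
 SetLength(d, n, n);           { allocates and zero-initializes }
 SetLength(dd, n, n);
 { element access stays d[i,j], but each row is now a separate heap block }
End.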


Title: Re: efficiency problem
Post by: photor on March 15, 2015, 02:10:21 pm
This is for the purpose of scientific calculation, so at least double precision is needed. And I have tried recompiling my program with double precision instead of extended, but the effect is negligible.
If Mathematica is really using the FPU to achieve extended precision, then there is no SSE or SSE2 optimization. I don't know what open compute is.

Title: Re: efficiency problem
Post by: Leledumbo on March 15, 2015, 02:15:13 pm
Quote
The documentation of Mathematica says it works at extended precision by default, which I understand as using the FPU by default.
On a platform with FPU support, FPC will use the FPU. x86(_64) FPU precisions consist of single (32-bit), double (64-bit) and extended (80-bit). Compiling your program on my machine with each of the three types, the rounded empiric time ratio is 5 : 10 : 11.5. The ratio of the bit sizes is 2 : 4 : 5, which multiplied by 2.5 gives 5 : 10 : 12.5. Therefore the speed difference between these types corresponds roughly to their bit sizes.
Quote
And, by the way, I am getting a stack overflow here. You should consider using dynamic arrays instead of static ones.
Same here, I had to convert the code to dynamic arrays first.
Title: Re: efficiency problem
Post by: photor on March 15, 2015, 02:16:38 pm
Thanks for so many warm replies  :)
I will consider all these points, try to optimize my code, and let you know the result.

Title: Re: efficiency problem
Post by: Nitorami on March 15, 2015, 02:24:50 pm
You may wish to try an optimised library; it is available for Delphi, so it should work with FPC too.

http://www.optivec.com/
Title: Re: efficiency problem
Post by: wp on March 15, 2015, 03:09:40 pm
For the calculation of 1 element of the 1000x1000 result matrix we need 1000 multiplications and additions. For the full matrix we need 1000x1000x1000 = 1E9 operations. In the era of gigahertz processors, and assuming a few clock cycles per math operation, the execution time per multiplication/addition is in the order of 1 ns = 1E-9 s. Therefore the overall calculation time (in one thread) must be in the order of seconds - your FPC result is very reasonable.

I tried your program in Delphi XE2 and it runs for 11 seconds on my machine, exactly the same time as Lazarus.
Title: Re: efficiency problem
Post by: Bart on March 15, 2015, 03:27:29 pm
@Marcov: your code produces an entirely different result matrix than the original one.

Bart
Title: Re: efficiency problem
Post by: derek.john.evans on March 15, 2015, 03:54:42 pm
Some interesting reading:

http://stackoverflow.com/questions/6058139/why-is-matlab-so-fast-in-matrix-multiplication

http://www.mathematica-journal.com/issue/v9i1/contents/newin5/newin5_2.html

http://en.wikipedia.org/wiki/Sparse_matrix
Title: Re: efficiency problem
Post by: marcov on March 15, 2015, 04:13:13 pm
Quote
@Marcov: your code produces an entirely different result matrix than the original one.

Yeah, mat2 also uses index k; I misread it as [i,j].

Title: Re: efficiency problem
Post by: photor on March 16, 2015, 06:14:52 am
Following your and Marcov's suggestions, I have rewritten the function Multiply (as the function Mul in the attached code) using pointers.
Now the same matrix multiplication takes less than 2 seconds on my machine using -O3 in Free Pascal 2.6.4. That's rather significant progress! Though it is still 20 times slower than Mathematica, which uses multi-core processing and LAPACK or something like that.

Code: [Select]
Function Mul(Const mat1,mat2:Matrix):Matrix;
Var
 i,j,k:Word;
 sum:Extended;
 u,v:PExtended;
 mat:Matrix;
Begin
 For i:=0 to n-1 do
  For j:=0 to n-1 do mat[j,i]:=mat2[i,j];
 For i:=0 to n-1 do
  For j:=0 to n-1 do
  Begin
   sum:=0;u:=@mat1[i];v:=@mat[j];
   For k:=0 to n-1 do
   Begin
    sum+=u^*v^;
    Inc(u);Inc(v);
   End;
   Mul[i,j]:=sum;
  End;
End;

Title: Re: efficiency problem
Post by: Leledumbo on March 16, 2015, 07:34:42 am
That's a case where we need array indexing optimization: the index is currently recomputed on every access even though the accesses in the loop are incremental. In theory, what you did by hand should be possible to do automagically by the compiler.
Title: Re: efficiency problem
Post by: Eugene Loza on March 16, 2015, 10:15:27 am
I think it's a case of 'honest' calculation.

E.g. in my university our department had a very hard time with some math software... Maybe it was MathCad, I don't remember... When they solved a third-order polynomial equation by approximation (correct answer, but slow) and by Euler's method (fast, but wrong answer) they got different results... Finally we found out that the math software derived a wrong cube root... My Pascal program did the job.

Another example was a double infinite integral for the Quantum Field department. No math software could handle it in days on rather modern computers (at the time). Only my Pascal program finally yielded the answer.

I bet the math soft is more stable and optimized these days. But sometimes you have to trust the result. That's where you dive into direct calculations and do everything honestly, without ambiguous optimization. Often it's better to calculate the result for a week and know you can trust it, than to get the result in seconds but then spend days analyzing the source of the error.

I did a test of a similar task on my 1.2 GHz single-core Celeron.
Code: [Select]
{$R-}{$Q-}

type valuetype = single;

var M1,M2,m3:array[1..1000,1..1000] of valuetype;

procedure TForm1.Button1Click(Sender: TObject);
var i,j,k:integer;
    T:TDateTime;
    sum:valuetype;
begin
 for i:=1 to 1000 do
  for j:=1 to 1000 do begin
    M1[i,j]:={round(}random*50{)};
    M2[i,j]:={round(}random*50{)};
  end;
 T:=now;
 for i:=1 to 1000 do
  for j:=1 to 1000 do begin
    sum:=0;
    for k:=1 to 1000 do
      sum:=sum + M1[i,k]*M2[k,j];
    M3[i,j]:=sum;
  end;
  Button1.caption:=inttostr(round((now-T)*24*60*60*1000))+' ms';
end;

With the results:
Single: 24 827 ms
Double: 37 832 ms
Extended: 116 832 ms
Integer: 51 894 ms
Word: 38 047 ms

I was really astonished to see that Double performed better than Word...
Title: Re: efficiency problem
Post by: rtusrghsdfhsfdhsdfhsfdhs on March 16, 2015, 12:26:00 pm
Just write it with C++ either with ICC or GCC and then statically link it. Simple solution.  8-)
Title: Re: efficiency problem
Post by: photor on March 16, 2015, 01:33:59 pm
I don't think C++ does that job better than Pascal.

Title: Re: efficiency problem
Post by: lagprogramming on March 16, 2015, 02:45:05 pm
Quote
I was really astonished to see that Double performed better than Word...
   I think that's due to the fact that you add the "round" function for the integer types. This conversion adds computation compared to plain floating-point multiplications.
   On my CPU, using Double takes only 5% more time than Single. Using Extended is almost 5 times slower.
   What surprises me is that using the native Integer on a 32-bit CPU appears to be a lot slower than using the Word type.
Title: Re: efficiency problem
Post by: Eugene Loza on March 16, 2015, 03:00:34 pm
Quote
I think that's due to the fact that you add the "round" function for the integer types. This conversion adds computation compared to plain floating-point multiplications.
I start counting time after the matrices have been prepared (where T:=now), so round doesn't affect it.
It's just a moment of revelation for me :o I always believed that integer operations were much faster than float...
Title: Re: efficiency problem
Post by: Leledumbo on March 16, 2015, 03:36:20 pm
Quote
I don't think C++ does that job better than Pascal.
Quote
Just write it with C++ either with ICC or GCC and then statically link it. Simple solution.  8-)
Language-wise: no. Compiler-wise: yes. GCC does have this indexing optimization, so does ICC, and AFAIK Clang too. Perhaps other BIG compilers as well.
Title: Re: efficiency problem
Post by: Nitorami on March 16, 2015, 07:15:29 pm
Quote
Now the same matrix multiplication takes less than 2 seconds on my machine using -O3 in Free Pascal 2.6.4. That's rather significant progress!

I'm surprised. I had rather thought that, in combination with an expensive operation such as float multiplication, array indexing would hardly matter. It is not quite a factor of 5 but only a factor of 2 on my old machine, but that is still quite considerable, and the code seems to calculate the correct results.

What bugs me even more is that the initial matrix transposition does not seem to contribute at all. Replacing the element-wise assignment

Code: [Select]
  For i:=0 to n-1 do for j:=0 to n-1 do mat[j,i]:=mat2[i,j];

by a unity assignment
Code: [Select]
  For i:=0 to n-1 do for j:=0 to n-1 do mat[j,i]:= 1.0; //wrong of course

or even a fast floodfill with zeroes
Code: [Select]
  filldword (mat, sizeof (mat) div 4, 0); //wrong of course

even makes the execution time larger, not smaller.

Edit: Replacing the pointer arithmetic in the inner loop by a normal array assignment makes your code only 20% slower. With FPC 3.1.1 it is the other way round: the normal array access is 20% faster. May just be another case of black magic...
Title: Re: efficiency problem
Post by: marcov on March 16, 2015, 07:29:07 pm
Try this (from memory, untested):

Code: [Select]
Function Mul(Const mat1,mat2:Matrix):Matrix;
Var
 i,j,k:Word;
 sum:Extended;
 u,v:PExtended;
 mat:Matrix;
 rowstride:integer;
Begin
 // check for smaller matrix here, must have at least 2 rows.
 rowstride:=ptrint(@mat[1,0])-ptrint(@mat[0,0]);
 For i:=0 to n-1 do
  For j:=0 to n-1 do
  Begin
   sum:=0;u:=@mat1[i];v:=@mat[i,j];
   For k:=0 to n-1 do
   Begin
    sum+=u^*v^;
    Inc(u);Inc(pbyte(v),rowstride);
   End;
   Mul[i,j]:=sum;
  End;
End;
Title: Re: efficiency problem
Post by: Nitorami on March 16, 2015, 07:45:49 pm
Throws an error 217 even if initialising mat with mat2.
Title: Re: efficiency problem
Post by: DelphiFreak on March 16, 2015, 08:27:32 pm

Maybe you could also test with FPC 2.7.1?
And think about optimizing the code itself? I always try to avoid multiplying by 0.  :D Because we know the result already.

Here is a link to my little collection of different executables for scimark:
http://sourceforge.net/projects/scimarkfordelphi/ (http://sourceforge.net/projects/scimarkfordelphi/)

Have fun.
Title: Re: efficiency problem
Post by: argb32 on March 16, 2015, 11:17:43 pm
This one:
Code: [Select]
Type
  Float = Double;
  PFloat = ^Float;
  Vector=Array[0..n-1] of Float;
  Matrix=Array[0..n-1] of Vector;
  PMatrix = ^Matrix;

procedure Multiply(Const mat1, mat2: PMatrix; var Result: PMatrix);
Var
  i, j, k: Word;
  u, v, x: PFloat;
Begin
  For i:=0 to n-1 do Begin
    u := @mat1^[i];
    For k:=0 to n-1 do begin
      v := @mat2^[k];
      x := @Result^[i];
      For j:=0 to n-1 do begin
        x^ += u^*v^;
        inc(v);
        inc(x);
      end;
      inc(u);
    End;
  end;
End;
is slightly faster than the latest version when Float = Extended, and ~3 times faster when Float = Double. Of course, compared to the version with the same element type (Extended vs Extended, Double vs Double).
I had to use PMatrix to avoid EStackOverflow on Linux.
Title: Re: efficiency problem
Post by: wp on March 16, 2015, 11:44:00 pm
This is not correct: in matrix multiplication you have to multiply the row values of the first matrix with the column values of the second matrix. In this method, however, you always multiply rows with rows (or columns with columns, depending on how you look at the matrix).
Title: Re: efficiency problem
Post by: engkin on March 17, 2015, 02:16:14 am
Quote
This is not correct: in matrix multiplication you have to multiply the row values of the first matrix with the column values of the second matrix. In this method, however, you always multiply rows with rows (or columns with columns, depending on how you look at the matrix).

I think argb32's idea works if the second matrix is actually the result of transposing the original matrix. It also makes it possible to use SSE2 multiplication on these matrices, for instance.
Title: Re: efficiency problem
Post by: marcov on March 17, 2015, 10:29:19 am
I fixed the code from my original post, and it was barely faster, despite the main (k) loop differing by only one asm instruction (the rowstride addition from a stack variable instead of just inc(<register>)). Somehow I couldn't convince the compiler to keep that (stride) variable in a register.

For me that is a strange result, since in such cases Delphi is faster. Maybe still some inflexibility in the register allocator.

Strange that something like that can have a 3x effect. The times for me (i7-3770, 3.1.1, -O4): pointer code 3.3s, original code 10s, my code 9s (so barely faster than the original array approach). 64-bit was 0.9s, but there extended = double = 8 bytes it seems, corresponding to the observations argb32 already made.

I saw that the mat2 copy is done using stos/stosl, and the compiler seems to do some loop modifications itself now.

I'll update this post with the code if I'm on my home machine again. (am home but on work laptop)
Title: Re: efficiency problem
Post by: engkin on March 17, 2015, 06:18:37 pm
Here is my experiment using SSE2 code:
Code: [Select]
program MatMulSSE2;
uses
  sysutils;

type
  valuetype = double; { should not be changed }

  {$R-}{$Q-}
const
  n = 1000;

var
  M1,M2,TM2,M3: array[1..n,1..n] of valuetype;
  row,col,k: integer;
  T: TDateTime;
  sum: valuetype;
  sums:array[1..2] of valuetype;
  gsum: valuetype;
  pRow, pCol: pointer;

label
  again;
begin
  Randomize;
 for row := 1 to n do
  for col := 1 to n do begin
    { example 1 } //{
    M1[row,col] := row; { or col }
    M2[row,col] := 1;//}

    { example 2 } {
    M1[row,col] := Random*50;
    M2[row,col] := Random*50;//}

    { example 3 } {
    if row = col then
    begin
      M1[row,col] := 1;
      M2[row,col] := 1;
    end
    else
    begin
      M1[row,col] := 0;
      M2[row,col] := 0;
    end;//}

    { Transposition of the second matrix }
    TM2[col, row] := M2[row,col];
  end;
 T := now;
 gsum := 0;
 for row := 1 to n do  //row from 1st matrix
  for col := 1 to n do  //col from 2nd matrix
  begin
    {  original code in Pascal --> to use remove } {
    sum := 0;
    for k:=1 to n do  //col from 1st / row from 2nd
      { using matrix M2 } {
      sum:=sum + M1[row,k]*M2[k,col]; //}

      { using transposed matrix TM2 }
      sum:=sum + M1[row,k]*TM2[col,k];
    M3[row, col] := sum;  //}

    {  SSE2 code } //{
    { SSE2 handles two double values at the same time }
    sums[1] := 0; sums[2] := 0;
    pRow := @M1[row, 1];
    pCol := @TM2[col, 1];

    {$ASMMODE intel}
    asm
      { loop counter }
      mov ecx, n/2

      { sum := 0 }
      movupd xmm2, sums

      { @M1[Row, 1] }
      mov eax, pRow

      { @M2[1, Col] }
      mov ebx, pCol

      again:

      { xmm0 := M1[Row, k .. k+1] }
      movapd xmm0, [eax]
      //movupd xmm0, [eax]

      { xmm1 := TM2[Col, k .. k+1] }
      movapd xmm1, [ebx]

      { xmm0 := M1[Row, k] * M2[k, Col] , M1[Row, k+1] * M2[k+1, Col] }
      mulpd xmm0, xmm1

      { xmm2 := xmm2 + xmm0 }
      addpd xmm2, xmm0

      { point at next pair of columns from M1[Row] }
      add eax, 16 { 2 * size of valuetype }

      { point at next pair of rows from TM2[Col] }
      add ebx, 16

      dec ecx
      jnz again

      { the following steps are equivalent to sums[1] := sums[1] + sums[2] }

      { xmm3 := xmm2 }
      movupd xmm3, xmm2

      { copy double value from higher position to lower position in xmm3 }
      unpckhpd xmm3, xmm3

      { add the two numbers: sums[1] := sums[1] + sums[2] }
      addpd xmm2, xmm3
      movupd sums, xmm2
    end ['eax', 'ebx', 'ecx', 'xmm0', 'xmm1', 'xmm2', 'xmm3'];
    M3[row, col] := sums[1];// + sums[2];//}
    gsum := gsum + M3[row, col];
  end;

  WriteLn('Grand sum of all the numbers in the result matrix: ', gsum);
  WriteLn('Time: ', round((now-T)*24*60*60*1000),' ms');
  WriteLn('');
  WriteLn('First 10 rows and 3 columns of the result matrix:');
  for row := 1 to 10 do
  begin
    for col := 1 to 3 do
      Write(M3[row, col], ' ');
    WriteLn('');
  end;
 end.

I adjusted one of the codes posted above. It doubled the speed on my old 32-bit system. SSE2 can handle two double-precision multiplications at the same time.
Title: Re: efficiency problem
Post by: Nitorami on March 17, 2015, 06:49:06 pm
Not bad. On my AMD Athlon it is 3000ms (vs. 4400ms using plain Pascal with SSE2 support). Still, I would think the compiler is not doing a bad job.

Wait... 3700ms without using assembler  :)

Code: [Select]
{$FPUTYPE SSE3}
Function Mulx(Const mat1,mat2:Matrix):Matrix;
Var
  i,j,k,m: longint;
  sum: float;
  u,v: Pfloat;
  mat: Matrix;
Begin
  For i:=0 to n-1 do for j:=0 to n-1 do mat[j,i]:=mat2[i,j];
  For i:=0 to n-1 do
  begin
    For j:=0 to n-1 do begin
      sum:=0;
      u:=@mat1[i];
      v:=@mat [j];
      For k:= 0 to n div 2 -1 do
      Begin
        sum+=u^*v^; Inc(u);Inc(v);
        sum+=u^*v^; Inc(u);Inc(v);
      End;
      result[i,j]:=sum;
    end;
  end;
End;
Title: Re: efficiency problem
Post by: argb32 on March 17, 2015, 09:42:17 pm
wp, engkin: it's not my idea, it's the usual "ikj" algorithm; it gives a result identical to the "ijk" one without any transposition. For fixed i and k, the inner j loop adds mat1[i,k] times row k of mat2 onto row i of the result, so each result element still ends up as the same sum over k, just accumulated in a different order. The speedup is due to the more cache-friendly memory access pattern.
There is almost no speedup when Float = Extended or Float = Int64, for some reason. Didn't test in 64-bit mode.
My implementation was missing the initial zeroing-out (but the test passed  :o ). Here is the fixed one (still 3 times faster for Double):
Code: [Select]
Type
  Float = Double;
  PFloat = ^Float;
  Vector=Array[0..n-1] of Float;
  Matrix=Array[0..n-1] of Vector;
  PMatrix = ^Matrix;

procedure Multiply(Const mat1, mat2: PMatrix; var Result: PMatrix);
Var
  i, j, k: Word;
  u, v, x: PFloat;
Begin
  For i:=0 to n-1 do Begin
    u := @mat1^[i];
    x := @Result^[i];
    For k:=0 to n-1 do begin
      x^ := 0;
      inc(x);
    end;
    For k:=0 to n-1 do begin
      v := @mat2^[k];
      x := @Result^[i];
      For j:=0 to n-1 do begin
        x^ += u^*v^;
        inc(v);
        inc(x);
      end;
      inc(u);
    End;
  end;
End;

It's sad that the compiler doesn't do the job with pointer arithmetic for array indexing and lets us do it by hand.
Pointer arithmetic is bad practice.
Title: Re: efficiency problem
Post by: Leledumbo on March 18, 2015, 01:57:57 am
Quote
It's sad that the compiler doesn't do the job with pointer arithmetic for array indexing and lets us do it by hand.
Pointer arithmetic is bad practice.
It is. I'd love to get my hands dirty, but I'm lacking the required skills. I'll retry when I get some time; maybe I was just not experienced enough.
Title: Re: efficiency problem
Post by: engkin on March 18, 2015, 02:50:22 pm
@Nitorami, impressive. I wonder if SSE2 speed on AMD CPUs is not as good as on Intel's.

@argb32, without transposition the results are wrong.
Title: Re: efficiency problem
Post by: Nitorami on March 18, 2015, 06:17:57 pm
Quote
I wonder if SSE2 speed on AMD CPUs is not as good as on Intel's.
My trusted AMD CPU is 10 years old... don't blame it  ;)

Quote
without transposition the results are wrong

I think not. The transposition is there to avoid cache thrashing from accessing elements in steps of 1000. The same can be achieved by restructuring the loop like

Code: [Select]
  For i:=0 to n-1 do
  for k:=0 to n-1 do
  for j:=0 to n-1 do result[i,j] += mat1[i,k]*mat2[k,j];

The result looks OK to me, and the performance is approximately the same as with the transposition method.

Unfortunately, as argb32 said, we really need pointer arithmetic, as the plain Pascal code is 3x slower. I found that for simple loops over one-dimensional arrays FPC seems to produce quite efficient code, but for two-dimensional arrays there is a lot of room for optimisation.


Title: Re: efficiency problem
Post by: engkin on March 18, 2015, 06:58:43 pm
Ah, that explains it. Thanks Nitorami.
Title: Re: efficiency problem
Post by: photor on March 19, 2015, 10:51:20 am
Maybe Free Pascal automatically initializes variables to zero?
And I forgot to mention that my Windows 8.1 is 64-bit.

Title: Re: efficiency problem
Post by: photor on March 19, 2015, 12:23:11 pm
Using argb32's idea and your SSE3 optimization, it now takes less than 1 s (about 950ms) on my machine:
Code: [Select]
{$FPUTYPE SSE3}
Function Mul(Const mat1,mat2:Matrix):Matrix;
Var
 i,j,k:Word;
 u,v,p:pextended;
Begin
 For i:=0 to n-1 do
 begin
  u:=@mat1[i];
  For j:=0 to n-1 do
  Begin
   v:=@mat2[j];p:=@Mul[i];
   For k:=0 to (n div 2)-1 do
   begin
    p^+=u^*v^;inc(v);inc(p);
    p^+=u^*v^;inc(v);inc(p);
   end;
   inc(u);
  End;
 end;
End;

Title: Re: efficiency problem
Post by: photor on March 19, 2015, 02:00:20 pm
No. By looking at the assembler code, I found that Free Pascal still compiles it as two mulsd instructions instead of one mulpd instruction. The speedup comes from halving the number of loop iterations, which reduces the jump overhead, I think. It is sad that Free Pascal can't do SSE2 vectorization in such a simple case  :'(

Title: Re: efficiency problem
Post by: photor on March 19, 2015, 04:33:15 pm
Added note: my Free Pascal version is 2.6.4. I don't know if a higher version will do the correct SSE2 optimization.
Title: Re: efficiency problem
Post by: Nitorami on March 19, 2015, 06:50:20 pm
@photor

Quote
Maybe Free Pascal automatically initializes variables to zero?
No, only dynamic arrays are zeroed on creation with SetLength (). But I think argb32 does in fact zero his tmp variable by
Code: [Select]
    For k:=0 to n-1 do begin
      x^ := 0;
      inc(x);

Quote
And I forgot to mention that my Windows 8.1 is 64-bit.
If you want to use 64-bit, you can download the x64 version of FPC. It is a command-line cross compiler and comes without an IDE; to use it with an IDE you would have to build one yourself. At least this is what I understood. I do not know how to do this, however.
Quote
It is sad that Free Pascal can't do SSE2 vectorization in such a simple case
Yes, well, I would love FPC to be the best optimising compiler in the world, but I guess that would need a lot of resources, which are tied up otherwise, for issues such as Unicode support and platform independence. We should also consider that the vast majority of programmers use C or one of its clones, and accordingly huge effort has been put into those compilers. It should be no surprise that they may perform better.
On the other hand, algebraic algorithms like matrix multiplication have been optimised manually to buggery over the past 30 years, with each CPU manufacturer offering its own optimised BLAS library, and no compiler will reach the performance of these highly tuned libraries anyway. The best solution in your case is probably to link one of these libraries into your code. Why reinvent the wheel?

Quote
Added note: My free pascal version is 2.64. I don't know if higher version will do the correct sse2 optimization.
I have tried it with 3.1.1 (trunk as of 29 January) and the answer is no.


Title: Re: efficiency problem
Post by: photor on March 20, 2015, 03:38:12 pm
Quote
On the other hand, algebraic algorithms like matrix multiplication have been optimised manually to buggery over the past 30 years, with each CPU manufacturer offering its own optimised BLAS library, and no compiler will reach the performance of these highly tuned libraries anyway. The best solution in your case is probably to link one of these libraries into your code. Why reinvent the wheel?
But how do I find and link such a library, provided it is free (I don't want to use commercial things in my program), for example for an Intel 64-bit CPU, in my Free Pascal program?
Title: Re: efficiency problem
Post by: Leledumbo on March 20, 2015, 06:37:31 pm
Quote
But how do I find and link such a library, provided it is free (I don't want to use commercial things in my program), for example for an Intel 64-bit CPU, in my Free Pascal program?
Google its name and read the license yourself?
http://www.netlib.org/lapack/
http://www.netlib.org/blas/
Title: Re: efficiency problem
Post by: Nitorami on March 20, 2015, 06:48:35 pm
Yes, and the Wikipedia article on BLAS also contains a lot of links to BLAS implementations. The BLAS interface is a de facto standard, rather old, and many implementations have been created, most of them open source. CPU manufacturers may offer their own implementation for download, particularly tuned to their specific processors. It might be better to get a generic implementation which is less optimised but more portable.

The problem for a noob like me is getting an overview of the jungle. I downloaded a Windows dll from OpenBLAS just to try it out, and managed to link it as an external library, but then got an error message that this dll needs yet another fortran-whatever.dll. O heavens, I do not know where to get that, and I do not like having my programs depend on a pile of external dlls in the first place, so I gave up. Can't be bothered to faff around with that stuff any longer, but if you find out how to do it I would be interested to know...

Edit - I should also be grateful if someone could explain why an external library is not sufficient to get this thing running, and why I need to install MinGW and MSys and what not.
Title: Re: efficiency problem
Post by: JorgeAldo on March 20, 2015, 09:46:46 pm
I've not delved into the whole topic, but if your problem is easily parallelizable and the setup time isn't too big, you might try splitting it into threads.

Search Google for "pascal actor model".
It's a library of mine that makes it easier to parallelize problems in the background over an arbitrary number of actors.
It's fully OOP and allows easy memory passing among actors and the foreground threads.
Title: Re: efficiency problem
Post by: mas steindorff on March 21, 2015, 01:57:02 am
I have not followed him in a while, but you should check out the stuff "aminer" offers. I believe I saw he offered a multi-threaded matrix multiply a while ago where the times were almost 50 times faster than a single thread.
Title: Re: efficiency problem
Post by: mnar53 on March 21, 2015, 07:44:11 am
Just download mingw32_dll.zip or mingw64_dll.zip (depending on whether you are after a 32 or 64 bit environment) from the OpenBLAS SourceForge page.
They both contain 3 dlls which are needed at runtime because of how the OpenBLAS dll was compiled: put them in the same folder as OpenBLAS. One final note: all the functions in OpenBLAS follow the cdecl convention, for example:

function  dasum  (constref N: integer; X : pDoubleArray; constref incX : integer) : double; cdecl; external 'openblas.dll';
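As a usage illustration (my own sketch, untested; pDoubleArray is assumed here to be a plain pointer to double):

Code: [Select]
type
  pDoubleArray = ^double;                { assumption: pointer to the first element }
var
  v: array[0..3] of double = (1.0, -2.0, 3.0, -4.0);
  cnt: integer = 4;
  step: integer = 1;
begin
  { dasum returns the sum of absolute values: expect 10.0 here }
  WriteLn(dasum(cnt, @v[0], step):0:1);
end.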

OpenBLAS is quite performant; actually I think it is the best choice for CPU computing, unless you are moving to GPU computing (e.g., CUDA), assuming you have a good graphics card.

By the way: OpenBLAS actually contains all the LAPACK routines, some of them optimized, but I think they all benefit from the use of the OpenBLAS routines.

Best regards
Title: Re: efficiency problem
Post by: Nitorami on March 21, 2015, 10:58:42 am
Thank you all. I copied libopenblas.dll and the other required dlls, and it compiles. I set up a small example for a square matrix multiplication, but it throws an access violation. Probably an error in passing the parameters C-style. I tried constref and other variations but it does not help. Can someone give me a hint how it should be done?

Code: [Select]
{$mode objfpc}
Uses SysUtils;

Const Dim=1000;

Type
 Matrix=Array[0..Dim-1, 0..Dim-1] of double;
 int = longint;

Var
 T0: TDateTime;
 A0, B0, C0: Matrix;
 A, B, C: pdouble;


procedure DGEMM (TRANSA,TRANSB: char;
                  m,n,k: int;
                  alpha: double;
                  A    : pdouble;
                  LDA  : int;
                  B    : pdouble;
                  LDB  : int;
                  beta : double;
                  C    : pdouble;
                  LDC  : int); cdecl; external 'libopenblas';



procedure Init (var M: Matrix);
Var i,j: dword;
Begin
 For i:=0 to Dim-1 do for j:=0 to Dim-1 do M[i,j] := 1.0* (i+j);
End;

begin
  Init (A0);  B0:=A0; C0 := B0;
  A := @A0; B := @B0; C := @C0;
  T0 := Now;

  DGEMM ('n','n', 1000, 1000, 1000, 1.0, A, 1000, B, 1000, 0.0, C, 1000);

  Writeln('Time elapsed: ',(Now-T0)*3600*24*1000:0:0,'ms.');
end.


EDIT: Wait. It works with the constref prefix on ALL parameters, including the chars, but not the pdouble ones.

Performance: single-core AMD Athlon, 32-bit BLAS: 7 times faster than the best Pascal code in this thread (excluding engkin's assembler version).
Title: Re: efficiency problem
Post by: Nitorami on March 21, 2015, 11:33:27 am
Alright, we don't even need the pointers if the matrix is passed by variable. Attached is the simplified code.

@photor - instruction:
Download BLAS and mingw pack from http://sourceforge.net/projects/openblas/files/v0.2.13/. Unpack the dlls into a directory of your choice and set this directory under Options / Directories / Libraries in the FPC IDE. Then the code should already compile.

You might consider using the 64bit version, and the multithreaded options proposed by Steindorff and JorgeAldo. On my single core CPU, I do not expect a benefit from multiple threads.

Code: [Select]
{$mode objfpc}
Uses SysUtils;

Const Dim=1000;

Type
 Matrix=Array[0..Dim-1, 0..Dim-1] of double;
 int = longint;

Var
 T0: TDateTime;
 A0, B0, C0: Matrix;

procedure DGEMM (constref TRANSA,TRANSB: char;
                  constref m,n,k: int;
                  constref alpha: double;
                  var A    : Matrix;
                  constref LDA  : int;
                  var B    : Matrix;
                  constref LDB  : int;
                  constref beta : double;
                  var C    : matrix;
                  constref LDC  : int); cdecl; external 'libopenblas';

procedure Init (var M: Matrix);
Var i,j: dword;
Begin
 For i:=0 to Dim-1 do for j:=0 to Dim-1 do M[i,j] := 1.0* (i+j);
End;

begin
  Init (A0);  B0:=A0;
  T0 := Now;
  DGEMM ('n','n', Dim, Dim, Dim, 1.0, A0, Dim, B0, Dim, 0.0, C0, Dim);
  Writeln('Time elapsed: ',(Now-T0)*3600*24*1000:0:0,'ms.');
end.
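If you compile from the command line instead of the IDE, the same directory can be supplied with the -Fl option (the path here is just an example):

Code: [Select]
fpc -Fl"C:\openblas" yourprogram.pas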
Title: Re: efficiency problem
Post by: mnar53 on March 21, 2015, 12:47:08 pm
I think it's better to use a pointer for your data; it's more general. As an example, I use a FreePascal dll and pass it a VBA matrix (called a pSafeArray in FreePascal), and from this pSafeArray I can recover a pointer to the data. So the data never moves from VBA, nor is it copied: FreePascal and OpenBLAS operate directly on it.
Title: Re: efficiency problem
Post by: photor on March 21, 2015, 02:58:08 pm
Thanks.

Title: Re: efficiency problem
Post by: finalpatch on March 23, 2015, 02:40:46 pm
I don't believe Mathematica actually performs the matrix multiplication. Most likely it merely returns a lazily evaluated object. This can work because you only see a small portion of the result at any given time, so it only needs to compute the elements that are visible on screen as you scroll through them.
Title: Re: efficiency problem
Post by: Nitorami on March 23, 2015, 07:11:16 pm
Quite possible; it sounds reasonable. But in any case it should be possible to force calculation of the whole matrix by requesting its determinant or similar.

MATLAB definitely uses LAPACK to speed up calculations, and I was somewhat surprised to see what optimisation can achieve. A summary for my old machine (square matrix 1000 x 1000, double precision):

Photor's original code: 23500 ms
More cache-friendly layout, by different loop arrangement: 16800 ms
More cache-friendly layout, by matrix transposition: 8000 ms
Either of the two above, but using pointer arithmetic: 4000 ms
Further optimisation by loop unrolling: 3000 ms - this is, I reckon, about the best you can get in Pascal without assembler programming
OpenBLAS: 560 ms

Another factor of 3 should be possible with multithreading, and maybe a factor of 1.5...2 by going 64-bit. Altogether, the 70ms cited by Photor at the start of this thread no longer sounds unrealistic.


Title: Re: efficiency problem
Post by: argb32 on March 23, 2015, 09:22:37 pm
Most likely Mathematica and other specialized software use different algorithms. For example, the Strassen algorithm for matrix multiplication has better complexity than O(n^3), so the larger n is, the bigger the difference between such an algorithm and a naive ijk (or even ikj) implementation.
Strassen also seems to be more cache friendly, as it processes the matrix in MxM blocks. Matrices can be stored as a set of such blocks to avoid conversions.
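For reference, the standard Strassen formulas (textbook material, not from the post): splitting A, B and C into 2x2 blocks, one level of recursion replaces 8 block multiplications with 7:

Code: [Select]
{ A = [A11 A12; A21 A22], B = [B11 B12; B21 B22], all operations blockwise }
M1 := (A11 + A22) * (B11 + B22);
M2 := (A21 + A22) * B11;
M3 := A11 * (B12 - B22);
M4 := A22 * (B21 - B11);
M5 := (A11 + A12) * B22;
M6 := (A21 - A11) * (B11 + B12);
M7 := (A12 - A22) * (B21 + B22);

C11 := M1 + M4 - M5 + M7;
C12 := M3 + M5;
C21 := M2 + M4;
C22 := M1 - M2 + M3 + M6;
{ 7 multiplications instead of 8, applied recursively: O(n^log2(7)) = O(n^2.807) }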
Title: Re: efficiency problem
Post by: engkin on March 23, 2015, 10:20:26 pm
Mathematica uses ATLAS (http://math-atlas.sourceforge.net/faq.html#who), which during installation tests and selects the fastest kernel (http://math-atlas.sourceforge.net/devel/atlas_contrib/node11.html). Kernels are optimized to benefit from the L1 cache.

It turns out that the bottleneck is moving data from main memory to registers, not the multiplication itself. Making sure data is already in cache before it is needed increases the speed a lot by itself, for example by issuing prefetch instructions to prepare the next cache line while doing the multiplications on the current one.
Title: Re: efficiency problem
Post by: photor on March 31, 2015, 03:37:44 pm
I have tested your code. It compiles OK, but when running it gives the error message "libgfortran-3.dll not found". Why?

Title: Re: efficiency problem
Post by: photor on March 31, 2015, 03:58:15 pm
Quote
I have not followed him in a while, but you should check out the stuff "aminer" offers. I believe I saw he offered a multi-threaded matrix multiply a while ago where the times were almost 50 times faster than a single thread.
I have checked aminer's stuff, but it is gone.
Title: Re: efficiency problem
Post by: engkin on March 31, 2015, 04:07:03 pm
Quote
I have tested your code. It compiles OK, but when running it gives the error message "libgfortran-3.dll not found". Why?

You need to download mingw32_dll.zip (http://sourceforge.net/projects/openblas/files/v0.2.13/mingw32_dll.zip/download) (or mingw64_dll.zip (http://sourceforge.net/projects/openblas/files/v0.2.13/mingw64_dll.zip/download)) from that same link.
Title: Re: efficiency problem
Post by: photor on March 31, 2015, 04:33:59 pm
Quote
You need to download mingw32_dll.zip (http://sourceforge.net/projects/openblas/files/v0.2.13/mingw32_dll.zip/download) (or mingw64_dll.zip (http://sourceforge.net/projects/openblas/files/v0.2.13/mingw64_dll.zip/download)) from that same link.
Of course I have mingw64.dll, otherwise there would be an error message "mingw64.dll not found" before the libgfortran-3.dll one.
Title: Re: efficiency problem
Post by: engkin on March 31, 2015, 04:45:49 pm
Quote
Of course I have mingw64.dll, otherwise there would be an error message "mingw64.dll not found"

Yes, sure. You *might* have mingw64.dll alone. The link provides libgfortran-3.dll, libgcc_s_seh-1.dll, and libquadmath-0.dll, which you need to put in the same directory (or in the path) as your exe file. So yes, you need to download mingw64_dll.zip, expand it, and put the files where they are visible to your application.
Title: Re: efficiency problem
Post by: Nitorami on March 31, 2015, 04:47:17 pm
The dlls are loaded at runtime and need to be found by the operating system. Put them in the search path, or in the same folder as the exe file.
Title: Re: efficiency problem
Post by: mas steindorff on March 31, 2015, 08:10:58 pm
Quote
I have checked aminer's stuff, but it is gone.
Here is a post about his matrix demo back in 2013:
http://forum.lazarus.freepascal.org/index.php/topic,23009.msg136762.html#msg136762
He still has the code on his web site at
http://pages.videotron.com/aminer/
where he says he worked off a math library from "mrsoft" @
http://www.mrsoft.org/home/index.html
Hope that helps
Title: Re: efficiency problem
Post by: photor on April 04, 2015, 04:31:32 am
Thanks. Following your instructions, I downloaded mingw64_dll.zip and obtained all the dll files needed. This time the test program compiled and ran, but finally raised an exception, class "External: SIGSEGV". Has anybody successfully tested this in the 64-bit case?
Title: Re: efficiency problem
Post by: photor on April 04, 2015, 04:39:12 am
I have checked all of these. The download links on his website all redirect to a docs.google.com site, and eventually return a "not found" message.
Title: Re: efficiency problem
Post by: Nitorami on April 04, 2015, 10:40:06 am
Yes, I had the same issue with aminer's files. But having only a single-core CPU, his main approach of splitting tasks into threads ("workers") won't benefit me anyway.

I have not tried the 64-bit version because I already suspected this might be a step too far and would cause problems, most likely with parameter passing. Even in a 32-bit environment it took me a while to translate the C expectations into Pascal syntax, using constref and var as required, and I received many SIGSEGVs before I got it right.
In the 64-bit version, the parameters will again be passed differently, in different registers and what not. It may be easier with the 64-bit FPC cross compiler, which is available but without an IDE; to work with an IDE, you will have to make your own build first. My advice: the 32-bit version spares you a lot of hassle, and you will not get more than a factor of 2 improvement from using 64-bit anyway.

Title: Re: efficiency problem
Post by: photor on April 05, 2015, 03:36:06 pm
But where can I download the 64bit FPC cross compiler?
Title: Re: efficiency problem
Post by: Nitorami on April 05, 2015, 04:32:28 pm
http://www.freepascal.org/download.var

But if you do not absolutely need it, better stay with the standard 32 bit version !!!
Title: Re: efficiency problem
Post by: photor on April 05, 2015, 04:41:38 pm
Quote
http://www.freepascal.org/download.var
Thanks. But all the download links on that page redirect to
http://sourceforge.net/projects/freepascal/files/
and I still cannot find the cross compiler there, sorry.
Title: Re: efficiency problem
Post by: Nitorami on April 05, 2015, 04:44:54 pm
Instead of SourceForge, go to the Austria mirror.
Title: Re: efficiency problem
Post by: photor on April 05, 2015, 05:39:43 pm
Quote
Instead of SourceForge, go to the Austria mirror.
Ok, I see.
Title: Re: efficiency problem
Post by: mas steindorff on April 06, 2015, 08:16:47 pm

Quote
I have checked all of these. The download links on his website all redirect to a docs.google.com site, and eventually return a "not found" message.
FYI, I just tried downloading a few zips from these links with success (using Chrome). The site may have been down when you tried.
Title: Re: efficiency problem
Post by: engkin on April 07, 2015, 03:11:35 am
Quote
Thanks. Following your instructions, I downloaded mingw64_dll.zip and obtained all the dll files needed. This time the test program compiled and ran, but finally raised an exception, class "External: SIGSEGV". Has anybody successfully tested this in the 64-bit case?

Sorry I missed your post. I am on a 32-bit OS so I can't be sure. But I see that the 64-bit OpenBLAS comes in two flavors:
Quote
Win64-int32 for 64-bit Windows with 32-bit int (default on 64-bit Windows)
Win64-int64 for 64-bit Windows with 64-bit int
You might want to try the first one, as Nitorami's code is suitable for 32bit int:
Code: [Select]
int = longint;
OR (and this is just a guess) make:
Code: [Select]
  int = int64;
Title: Re: efficiency problem
Post by: photor on April 08, 2015, 03:50:15 pm
Thank you very much!
After I replaced "longint" with "int64", the 64bit version works!

Title: Re: efficiency problem
Post by: photor on April 08, 2015, 03:55:51 pm
Result comparison on my 2.7GHz i7 cpu with 64bit Win 8.1:
OpenBLAS 64bit: 140~150ms
Mathematica: 70~80ms
Although Mathematica is still two times faster than OpenBLAS, they are comparable.
Title: Re: efficiency problem
Post by: engkin on April 08, 2015, 05:13:55 pm
After I replaced "longint" with "int64", the 64bit version works!

That's good news, thanks for reporting back!

Result comparison on my 2.7GHz i7 cpu with 64bit Win 8.1:
OpenBLAS 64bit: 140~150ms
Mathematica: 70~80ms
Although Mathematica is still two times faster than OpenBLAS, they are comparable.

I believe Mathematica uses ATLAS (http://forum.lazarus.freepascal.org/index.php/topic,27730.msg172451.html#msg172451), and probably with threads (http://math-atlas.sourceforge.net/faq.html#tsafe). If you want to reach that speed, consider using threads and/or switching to ATLAS.
Title: Re: efficiency problem
Post by: mnar53 on April 15, 2015, 08:11:43 pm
About the "difficulty" of the 64 bit environment: not really. In x64 there is only ONE CALLING CONVENTION,
so you do not care if a 64bit dll was written in freepascal, c, or fortran: it's the same. And forget about
'cdecl' or 'stdcall'.

By the way, this opens up many interesting possibilities: ANY 64 bit dll can be accessed. For example,
CUDA or CUBLAS, if we're after numerical computations on the GPU.
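
For instance, a single declaration is enough to import from such a DLL (the library and function names below are just placeholders, not a real API):

Code: [Select]
// On x64 every DLL follows the platform's single calling convention,
// so the same declaration works whatever language the DLL was built in.
function gpu_multiply(a, b, c: PDouble; n: LongInt): LongInt;
  external 'somegpulib64.dll' name 'gpu_multiply';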
Title: Re: efficiency problem
Post by: DelphiLazus on April 26, 2017, 02:49:45 am
Good morning,
Firstly, I apologize for reviving this old thread, but I think what has been debated here is the closest to my doubts and related to my topic. If it is necessary to start a new thread, I would be grateful if you let me know.
Secondly, apologies for writing so much: I have tried to be as precise and detailed as possible, and I hope I will be understood in spite of my lousy English.

I will explain my case.
I have implemented a "mini framework" dedicated to linear algebra operations and related topics. I have been using it for some time in my projects, and it is time to improve it for other projects that require a more robust design.
My "mini framework" is based on dynamic two-dimensional arrays (for matrices), and now that I have become familiar with more advanced topics I am adopting a one-dimensional scheme; in other words, treating the matrix as a vector.
From what I understand, when one does SetLength(array, rows, cols), each row is allocated as an independent dynamic array, which requires double indirection (a pointer to pointers) to access a position (i, j) within the matrix.
With the one-dimensional scheme we have the advantage that, everything being contiguous, we can implement algorithms that are friendlier to the cache.
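
For example, a minimal sketch of the one-dimensional scheme I mean (all the names here are only illustrative):

Code: Pascal  [Select]
type
  TFlatMatrix = record
    Rows, Cols: Integer;
    Data: array of Double;            // one contiguous block, row-major
  end;

procedure InitFlat(out M: TFlatMatrix; ARows, ACols: Integer);
begin
  M.Rows := ARows;
  M.Cols := ACols;
  SetLength(M.Data, ARows * ACols);   // single allocation, no per-row pointers
end;

// Element (i,j) costs one multiplication and one addition,
// instead of the double indirection of array-of-array.
function GetElem(const M: TFlatMatrix; i, j: Integer): Double; inline;
begin
  Result := M.Data[i * M.Cols + j];
end;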

As far as my knowledge goes, it should be possible to improve the performance of my algorithms with the help of pointers and a bit of pointer-arithmetic "magic". So instead of the classic SetLength(array, dim), I tried reserving memory directly, like this: AllocMem(p, rows * cols * SizeOf(Double)).

I did a couple of tests with the classic matrix multiplication to see how it behaves, and I got a couple of surprises. It turns out that my algorithms based on SetLength and dynamic arrays are somewhat faster than my implementations based exclusively on pointers! Could someone explain this to me?
I then tried a "mixed" approach combining both: 1) declare a dynamic array and 2) access it through pointers. Still, the results favor the SetLength versions with dynamic arrays; the mixed approach turned out to be even slower than the pure pointer approach.

I also tried two traversal alternatives: the old-school three nested loops, as well as a variation with two loops. The three-loop version is twice as fast. For example, for a multiplication (600 x 1000) x (1000 x 800) I get average times of 5 seconds for the three-loop case and 11 seconds for the two-loop algorithms. My hypothesis is that in the two-loop design I must compute the position (i, j) within A and B on every access, while the three-loop version does not need this extra calculation.
My PC is an Intel i7-2670QM CPU 2.20GHz 64bit, Windows 8.1 64bit, 14GB RAM. I use CodeTyphon 5.9, which if I am not mistaken comes with FPC 3.1.1.
Does this mean I cannot improve my algorithms any further?

Searching about pointers, matrix multiplication, and optimization techniques, I came to this thread. Seeing the proposals leaves me in a conceptual mess; the codes exposed here differ from mine and I do not fully understand them.
As I see it, this thread has debated the case of multiplying square matrices stored in fixed-size arrays, and one of the propositions even assumes an even dimension to halve the operations. My head tells me the proposals discussed can be generalized to an (m x n) x (n x p) multiplication. But I am not entirely sure whether using fixed or dynamic arrays has any implications for the final results. My questions in this regard: Why use a fixed size? Does it matter if I use dynamic arrays? Does pointer access change from one to the other?

For my part, I have started researching cache-oblivious algorithms, and it is my intention to go precisely in that direction. A block-based, cache-oblivious matrix multiplication algorithm (not necessarily for square matrices) can be used, recursively dividing the matrices along the larger dimension. My goal is to combine the cache-oblivious algorithm with the best access technique (pointers, or the traditional SetLength with dynamic arrays). But first I need to clear up whether I am doing reasonably well at working with matrices through pointers as I currently do, because if I go by the timing results, I should lean towards SetLength() with dynamic arrays and not complicate myself with pointers.

I have even tested with larger matrices (e.g., (30000 x 5000) x (5000 x 4000)), and still the algorithm with SetLength and dynamic arrays is the winner. I had hypothesized that the larger the structures, the smaller the difference in times would become, and that at some size the pointer-based version would win.

I know there are third-party libraries that do a good job. I have seen part of the mrMath code, and the truth is that among the whole sea of function calls it is a mess to follow the trail. I cannot quite assimilate how it works; it seems to use something similar to what I tried in my "mixed" design, but it ends up delegating to ASM instructions, and I am not familiar with ASM. I got stuck there.

I also know that there are LAPACK and BLAS, that today they are a standard, and I have read that OpenBLAS has been proposed in this thread. Now, it is not clear to me how LAPACK, BLAS, or OpenBLAS can be used from Lazarus and/or Delphi. Is it possible? How? I read nothing in their documentation that gives me a definite YES. As far as I can see, LAPACK is written in FORTRAN. It would not be the first time I take on porting FORTRAN code to Object Pascal; it is a path I do not discard if it were necessary to improve my developments. In the long run I hope to expand my library, and studying LAPACK could be productive: viewing the LAPACK code could help improve mine.

I have even used TPMath (I had to consult it to study Jacobi's algorithm), and now I am exploring AlgLib.
I am not entirely convinced about using third-party libraries, even free ones, because of licensing issues. AlgLib, for instance, is GPL; and although with TPMath and mrMath there would apparently be no problems using them in both proprietary and free projects, I prefer not to depend much on third-party libraries.

It is not my intention to go down to ASM (at least not in the medium term, perhaps sometime in the future); after all, I have no knowledge of it. But I do intend at least to improve my initial work, since in my new projects I expect to handle larger structures than I was accustomed to.

If someone could look at the code I attached and give me some direction, I would be very grateful.
The test application consists of 3 SpinEdits, a few buttons and a Memo for logging.
The SpinEdits allow defining the m, n, and p dimensions of the arrays.
Button1 contains an initialization test code with the pointer method.
Button2 cleans the contents of the memo.
Button3: Multiplication algorithm of 3 cycles using pointers.
Button4: Multiplication algorithm of 3 cycles using dynamic arrays.
Button5: Multiplication algorithm of 3 cycles using the mixed version.
Button6: Test button, to display a dialog box the size of a PDouble.
Button7: Multiplication algorithm of 2 cycles using pointers.
Button8: Multiplication algorithm of 2 cycles using dynamic array.
Button9: Multiplication algorithm of 2 cycles using the mixed version.
Button10: no code.

Thank you very much for reading me, and again, apologies for my lousy English.

Regards,
Title: Re: efficiency problem
Post by: Phil on April 26, 2017, 03:20:40 am
If someone could look at the code I attached and give me some direction, I would be very grateful.

I haven't read the earlier posts in the thread and don't really have much interest (or time) to look at your code, but here are a couple of things to think about and try:

- You probably have at least 4 cores on your computer, so figure out how to use threads in your code. That means breaking it down so that a part of the computations can be run in a thread, independent of the other parts. A threaded app on a 4-core computer can realize up to a 4x speed improvement. That alone may be more than all of your ideas will ever realize. And then your code will also be able to scale up if the need arises. For example, an 8-core computer - no changes needed in the code to get even more speed improvement.

Note that this will mean separating your UI from your non-UI (computational) code.

- Try running your app with the -O2 optimization button selected instead of -O1 in Project Options. In my experience, -O2 can give a nice speed bump when a lot of array accesses are done - but only if range-checking is turned off.

You do have range checking turned off (Lazarus default), although I would turn it on during development and only turn it off for production use. I think of compiler-generated range checks as "assertions for free" since you don't have to do anything except flip on a switch to save a lot of headaches tracking down crashes.

Title: Re: efficiency problem
Post by: DelphiLazus on April 26, 2017, 03:40:49 am
I did not expect such a quick response.
Thanks Phil for your advice.

I have to confess I had not considered the use of threads. I know there are third-party libraries, some of them paid, that use threads; I do not know how they do it.
I would have to research the subject, and I think it would add too much complexity; more than I can afford.
I can get a rough idea of how a matrix multiplication with threads could be designed: using block multiplication, assign a thread to each block, and at the end copy the results into the resulting matrix. This technique would involve doubling the amount of memory.
I do not know if it is possible to make all threads share the same array.

But the point is that it is not just about multiplying matrices. My library contains many more things, and there may be algorithms where this approach is not justified. For example, I think it is not possible for QR, but Jacobi's method for symmetric matrices might be.
It would be a lot of work.

For now, I will try changing the configuration as you suggest in the coming days.

Regards,
Title: Re: efficiency problem
Post by: Phil on April 26, 2017, 04:00:01 am
I do not know if it is possible to make all threads share the same array.

http://wiki.freepascal.org/Multithreaded_Application_Tutorial

I generally don't obsess too much over performance. A lot of optimizing of the pure computational parts of an app is lost once you go into production and actually have to work with large files, Internet connection, etc. A program that does a lot of computations often works with large data sets that have to be read in from somewhere - that's probably the biggest bottleneck right there, not the computation itself.

It's also probably cheaper just to get more hardware than to try to coax the same speed improvement out of your code. For example, suppose we both write the same program, one multi-threaded but without obsessing over computational performance, the other single-threaded but with highly optimized computations. Then the boss says: we need it to be 2x faster. The first program can achieve that for the cost of more hardware; the second may never achieve it, regardless of how cheaply the programmer can work.
Title: Re: efficiency problem
Post by: DelphiLazus on April 26, 2017, 04:39:37 am
Thanks for the link.

I tend to worry about algorithms and performance. I try as much as possible to improve them and to use the data structures as well as possible.
The current status of my "mini framework" is not optimal. It has its practical utility, but it is time to give it a "little" push. I am about to embark on a project in which my matrices will not be small, so I need to polish it. They will not be 1 million x 1 million either, but of an appreciable size.

It could give me a more scalable approach. I have set milestones for myself:
1. Improve data structures and algorithms, where possible and allowed.
2. Add multi-thread support.

I see step 2 as a long-term goal. First I would like to see what can be improved without a change as radical as threads.

I understand your point of view, although my intention is to make improvements in steps better matched to my knowledge and my time.

If I can improve my approach in something, I will have fulfilled my partial objective.
I do not intend, for example, to get the calculation for a 1000x1000 matrix solved in 1 second. More than anything, I am trying to improve my algorithms to something more decent.
For example, if with some tips I can reduce the calculation from 11 s to 8 s, that is already progress for me.

Regards,
Title: Re: efficiency problem
Post by: mischi on April 26, 2017, 07:56:13 am
I did not do a thorough investigation, but if you want ultimate performance, it will be hard for your own program to compete against Lapack/Blas/Atlas. To the best of my knowledge, memory access is the bottleneck when dealing with large amounts of data. Specifically, you need to take the memory cache into account, in particular with much data and few operations. I once had an example where setting 4 arrays to zero was the bottleneck. When trying to use 2 or more cores, one for each array, the program became SLOWER, because the cores were swapping the arrays in and out of the cache.
This can become quite involved and CPU/cache specific. When compiling Atlas (http://math-atlas.sourceforge.net), several methods are actually benchmarked and the best one is taken. At least that is what they claim.

Calling C and Fortran routines is not so difficult. You need to translate the function calls into a Pascal header and link the library. I could help you with a simple example. With 2-dimensional arrays, you need to take into account that C has the same row/column layout as Pascal, whereas Fortran has it reversed. I would expect that you have to make a decision up front whether you want a pure Pascal solution and do everything yourself, or go for ultimate performance by calling Lapack/Blas.

All above is only to the best of my knowledge and might be utterly wrong ;-)

MiSchi.
Title: Re: efficiency problem
Post by: Nitorami on April 26, 2017, 10:22:18 am
Sorry, I cannot delve into your code now. Just a few hints:

- You cannot compete against LAPACK or similar; it will be at least a factor of 3-5 better than your best code.

- Array of array of float is very convenient but not too efficient, as the matrix may be spread across memory and the compiler has to calculate the memory location on each access. As you already noted, a linear structure in memory, i.e. a vector, is better.

- Don't get yourself into the trouble of manual memory allocation. Use dynamic arrays. You can still use pointer access to them if that is better; just assign a PDouble to @Matrix[0]. Only if you repeatedly have to allocate and deallocate the structure should you keep in mind that a dynamic array is automatically initialised with zeroes on creation, which is not the case for a structure generated with New(). The initialisation takes a bit of time, but in general this is negligible.

- The major bottleneck with matrix multiplication, in my experience, is that one matrix can be read linearly, i.e. as it is arranged in memory, but the other cannot, which may give you extreme cache penalties. Always consider transposing that matrix before multiplication so it can be accessed linearly - see the sketch below.

Edit: Splitting a task into several threads is not necessarily complicated, see the example for the Mandelbrot calculation here http://benchmarksgame.alioth.debian.org/u64q/program.php?test=mandelbrot&lang=fpascal&id=5
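
For illustration, here is a minimal sketch combining the last two hints - transpose first, then read both operands linearly through PDouble. The type and routine names are only illustrative, and the result matrix is assumed to be pre-allocated:

Code: Pascal  [Select]
type
  TVec = array of Double;  // one row
  TMat = array of TVec;    // convenient, but rows may be scattered in memory

// C := A * B for n x n matrices; C must already have size n x n.
procedure MulTransposed(const A, B: TMat; var C: TMat; n: Integer);
var
  BT: TMat;
  i, j, k: Integer;
  pa, pb: PDouble;
  sum: Double;
begin
  SetLength(BT, n, n);
  for i := 0 to n - 1 do           // pay an O(n^2) transpose once...
    for j := 0 to n - 1 do
      BT[j][i] := B[i][j];
  for i := 0 to n - 1 do
    for j := 0 to n - 1 do
    begin
      pa := @A[i][0];              // pointer access into the dynamic arrays
      pb := @BT[j][0];
      sum := 0;
      for k := 0 to n - 1 do       // ...so the O(n^3) inner loop reads both
        sum := sum + pa[k] * pb[k];  // operands sequentially (cache friendly)
      C[i][j] := sum;
    end;
end;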

Title: Re: efficiency problem
Post by: DelphiLazus on April 26, 2017, 08:01:03 pm
I did not do a thorough investigation, but if you want ultimate performance, it will be hard for your own program to compete against Lapack/Blas/Atlas. [...]
The swapping of memory in the cache is inevitable, with or without threads.
I do not have much experience with threads, and I am inclined to think it may be a little too complicated for me.
I do not see myself writing a threaded design these days; I will consider it for a "V3" of my library.

I had not heard of ATLAS. I will keep it in mind.


Calling C and Fortran routines is not so difficult. You need to translate the function calls into a Pascal header and link the library. [...]
I do not expect to match LAPACK, but at least something could be learned from it and used as a guide to know where to improve my algorithms. So I told myself that, if necessary, I am not afraid to explore the source code and take notes.
If you say that it is possible to call LAPACK routines, I would be grateful if, in some free moment, you could give me some feedback. I am aware that in FORTRAN matrices are in column-major order.

I am preparing my library to deal with "logical layouts" both by rows and by columns. This allows me to prepare variants of the algorithms that take advantage of the best situations.
By logical layout I mean that, while physically in Pascal the arrays are in row-major order, at the logical level, when you create the matrices, you choose a layout for how the data will be organized. Thus, for example, if I want the matrix A of dimension (10 x 30) to be column-major, positions 0..9 will correspond to the 1st column, indices 10..19 to the 2nd column, and so on.
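
For example, the index mapping I have in mind looks roughly like this (a simplified sketch; the names are only illustrative):

Code: Pascal  [Select]
type
  TLayout = (loRowMajor, loColMajor);

// Maps logical position (i,j) of an R x C matrix onto the flat vector,
// according to the chosen logical layout.
function FlatIndex(Layout: TLayout; i, j, R, C: Integer): Integer; inline;
begin
  if Layout = loRowMajor then
    Result := i * C + j   // rows stored one after another
  else
    Result := j * R + i;  // columns stored one after another
end;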

Sorry I cannot delve into your code now. Just a few hints-

- You cannot compete against LAPACK or similar, this will be at least factor 3-5 better than your best code
I know; that is what I assumed. I cannot compete against geniuses who have spent years perfecting that huge library.
As I said a few paragraphs earlier, at least I could hope to learn something.

- Array of array of float is very convenient but not too efficient, as the matrix may be spread across memory and the compiler has to calculate the memory location on each access. As you already noted, a linear structure in memory, i.e. a vector, is better.

- Don't get yourself into the trouble of manual memory allocation. Use dynamic arrays. You can still use pointer access to them if that is better; just assign a PDouble to @Matrix[0]. Only if you repeatedly have to allocate and deallocate the structure should you keep in mind that a dynamic array is automatically initialised with zeroes on creation, which is not the case for a structure generated with New(). The initialisation takes a bit of time, but in general this is negligible.
That proposal was my third attempt, the one I called "mixed", since it combines declaring a dynamic array with access through pointers.
My tests showed that this way is a bit slower than my previous attempt based exclusively on pointers.
To be precise, the ranking, regardless of the sizes I tried, was always:

1st place: algorithm using SetLength() and dynamic arrays
2nd place: algorithm using AllocMem() and pointers
3rd place: mixed algorithm: reserve a dynamic array with SetLength() and access it through pointers

I have no explanation for this.

- The major bottleneck with matrix multiplication, in my experience, is that one matrix can be read linearly, i.e. as it is arranged in memory, but the other cannot, which may give you extreme cache penalties. Always consider transposing that matrix before multiplication so it can be accessed linearly.

Edit: Splitting a task into several threads is not necessarily complicated, see the example for the Mandelbrot calculation here http://benchmarksgame.alioth.debian.org/u64q/program.php?test=mandelbrot&lang=fpascal&id=5
I am aware that the traditional algorithm suffers from cache penalties.
Using the transpose of matrix "B" is one alternative, although unless it is planned from the start to store it that way, it is a cost to take into account in the actual running time T(n), even though the O() complexity remains cubic for square matrices. My view is that if you can keep T(n) from growing, so much the better. If one opts for transposing, there are also cache-oblivious algorithms, which I intend to include.

As I said before, I am preparing my library to accept logical layouts. I am going to implement overloaded versions that work with the different layouts.

One way to deal with cache penalties, and to avoid having to transpose, assuming row-major order, is to recursively divide the matrices and multiply in blocks until reaching a reasonable size, then apply loop unrolling. The description of the algorithm can be read in the following paper:
http://supertech.csail.mit.edu/papers/Prokop99.pdf

My goal in this thread is to learn about the best way to allocate and access a matrix structure.
Testing with the traditional matrix multiplication algorithm, without improvements such as the ikj loop order instead of ijk, block multiplication, or transposition, is my way of evaluating how the different techniques behave.

The truth is that I did not expect the simplest approach, SetLength() and the traditional M[i,j], to give me better results than pointers.
It puzzles me all the more when I explore, for example, the mrMath code and see that it uses pointers (and yes, a good amount of ASM), or AlgLib (in its free version), which uses dynamic arrays only.

I expect the trend, and the ranking I gave above, to hold for other situations, operations, and algorithms.
If the simplest version wins without improvements, it should keep winning with them; at least that is how I understand it.
Either I am doing something wrong with pointers, or there is "compiler magic" behind dynamic arrays that Lazarus/FreePascal exploits very well.

Regards,
Title: Re: efficiency problem
Post by: mischi on April 26, 2017, 08:32:39 pm
If you can come up with a simple example which uses Lapack/Blas routines, I could try to help out, like a product of a vector and a matrix or similar.
Title: Re: efficiency problem
Post by: DelphiLazus on April 27, 2017, 03:43:15 am
If you can come up with a simple example which uses Lapack/Blas routines, I could try to help out, like a product of a vector and a matrix or similar.
I think I did not explain myself very well.
I have never used or invoked any FORTRAN function from Delphi or Lazarus; I do not know how to do it.
What I have done a couple of times is read loose FORTRAN code and port it to Object Pascal.

So I had commented that, if it is really possible to use LAPACK from Delphi and/or Lazarus, you could guide me through the process in some free moment you have. I could write the test algorithms myself to check how they work for me.

After a couple of hours I found a site, in Russian, that seems reliable and explains how to install LAPACK and use it from Delphi. I would think it should be possible to do the same with Lazarus.
http://lib.dystlab.com/index.php/content/it/50-how-to-use-blas-lapack-in-delphi

I will be able to try this in 2 days; for work reasons I must be away and will not be online so often.

Regards,
Title: Re: efficiency problem
Post by: mischi on April 27, 2017, 09:33:50 am
I do not know Russian, but the examples look OK; at least the overall direction is right. I do not know Windows very well, and the installation of Blas/Lapack on Windows requires MinGW and is quite elaborate, so unfortunately I will not really be able to help you much with that part. Linux and macOS are easier on this point, because they come with Blas/Lapack.

The example loads the library dynamically at run time. Another option is to link it directly; it does not make a big difference.

The example also contains the translation of a function header to Pascal:

TBlasDGEMM = function(TRANSA, TRANSB: PChar; M, N, K: PInteger; ALPHA, A: PDouble; LDA: PInteger; B: PDouble; LDB: PInteger; BETA, C: PDouble; LDC: PInteger): Integer; cdecl;

You need such declarations for all the functions you are using, and I can help you with the translation.
As you may see, all parameters are pointers. In Fortran, parameters are always passed by reference, whereas C passes by value. Pascal can do both: for passing by reference, you pass a pointer or use the keyword var.

I have never used BLAS/LAPACK extensively, but about 20 other libraries, including ffmpeg, which is more complex.

For a start, I suggest constructing a minimal console example to test whether installation and building are successful. I can test it on macOS and pass it to you as a reference.
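
Roughly like this, using the TBlasDGEMM type above (an untested sketch: it assumes an OpenBLAS build named libopenblas.dll exporting the Fortran symbol dgemm_; both names vary between builds):

Code: Pascal  [Select]
program dgemmdemo;
{$mode objfpc}

uses
  dynlibs; // FPC's dynamic loading unit

type
  TBlasDGEMM = function(TRANSA, TRANSB: PChar; M, N, K: PInteger;
    ALPHA, A: PDouble; LDA: PInteger; B: PDouble; LDB: PInteger;
    BETA, C: PDouble; LDC: PInteger): Integer; cdecl;

var
  lib: TLibHandle;
  dgemm: TBlasDGEMM;
  ta, tb: Char;
  m, n, k, lda, ldb, ldc: Integer;
  alpha, beta: Double;
  A, B, C: array[0..3] of Double; // 2x2 matrices, column-major as Fortran expects
  i: Integer;
begin
  lib := LoadLibrary('libopenblas.dll');               // assumed file name
  if lib = NilHandle then Halt(1);
  dgemm := TBlasDGEMM(GetProcAddress(lib, 'dgemm_'));  // 'dgemm' in some builds
  if dgemm = nil then Halt(2);

  // A = |1 3|   B = |5 7|   stored column by column
  //     |2 4|       |6 8|
  A[0] := 1; A[1] := 2; A[2] := 3; A[3] := 4;
  B[0] := 5; B[1] := 6; B[2] := 7; B[3] := 8;
  for i := 0 to 3 do C[i] := 0;

  ta := 'N'; tb := 'N';
  m := 2; n := 2; k := 2; lda := 2; ldb := 2; ldc := 2;
  alpha := 1.0; beta := 0.0;
  // Fortran passes everything by reference, hence all the @
  dgemm(@ta, @tb, @m, @n, @k, @alpha, @A[0], @lda, @B[0], @ldb,
        @beta, @C[0], @ldc);

  // Expect C = |23 31|, i.e. 23 34 31 46 in column-major order
  //            |34 46|
  for i := 0 to 3 do Write(C[i]:8:1);
  WriteLn;
  UnloadLibrary(lib);
end.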

MiSchi.
Title: Re: efficiency problem
Post by: photor on February 09, 2025, 11:33:05 am
After so many years, I just realized that one of the main reasons is that matrices in Pascal (like C) are row-major instead of column-major, so the Multiply function in the code should be modified as
Code: Pascal  [Select][+][-]
Function Multiply(Const mat1,mat2:Matrix):Matrix;
Var
 i,j,k:Word;
 sum:Float;
 v:Vector;
Begin
 For j:=0 to n-1 do
  Begin
   For k:=0 to n-1 do v[k]:=mat2[k,j];
   For i:=0 to n-1 do
   Begin
    sum:=0;
    For k:=0 to n-1 do sum+=mat1[i,k]*v[k];
    Multiply[i,j]:=sum;
   End;
  End;
End;
Now running the code takes about 1.5 seconds, instead of 11 seconds.
Title: Re: efficiency problem
Post by: Paolo on February 17, 2025, 06:27:40 pm
well done Photor,

but why didn't you use I-K-J or K-I-J looping? They are row-oriented, and I see they are much faster. Later I'll post my results.
Title: Re: efficiency problem
Post by: LV on February 17, 2025, 11:58:38 pm
After so many years, I just realized that one of the main reasons is that matrices in Pascal (like C) are row-major instead of column-major, so the Multiply function in the code should be modified as above. [...]
Now running the code takes about 1.5 seconds, instead of 11 seconds.

It seems that all computers have been multi-core for a long time. We should utilize this capability. :-[

Laptop AMD Ryzen 7 4700U:

Single-threaded program

Code: Pascal  [Select][+][-]
program MatrixMultiplication;

uses
  SysUtils,
  Math,
  DateUtils;

const
  N = 1000;

type
  TMatrix = array [0..N - 1, 0..N - 1] of double;
  TVector = array [0..N - 1] of double;

var
  A, B, C: TMatrix;
  StartTime, EndTime: TDateTime;

  procedure InitializeMatrix(var M: TMatrix);
  var
    i, j: integer;
  begin
    for i := 0 to N - 1 do
      for j := 0 to N - 1 do
        M[i, j] := Random;
  end;

  function Multiply(const mat1, mat2: TMatrix): TMatrix;
  var
    i, j, k: word;
    sum: double;
    v: TVector;
    res: TMatrix;
  begin
    FillChar(res, SizeOf(res), 0);
    for j := 0 to N - 1 do
    begin
      for k := 0 to N - 1 do v[k] := mat2[k, j];
      for i := 0 to N - 1 do
      begin
        sum := 0;
        for k := 0 to N - 1 do sum := sum + mat1[i, k] * v[k];
        res[i, j] := sum;
      end;
    end;
    Multiply := res;
  end;

begin
  Randomize;
  InitializeMatrix(A);
  InitializeMatrix(B);

  StartTime := Now;
  C := Multiply(A, B);
  EndTime := Now;

  Writeln('Execution Time: ', MilliSecondsBetween(EndTime, StartTime), ' ms');
  readln;
end.

Execution Time: 876 ms  :(

Multi-threaded program

Code: Pascal  [Select][+][-]
program ParMatrixMultiplication;

uses
  SysUtils,
  Classes,
  Math,
  DateUtils;

const
  N = 1000; // Matrix size
  ThreadCount = 8; // Number of threads

type
  TMatrix = array [0..N - 1, 0..N - 1] of double;
  TVector = array [0..N - 1] of double;

var
  A, B, C: TMatrix;
  StartTime, EndTime: TDateTime;

  // Matrix initialization with random values
  procedure InitializeMatrix(var M: TMatrix);
  var
    i, j: integer;
  begin
    for i := 0 to N - 1 do
      for j := 0 to N - 1 do
        M[i, j] := Random;
  end;

  // Thread for matrix multiplication
type
  TMatrixMultiplier = class(TThread)
  private
    FStartRow, FEndRow: integer;
  protected
    procedure Execute; override;
  public
    constructor Create(StartRow, EndRow: integer);
  end;

  constructor TMatrixMultiplier.Create(StartRow, EndRow: integer);
  begin
    inherited Create(True);
    FStartRow := StartRow;
    FEndRow := EndRow;
    FreeOnTerminate := False;
  end;

  procedure TMatrixMultiplier.Execute;
  var
    i, j, k: integer;
    sum: double;
    v: TVector;
  begin
    for j := 0 to N - 1 do
    begin
      // Optimization: copy column of matrix B to vector
      for k := 0 to N - 1 do
        v[k] := B[k, j];

      // Multiplication for the row range
      for i := FStartRow to FEndRow do
      begin
        sum := 0;
        for k := 0 to N - 1 do
          sum := sum + A[i, k] * v[k];
        C[i, j] := sum;
      end;
    end;
  end;

  // Main program
var
  Threads: array of TMatrixMultiplier;
  i, Step: integer;
begin
  Randomize;
  InitializeMatrix(A);
  InitializeMatrix(B);
  SetLength(Threads, ThreadCount);

  StartTime := Now;

  // Create and start threads
  Step := N div ThreadCount;
  for i := 0 to ThreadCount - 1 do
  begin
    if i < ThreadCount - 1 then
      Threads[i] := TMatrixMultiplier.Create(i * Step, (i + 1) * Step - 1)
    else
      Threads[i] := TMatrixMultiplier.Create(i * Step, N - 1);
    Threads[i].Start;
  end;

  // Wait for all threads to finish
  for i := 0 to ThreadCount - 1 do
  begin
    Threads[i].WaitFor;
    Threads[i].Free;
  end;

  EndTime := Now;

  Writeln('Execution Time: ', MilliSecondsBetween(EndTime, StartTime), ' ms');
  Readln;
end.

Execution Time: 164 ms  ;)
Title: Re: efficiency problem
Post by: photor on February 18, 2025, 10:18:54 am
Execution Time: 164 ms  ;)

Good job, LV! But I intend to make a fair comparison of the single-threaded speed first 8)
Title: Re: efficiency problem
Post by: photor on February 18, 2025, 10:26:22 am
well done Photor,

but why didn't you use I-K-J or K-I-J looping? They are row-oriented, and I see they are much faster. Later I'll post my results.

Looking forward to seeing your code and results 8)
Title: Re: efficiency problem
Post by: ALLIGATOR on February 18, 2025, 12:01:10 pm
Code: Pascal  [Select][+][-]
program MatrixMultiplication;
{$mode objfpc}
{$optimization on}

uses
  SysUtils,
  DateUtils;

const
  N = 1000;

type
  TMatrix = array [0..N - 1, 0..N - 1] of double;
  TVector = array [0..N - 1] of double;

var
  A, B: TMatrix;
  StartTime: TDateTime;

  procedure InitializeMatrix(out M: TMatrix);
  var
    i, j: integer;
  begin
    for i := 0 to N - 1 do
      for j := 0 to N - 1 do
        M[i, j] := Random;
  end;

  function Multiply(const mat1, mat2: TMatrix): TMatrix;
  var
    i, j, k: NativeUInt;
    sum: double;
    v: TVector;
  begin
    for j := 0 to N - 1 do
    begin
      for k := 0 to N - 1 do v[k] := mat2[k, j];
      for i := 0 to N - 1 do
      begin
        sum := 0;
        k := 0;
        while k < N div 8 do
        begin
          sum += mat1[i, k+0] * v[k+0]
               + mat1[i, k+1] * v[k+1]
               + mat1[i, k+2] * v[k+2]
               + mat1[i, k+3] * v[k+3]
               + mat1[i, k+4] * v[k+4]
               + mat1[i, k+5] * v[k+5]
               + mat1[i, k+6] * v[k+6]
               + mat1[i, k+7] * v[k+7];
          inc(k, 8);
        end;
        Result[i, j] := sum;
      end;
    end;
  end;

begin
  InitializeMatrix(A);
  InitializeMatrix(B);

  StartTime := Now;
  Multiply(A, B);
  Writeln('Execution Time: ', MilliSecondsBetween(Now, StartTime), ' ms');

  ReadLn;
end.

I may have done something wrong - I can't figure it out, my head is busy with other things - but this code speeds up the calculation significantly (several times over). But I repeat: it requires verification.
Title: Re: efficiency problem
Post by: Paolo on February 18, 2025, 12:26:51 pm
You are doing loop unrolling; yes, it can help.
Title: Re: efficiency problem
Post by: photor on February 18, 2025, 12:32:16 pm
I may have done something wrong - I can't figure it out, my head is busy with other things - but this code speeds up the calculation significantly (several times over). But I repeat: it requires verification.

What are your command line options for compilation?
Title: Re: efficiency problem
Post by: Thaddy on February 18, 2025, 12:32:55 pm
You are doing loop unrolling; yes, it can help.
The compiler does that for you, and often more efficiently.
Title: Re: efficiency problem
Post by: photor on February 18, 2025, 01:04:29 pm

I may have done something wrong - I can't figure it out, my head is busy with other things - but this code speeds up the calculation significantly (several times over). But I repeat: it requires verification.

"while k < N div 8 do" is wrong.
Title: Re: efficiency problem
Post by: Paolo on February 18, 2025, 02:18:36 pm
I am busy now, so this is not checked :o

Code: Pascal  [Select][+][-]
program MatrixMultiplication;
{$mode objfpc}
{$optimization on}

uses
  SysUtils,
  DateUtils;

const
  N = 1000;

type
  TMatrix = array [0..N - 1, 0..N - 1] of double;
  TVector = array [0..N - 1] of double;

var
  A, B: TMatrix;
  StartTime: TDateTime;

  procedure InitializeMatrix(out M: TMatrix);
  var
    i, j: integer;
  begin
    for i := 0 to N - 1 do
      for j := 0 to N - 1 do
        M[i, j] := Random;
  end;

  function Multiply(const mat1, mat2: TMatrix): TMatrix;
  var
    i, j, k, s: NativeUInt;
    sum: double;
    v: TVector;
    TotalBlockSize: integer;
  begin
    TotalBlockSize := 8 * (N div 8);
    for j := 0 to N - 1 do
    begin
      for k := 0 to N - 1 do v[k] := mat2[k, j];
      for i := 0 to N - 1 do
      begin
        sum := 0;
        k := 0;
        while k < TotalBlockSize do
        begin
          sum += mat1[i, k+0] * v[k+0]
               + mat1[i, k+1] * v[k+1]
               + mat1[i, k+2] * v[k+2]
               + mat1[i, k+3] * v[k+3]
               + mat1[i, k+4] * v[k+4]
               + mat1[i, k+5] * v[k+5]
               + mat1[i, k+6] * v[k+6]
               + mat1[i, k+7] * v[k+7];
          inc(k, 8);
        end;
        // remainder loop for the last N mod 8 elements
        for s := k to N - 1 do
          sum := sum + mat1[i, s] * v[s];
        Result[i, j] := sum;
      end;
    end;
  end;
Title: Re: efficiency problem
Post by: ALLIGATOR on February 18, 2025, 02:40:08 pm
"while k < N div 8 do" is wrong.

 :D
Ha ha! ) You're right! A keen eye ) Thanks!!!

The corrected version, on my hardware (i7-8750H, laptop) gives x2

Code: Pascal  [Select][+][-]
program MatrixMultiplication;
{$mode objfpc}
{$optimization on}

uses
  SysUtils,
  DateUtils, Math;

const
  N = 1000;

type
  TMatrix = array [0..N - 1, 0..N - 1] of double;
  TVector = array [0..N - 1] of double;

var
  A, B, R1, R2: TMatrix;
  StartTime: TDateTime;

  procedure InitializeMatrix(out M: TMatrix);
  var
    i, j: integer;
  begin
    for i := 0 to N - 1 do
      for j := 0 to N - 1 do
        M[i, j] := Random;
  end;

  function Multiply1(const mat1, mat2: TMatrix): TMatrix;
  var
    i, j, k: NativeUInt;
    sum: double;
    v: TVector;
  begin
    for j := 0 to N - 1 do
    begin
      for k := 0 to N - 1 do v[k] := mat2[k, j];
      for i := 0 to N - 1 do
      begin
        sum := 0;
        for k := 0 to N - 1 do
        begin
          sum := sum + mat1[i, k] * v[k];
        end;
        Result[i, j] := sum;
      end;
    end;
  end;

  function Multiply2(const mat1, mat2: TMatrix): TMatrix;
  var
    i, j, k: NativeUInt;
    sum: double;
    v: TVector;
  begin
    for j := 0 to N - 1 do
    begin
      for k := 0 to N - 1 do v[k] := mat2[k, j];
      for i := 0 to N - 1 do
      begin
        sum := 0;
        k := 0;
        while k < N do
        begin
          sum += mat1[i, k+0] * v[k+0]
               + mat1[i, k+1] * v[k+1]
               + mat1[i, k+2] * v[k+2]
               + mat1[i, k+3] * v[k+3];
          inc(k, 4);
        end;
        Result[i, j] := sum;
      end;
    end;
  end;

function IsSame(const mat1, mat2: TMatrix): Boolean;
var
  i, k: NativeUInt;
begin
  Result := True;
  for i := 0 to N - 1 do
    for k := 0 to N - 1 do
      if not SameValue(mat1[i, k], mat2[i, k]) then Exit(False);
end;

begin
  Randomize;
  InitializeMatrix(A);
  InitializeMatrix(B);

  StartTime := Now;
  R1 := Multiply1(A, B);
  Writeln('Execution Time 1: ', MilliSecondsBetween(Now, StartTime), ' ms');

  StartTime := Now;
  R2 := Multiply2(A, B);
  Writeln('Execution Time 2: ', MilliSecondsBetween(Now, StartTime), ' ms');

  WriteLn('IsSame: ', IsSame(R1, R2));

  ReadLn;
end.

Code: Pascal  [Select][+][-]
Execution Time 1: 1164 ms
Execution Time 2: 653 ms
IsSame: TRUE

Title: Re: efficiency problem
Post by: Paolo on February 18, 2025, 03:12:36 pm
@Alligator,
but it is still wrong: you are assuming "n" to be a multiple of 4, but that is not the general case.
Title: Re: efficiency problem
Post by: ALLIGATOR on February 18, 2025, 03:15:05 pm
@Alligator,
but it is still wrong: you are assuming "n" to be a multiple of 4, but that is not the general case.
Of course it is! I believe this is sufficient to show the acceleration effect itself.

UPD: I corrected the code for N not divisible by the unroll factor; the correction seems to be wrong, but it doesn't change the essence: unrolling the loop body gives acceleration, depending on the hardware - on mine, x2.
Title: Re: efficiency problem
Post by: ALLIGATOR on February 18, 2025, 05:16:37 pm
Code: Pascal  [Select][+][-]
Execution Time 1: 1245 ms
Execution Time 2: 543 ms
IsSame: TRUE

Code: Pascal  [Select][+][-]
program MatrixMultiplication;
{$mode objfpc}
{$optimization on}

uses
  SysUtils,
  DateUtils, Math;

const
  N = 1023; // deliberately not a multiple of unroll
  unroll = 4;

type
  TMatrix = array of double;
  TVector = array of double;

var
  A, B, R1, R2: TMatrix;
  StartTime: TDateTime;

  procedure InitializeMatrix(var M: TMatrix);
  var
    i, j: integer;
  begin
    for i := 0 to N - 1 do
      for j := 0 to N - 1 do
        M[i*N + j] := Random;
  end;

  function Multiply1(const mat1, mat2: TMatrix): TMatrix;
  var
    i, j, k: NativeUInt;
    sum: double;
    v: TVector;
  begin
    SetLength(v, N);
    SetLength(Result, N*N);
    for j := 0 to N - 1 do
    begin
      for k := 0 to N - 1 do v[k] := mat2[k*N + j];
      for i := 0 to N - 1 do
      begin
        sum := 0;
        for k := 0 to N - 1 do
        begin
          sum := sum + mat1[i*N + k] * v[k];
        end;
        Result[i*N + j] := sum;
      end;
    end;
  end;

  function Multiply2(const mat1, mat2: TMatrix): TMatrix;
  var
    i, j, k: NativeUInt;
    sum: double;
    sum1, sum2, sum3, sum4: double;
    v: TVector;
    pm, pv: PDouble;
  begin
    SetLength(v, N);
    SetLength(Result, N*N);
    pv := @v[0];
    for j := 0 to N - 1 do
    begin
      for k := 0 to N - 1 do v[k] := mat2[k*N + j];
      for i := 0 to N - 1 do
      begin
        sum := 0;
        sum1 := 0; sum2 := 0; sum3 := 0; sum4 := 0;
        k := 0;
        pm := @mat1[i*N];
        while k < (N - unroll + 1) do
        begin
          sum1 += pm[k+0] * pv[k+0];
          sum2 += pm[k+1] * pv[k+1];
          sum3 += pm[k+2] * pv[k+2];
          sum4 += pm[k+3] * pv[k+3];

          inc(k, unroll);
        end;

        sum := sum1 + sum2 + sum3 + sum4;

        while k < N do
        begin
          sum += mat1[i*N + k] * v[k];
          inc(k);
        end;
        Result[i*N + j] := sum;
      end;
    end;
  end;

function IsSame(const mat1, mat2: TMatrix): Boolean;
var
  i, k: NativeUInt;
begin
  Result := True;
  for i := 0 to N - 1 do
    for k := 0 to N - 1 do
      if not SameValue(mat1[i*N + k], mat2[i*N + k]) then Exit(False);
end;

begin
  SetLength(A, N*N);
  SetLength(B, N*N);

  Randomize;
  InitializeMatrix(A);
  InitializeMatrix(B);

  StartTime := Now;
  R1 := Multiply1(A, B);
  Writeln('Execution Time 1: ', MilliSecondsBetween(Now, StartTime), ' ms');

  StartTime := Now;
  R2 := Multiply2(A, B);
  Writeln('Execution Time 2: ', MilliSecondsBetween(Now, StartTime), ' ms');

  WriteLn('IsSame: ', IsSame(R1, R2));

  ReadLn;
end.

Beyond this, probably only SIMD and various cache-handling techniques will help.
Title: Re: efficiency problem
Post by: photor on February 19, 2025, 06:28:46 am
@Alligator,
but it is still wrong: you are assuming "n" to be a multiple of 4, but that is not the general case.
Of course it is! I believe this is sufficient to show the acceleration effect itself.

UPD: I corrected the code for N not divisible by the unroll factor; the correction seems to be wrong, but it doesn't change the essence: unrolling the loop body gives acceleration, depending on the hardware - on mine, x2.

Good job. Does that mean the FPC compiler is not smart enough to do unrolling automatically?
Title: Re: efficiency problem
Post by: ALLIGATOR on February 19, 2025, 07:16:27 am
Does that mean the FPC compiler is not smart enough to do unrolling automatically?

At the moment, unfortunately, that is the case - even at the -O3 optimization level, where the LOOPUNROLL optimization is activated (https://wiki.freepascal.org/Optimization#Optimization_Switches):
-O3: O2 + CONSTPROP + DFA + USELOADMODIFYSTORE + LOOPUNROLL

BUT! A lot of work is being done, and sometime in the future it will be possible, I hope. Besides the native compiler/optimizer you can also use the LLVM backend - the situation is probably better there (but I haven't tried it, because the LLVM backend is not available on Windows yet).

In fact, as you were correctly told somewhere above: if you need to squeeze everything out of your hardware, it is better to compile this function with GCC/CLANG/ICC, use ready-made specialized libraries, or switch to assembler inserts/assembler functions inside Pascal.

But you may be satisfied with the performance that can be achieved with pure Pascal at its current level.
Title: Re: efficiency problem
Post by: photor on February 19, 2025, 09:45:22 am
Does that mean the FPC compiler is not smart enough to do unrolling automatically?

At the moment, unfortunately, that is the case. [...]

Another question: is the above speedup by unrolling related to SIMD? And how can we know whether the compiled code uses SIMD or not? I tried to use -CfAVX2 in the command line options, but it seems to have no effect on the speed.
Title: Re: efficiency problem
Post by: ALLIGATOR on February 19, 2025, 10:23:54 am
Is the above speedup by unrolling related to SIMD?
Loop unrolling and SIMD are independent things.
In the example above, the speedup came from the processor being able to use more of its available ALUs, because it realized it could do the work in parallel at the micro-operation level, i.e. calculate twice as much per cycle.

SIMD can give even more acceleration, because it performs 2-4 times more operations per instruction.

And how can we know whether the compiled code uses SIMD or not?
You can find out quite simply by looking at the assembly code. Place a breakpoint on the code section (line) of interest and run the program; when the breakpoint is triggered, open the assembler window - Ctrl+Shift+D (or menu "View" -> "Debug Windows" -> "Assembler").

But of course you will need some understanding of the assembler of your processor architecture. At the moment, as far as I know, FPC is not able to take advantage of SIMD when compiling user code. Existing implementations of various RTL subroutines, e.g. FillChar, are manually optimized to take full advantage of SIMD, but this is manual optimization; automatic vectorization is not available yet, and neither are intrinsics that would let you use SIMD instructions in a convenient way.

I tried to use -CfAVX2 in the command line options, but it seems to have no effect on the speed.
Yes, for the most part this switch will provide little to no speedup, since FPC will still use scalar versions of instructions rather than vector versions. Although... once I saw that it was able to use FMA instructions, thus speeding up the code.

By the way, adding -CfAVX2 replaced several instructions with a single SIMD FMA instruction
vfmadd231d xmm4,xmm8,[r11*8+rax+$18]
but, specifically on my hardware, it didn't give any speedup whatsoever.

------
Please note that I am not a professional developer, only an amateur; besides, I am not an FPC developer, so my words should be treated with a healthy share of skepticism. Perhaps, if I am wrong somewhere, more experienced forum members will correct me (ideally with concrete examples).
Title: Re: efficiency problem
Post by: photor on February 19, 2025, 11:22:00 am
Is the above speedup by unrolling related to SIMD?
Loop unrolling and SIMD are independent things. [...]

Thanks, your words are very helpful. An important way to speed up matrix multiplication is recursive partitioning, see for example
https://discourse.julialang.org/t/julia-matrix-multiplication-performance/55175/11
which works very well in Julia. I tried to do similar things in FPC, but have not succeeded so far (it may need more checking).
Title: Re: efficiency problem
Post by: LV on February 19, 2025, 07:47:09 pm
https://discourse.julialang.org/t/julia-matrix-multiplication-performance/55175/11
which works very well in Julia. I tried to do similar things in FPC, but have not succeeded so far (it may need more checking).

I was a little interested in the Julia programming language.  :)
I will try to port this recursive algorithm to Pascal. If I am not mistaken, then

Code: Pascal  [Select]
program RecursiveMatrixMultiplication;

uses
  SysUtils,
  Math,
  DateUtils;

const
  N = 1000;
  RECURSION_LIMIT = 256;

type
  TMatrix = array [0..N - 1, 0..N - 1] of double;

var
  A, B, C: TMatrix;
  StartTime, EndTime: TDateTime;

  procedure InitializeMatrix(var M: TMatrix);
  var
    i, j: integer;
  begin
    for i := 0 to N - 1 do
      for j := 0 to N - 1 do
        M[i, j] := Random;
  end;

  procedure BaseCaseMultiply(x1, y1, x2, y2, size: integer);
  var
    i, j, k: integer;
    sum: double;
  begin
    for i := 0 to size - 1 do
      for j := 0 to size - 1 do
      begin
        sum := 0.0;
        for k := 0 to size - 1 do
          sum := sum + A[x1 + i, y1 + k] * B[x2 + k, y2 + j];
        C[x1 + i, y2 + j] := C[x1 + i, y2 + j] + sum;
      end;
  end;

  procedure RecursiveMultiply(x1, y1, x2, y2, size: integer);
  var
    half: integer;
  begin
    if size <= RECURSION_LIMIT then
      BaseCaseMultiply(x1, y1, x2, y2, size)
    else
    begin
      half := size div 2;

      RecursiveMultiply(x1, y1, x2, y2, half);                       // A11*B11
      RecursiveMultiply(x1, y1, x2, y2 + half, half);                // A11*B12
      RecursiveMultiply(x1, y1 + half, x2 + half, y2, half);         // A12*B21
      RecursiveMultiply(x1, y1 + half, x2 + half, y2 + half, half);  // A12*B22

      RecursiveMultiply(x1 + half, y1, x2, y2, half);                // A21*B11
      RecursiveMultiply(x1 + half, y1, x2, y2 + half, half);         // A21*B12
      RecursiveMultiply(x1 + half, y1 + half, x2 + half, y2, half);  // A22*B21
      RecursiveMultiply(x1 + half, y1 + half, x2 + half, y2 + half, half); // A22*B22
    end;
  end;

begin
  Randomize;
  InitializeMatrix(A);
  InitializeMatrix(B);
  FillChar(C, SizeOf(C), 0);

  StartTime := Now;
  RecursiveMultiply(0, 0, 0, 0, N);
  EndTime := Now;

  Writeln('Execution Time: ', MilliSecondsBetween(EndTime, StartTime), ' ms');
  Readln;
end.

Code: Text  [Select]
Execution Time: 1507 ms

Parallel multiplication of matrices according to the program from reply #91.


Code: Text  [Select]
Number of threads
1    Execution Time: 1037 ms
2    Execution Time: 523 ms
3    Execution Time: 365 ms
4    Execution Time: 298 ms
5    Execution Time: 218 ms
6    Execution Time: 209 ms

Desktop i7-8700. The optimization level is set to -O3 everywhere.

Updated: graphics processors and the OpenCL framework are excellent choices for high-performance computing tasks. You can find open-source Free Pascal projects for them by searching online.


Title: Re: efficiency problem
Post by: LV on February 20, 2025, 09:39:37 pm
If we are not focused on educational goals while multiplying two matrices, it's advisable to use libraries instead. For example, I just spent five minutes with the AlgLib Free Edition library, which has low-level optimizations inside. To do this, I placed the wrapper file xalglib.pas and the DLL file alglib402_64free.dll in the program folder, then compiled and ran the program.

Desktop i7-8700
Code: Text  [Select]
Execution Time: 108 ms
IsSame: TRUE

Code: Pascal  [Select]
program MatrixMultiplicationALGLIB;

{$mode objfpc}

uses
  SysUtils,
  Math,
  DateUtils,
  xalglib;

const
  N = 1000;

var
  A, B, C, CBase: xalglib.TMatrix;
  i, j: integer;
  StartTime, EndTime: TDateTime;

  procedure BaseCaseMultiply(x1, y1, x2, y2, size: integer);
  var
    i, j, k: integer;
    sum: double;
  begin
    for i := 0 to size - 1 do
      for j := 0 to size - 1 do
      begin
        sum := 0.0;
        for k := 0 to size - 1 do
          sum := sum + A[x1 + i, y1 + k] * B[x2 + k, y2 + j];
        CBase[x1 + i, y2 + j] := sum;
      end;
  end;

  function IsSame(const mat1, mat2: TMatrix): boolean;
  var
    i, k: nativeuint;
  begin
    Result := True;
    for i := 0 to N - 1 do
      for k := 0 to N - 1 do
        if not SameValue(mat1[i, k], mat2[i, k]) then Exit(False);
  end;

begin
  SetLength(A, N, N);
  SetLength(B, N, N);
  SetLength(C, N, N);
  SetLength(CBase, N, N);

  Randomize;
  for i := 0 to N - 1 do
    for j := 0 to N - 1 do
    begin
      A[i, j] := Random(10);
      B[i, j] := Random(10);
    end;

  StartTime := Now;
  RMatrixGEMM(N, N, N, 1.0, A, 0, 0, 0, B, 0, 0, 0, 0.0, C, 0, 0);
  EndTime := Now;

  Writeln('Execution Time: ', MilliSecondsBetween(EndTime, StartTime), ' ms');

  BaseCaseMultiply(0, 0, 0, 0, N);
  WriteLn('IsSame: ', IsSame(C, CBase));

  readln;
end.

While developing this program, I also tried a block decomposition, which gives an execution time of about 50 ms. However, I have not completely debugged it. :(
Title: Re: efficiency problem
Post by: photor on February 21, 2025, 08:30:08 am
If we are not focused on educational goals while multiplying two matrices, it's advisable to use libraries instead.

But where can I get the AlgLib Free Edition library?
Title: Re: efficiency problem
Post by: ALLIGATOR on February 21, 2025, 09:31:51 am
You can also try these options:

https://github.com/mikerabat/mrmath
https://github.com/clairvoyant/cblas
Title: Re: efficiency problem
Post by: photor on February 21, 2025, 04:15:06 pm
You can also try these options:

https://github.com/mikerabat/mrmath
https://github.com/clairvoyant/cblas

Have you tried those packages? I tried to compile cblas according to the instructions, but got a 'Could not find unit directory for dependency package "rtl" required for package "cblas"' error.
mrmath doesn't even have installation instructions on its website.
Title: Re: efficiency problem
Post by: TRon on February 21, 2025, 04:23:52 pm
But where can I get the AlgLib Free Edition library?
lmddgtfy (https://lmddgtfy.net/?q=AlgLib%20Free%20Edition)
Title: Re: efficiency problem
Post by: ALLIGATOR on February 21, 2025, 04:30:37 pm
Have you tried those packages?

No, I just googled them, and they seemed like good candidates for a more accurate test. (Of course, it's rather expected that the source code may need to be brought up to working condition, since one of the libraries is a bit old and the other seems to be Delphi-only).
Title: Re: efficiency problem
Post by: LV on February 21, 2025, 06:03:41 pm
But where can I get the AlgLib Free Edition library?

I returned from work and saw that I received help with the answer. Thanks, @TRon :)
@photor. You can download the zip file from https://www.alglib.net/download.php. Once downloaded, unzip it and copy both the wrapper and the DLL files to the program folder. That's all you need to do.
Also, @ALLIGATOR, you wrote some excellent code above. Would you mind if I attempted to parallelize it?

Code: Pascal  [Select]
program MatrixMultiplication;
{$mode objfpc}
{$optimization on}
{$MODESWITCH CLASSICPROCVARS+}
{$LONGSTRINGS ON}

uses
  SysUtils,
  Classes,
  DateUtils,
  Math;

const
  N = 1000;
  unroll = 4;
  ThreadCount = 6;

type
  TMatrix = array of double;
  TVector = array of double;

var
  A, B, R1, R2, R3: TMatrix;
  StartTime: TDateTime;

  procedure InitializeMatrix(var M: TMatrix);
  var
    i, j: integer;
  begin
    for i := 0 to N - 1 do
      for j := 0 to N - 1 do
        M[i * N + j] := Random;
  end;

  function Multiply1(const mat1, mat2: TMatrix): TMatrix;
  var
    i, j, k: nativeuint;
    sum: double;
    v: TVector;
  begin
    SetLength(v, N);
    SetLength(Result, N * N);
    for j := 0 to N - 1 do
    begin
      for k := 0 to N - 1 do
        v[k] := mat2[k * N + j];
      for i := 0 to N - 1 do
      begin
        sum := 0;
        for k := 0 to N - 1 do
          sum := sum + mat1[i * N + k] * v[k];
        Result[i * N + j] := sum;
      end;
    end;
  end;

  procedure Multiply2Part(const mat1, mat2: TMatrix; var Result: TMatrix;
    start_i, end_i: nativeuint);
  var
    i, j, k: nativeuint;
    sum: double;
    sum1, sum2, sum3, sum4: double;
    v: TVector;
    pm, pv: PDouble;
  begin
    SetLength(v, N);
    for j := 0 to N - 1 do
    begin
      for k := 0 to N - 1 do
        v[k] := mat2[k * N + j];
      for i := start_i to end_i do
      begin
        sum := 0;
        sum1 := 0;
        sum2 := 0;
        sum3 := 0;
        sum4 := 0;
        k := 0;
        pm := @mat1[i * N];
        pv := @v[0];
        while k < (N - unroll + 1) do
        begin
          sum1 := sum1 + pm[k] * pv[k];
          sum2 := sum2 + pm[k + 1] * pv[k + 1];
          sum3 := sum3 + pm[k + 2] * pv[k + 2];
          sum4 := sum4 + pm[k + 3] * pv[k + 3];
          Inc(k, unroll);
        end;
        sum := sum1 + sum2 + sum3 + sum4;
        while k < N do
        begin
          sum := sum + pm[k] * pv[k];
          Inc(k);
        end;
        Result[i * N + j] := sum;
      end;
    end;
  end;

type
  TMultThread = class(TThread)
  private
    FStartI, FEndI: nativeuint;
    FMat1, FMat2: TMatrix;
    FResult: TMatrix;
  public
    constructor Create(StartI, EndI: nativeuint; const Mat1, Mat2: TMatrix;
      var ResultMat: TMatrix);
    procedure Execute; override;
  end;

  constructor TMultThread.Create(StartI, EndI: nativeuint;
  const Mat1, Mat2: TMatrix; var ResultMat: TMatrix);
  begin
    inherited Create(True);
    FStartI := StartI;
    FEndI := EndI;
    FMat1 := Mat1;
    FMat2 := Mat2;
    FResult := ResultMat;
    FreeOnTerminate := False;
  end;

  procedure TMultThread.Execute;
  begin
    Multiply2Part(FMat1, FMat2, FResult, FStartI, FEndI);
  end;

  function MultiplyThreaded(const mat1, mat2: TMatrix): TMatrix;
  var
    i: integer;
    start_i, end_i: nativeuint;
    rowsPerThread, remainder: nativeuint;
    Threads: array of TMultThread;
  begin
    SetLength(Result, N * N);
    rowsPerThread := N div ThreadCount;
    remainder := N mod ThreadCount;

    SetLength(Threads, ThreadCount);
    for i := 0 to ThreadCount - 1 do
    begin
      start_i := i * rowsPerThread;
      end_i := start_i + rowsPerThread - 1;
      if i = ThreadCount - 1 then
        end_i := end_i + remainder;
      Threads[i] := TMultThread.Create(start_i, end_i, mat1, mat2, Result);
    end;

    for i := 0 to ThreadCount - 1 do
      Threads[i].Start;

    for i := 0 to ThreadCount - 1 do
      Threads[i].WaitFor;

    for i := 0 to ThreadCount - 1 do
      Threads[i].Free;
  end;

  function IsSame(const mat1, mat2: TMatrix): boolean;
  var
    i, k: nativeuint;
  begin
    Result := True;
    for i := 0 to N - 1 do
      for k := 0 to N - 1 do
        if not SameValue(mat1[i * N + k], mat2[i * N + k]) then
          Exit(False);
  end;

begin
  SetLength(A, N * N);
  SetLength(B, N * N);

  Randomize;
  InitializeMatrix(A);
  InitializeMatrix(B);

  StartTime := Now;
  R1 := Multiply1(A, B);
  Writeln('Execution Time 1: ', MilliSecondsBetween(Now, StartTime), ' ms');

  StartTime := Now;
  R2 := MultiplyThreaded(A, B);
  Writeln('Execution Time 2 (Threaded): ', MilliSecondsBetween(Now, StartTime), ' ms');

  WriteLn('IsSame: ', IsSame(R1, R2));

  ReadLn;
end.

Desktop i7-8700. The optimization level is set to -O3.

Code: Text  [Select]
Execution Time 1: 987 ms
Execution Time 2 (Threaded): 71 ms
IsSame: TRUE

I think it's a pretty good result. :)
Title: Re: efficiency problem
Post by: photor on February 23, 2025, 11:54:36 am
Code: Text  [Select]
Execution Time 1: 987 ms
Execution Time 2 (Threaded): 71 ms
IsSame: TRUE

I think it's a pretty good result. :)

Good job! Is it possible to parallelize the previous version using the AlgLib Free Edition library?
Title: Re: efficiency problem
Post by: LV on February 25, 2025, 07:48:54 pm
Is it possible to parallelize the previous version using the AlgLib Free Edition library?

I attempted to parallelize the AlgLib Free Edition library. It compiles and executes with a runtime of 50 to 60 ms. However, the results fail the check: IsSame: FALSE. It is suspected that the library is not thread-safe.

Out of curiosity, my colleague ran the same task on MATLAB R2015b (64-bit) and reported an execution time of 90 to 110 ms.

The execution time of the program (Reply #116), which was written in pure Object Pascal, is between 70 and 80 ms.

I must admit that a couple of years ago, I was initially skeptical about using FPC and Lazarus. Now I am convinced that Free Pascal is quite suitable for this type of task.  ;)

P.S. All tests were conducted on a desktop with an Intel i7 8700 processor running Windows 11.

Updated: matrix multiplication in NumPy takes about 10-15 ms. Does this library perhaps solve the problem on the GPU with cuBLAS (NVIDIA)?
Title: Re: efficiency problem
Post by: photor on February 28, 2025, 02:14:02 pm
I attempted to parallelize the AlgLib Free Edition library. It compiles and executes with a runtime of 50 to 60 ms. However, the results fail the check: IsSame: FALSE. It is suspected that the library is not thread-safe.

Updated: matrix multiplication in NumPy takes about 10-15 ms. Does this library perhaps solve the problem on the GPU with cuBLAS (NVIDIA)?

Thanks for the exploration. On my system, an Intel i7 4700 processor running Windows 11, NumPy with OpenBLAS as the backend takes about 45 ms in the single-threaded case. With 4 threads, it takes about 15 ms. So, anyway, AlgLib's 108 ms with a single thread is not bad. But the multi-threaded version of AlgLib is not free, unfortunately.
Title: Re: efficiency problem
Post by: LV on March 01, 2025, 08:25:43 pm
So, anyway, AlgLib's 108 ms with a single thread is not bad. But the multi-threaded version of AlgLib is not free, unfortunately.

This evening, I had some free time and decided to parallelize the free version of the AlgLib library (see the attached code). Below are the results of comparative calculations, with an average derived from three program runs on a desktop with an Intel i7 8700 processor, which has 6 cores and 12 threads.

Code: Text  [Select]
Code I (one-threaded) from Reply #89                 966 ms
Code II (one-threaded) from Reply #103               363 ms
Code III (Code II four-threaded) from Reply #116     104 ms
Code IV (AlgLib one-threaded) from Reply #110        107 ms
Code V (AlgLib four-threaded), see attached code      53 ms
NumPy (twelve-threaded?)                              15 ms

Code: Pascal  [Select]
program MatrixMultiplicationALGLIB;

{$mode objfpc}{$H+}

uses
  SysUtils,
  Math,
  DateUtils,
  Classes,
  xalglib;

const
  N = 1000;

type
  TMatrixMultiplier = class(TThread)
  public
    A, B, C: xalglib.TMatrix;
    OffsetA, OffsetB, OffsetC: Integer;
    Rows, Cols: Integer;
    procedure Execute; override;
  end;

var
  A, B, C, CBase: xalglib.TMatrix;
  A1, A2, B1, B2: xalglib.TMatrix;
  C11, C12, C21, C22: xalglib.TMatrix;
  i, j, halfN, restN: integer;
  StartTime, EndTime: TDateTime;
  Threads: array[0..3] of TThread;

procedure TMatrixMultiplier.Execute;
begin
  try
    if (Length(A) > 0) and (Length(B) > 0) then
      RMatrixGEMM(Rows, Cols, N, 1.0,
        A, OffsetA, 0, 0,
        B, 0, OffsetB, 0,
        0.0, C, OffsetC, 0);
  except
    on E: Exception do
      Writeln('Thread error: ', E.ClassName, ': ', E.Message);
  end;
end;

procedure InitializeAlignedMatrix(var M: xalglib.TMatrix; Rows, Cols: Integer);
var
  i: Integer;
begin
  SetLength(M, Rows);
  for i := 0 to Rows - 1 do
  begin
    SetLength(M[i], Cols);
    FillChar(M[i][0], Cols * SizeOf(Double), 0);
  end;
end;

procedure GenerateRandomMatrix(var M: xalglib.TMatrix; Rows, Cols: Integer);
var
  i, j: Integer;
begin
  InitializeAlignedMatrix(M, Rows, Cols);
  for i := 0 to Rows - 1 do
    for j := 0 to Cols - 1 do
      M[i][j] := Random(10);
end;

function MatricesEqual(const Mat1, Mat2: xalglib.TMatrix): Boolean;
var
  i, j: Integer;
begin
  Result := True;
  for i := 0 to High(Mat1) do
    for j := 0 to High(Mat1[i]) do
      if not SameValue(Mat1[i][j], Mat2[i][j], 1e-9) then
        Exit(False);
end;

begin
  try
    // Initialize matrices with alignment
    GenerateRandomMatrix(A, N, N);
    GenerateRandomMatrix(B, N, N);
    InitializeAlignedMatrix(C, N, N);
    InitializeAlignedMatrix(CBase, N, N);

    // Standard multiplication
    StartTime := Now;
    RMatrixGEMM(N, N, N, 1.0, A, 0, 0, 0, B, 0, 0, 0, 0.0, C, 0, 0);
    EndTime := Now;
    Writeln('Standard: ', MilliSecondsBetween(EndTime, StartTime), ' ms');

    // Parallel block multiplication
    StartTime := Now;
    halfN := N div 2;
    restN := N - halfN;

    // Split matrix A
    SetLength(A1, halfN);
    SetLength(A2, restN);
    for i := 0 to halfN - 1 do
    begin
      A1[i] := Copy(A[i], 0, N);
      A2[i] := Copy(A[i + halfN], 0, N);
    end;

    // Split matrix B
    SetLength(B1, N, halfN);
    SetLength(B2, N, halfN);
    for i := 0 to N - 1 do
    begin
      B1[i] := Copy(B[i], 0, halfN);
      B2[i] := Copy(B[i], halfN, halfN);
    end;

    // Initialize result matrices
    InitializeAlignedMatrix(C11, halfN, halfN);
    InitializeAlignedMatrix(C12, halfN, restN);
    InitializeAlignedMatrix(C21, restN, halfN);
    InitializeAlignedMatrix(C22, restN, restN);

    // Configure threads
    Threads[0] := TMatrixMultiplier.Create(True);
    with TMatrixMultiplier(Threads[0]) do begin
      A := A1; B := B1; C := C11;
      OffsetA := 0; OffsetB := 0; OffsetC := 0;
      Rows := halfN; Cols := halfN;
      FreeOnTerminate := False;
    end;

    Threads[1] := TMatrixMultiplier.Create(True);
    with TMatrixMultiplier(Threads[1]) do begin
      A := A2; B := B2; C := C22;
      OffsetA := 0; OffsetB := 0; OffsetC := 0;
      Rows := restN; Cols := restN;
      FreeOnTerminate := False;
    end;

    Threads[2] := TMatrixMultiplier.Create(True);
    with TMatrixMultiplier(Threads[2]) do begin
      A := A1; B := B2; C := C12;
      OffsetA := 0; OffsetB := 0; OffsetC := 0;
      Rows := halfN; Cols := restN;
      FreeOnTerminate := False;
    end;

    Threads[3] := TMatrixMultiplier.Create(True);
    with TMatrixMultiplier(Threads[3]) do begin
      A := A2; B := B1; C := C21;
      OffsetA := 0; OffsetB := 0; OffsetC := 0;
      Rows := restN; Cols := halfN;
      FreeOnTerminate := False;
    end;

    // Start threads
    for i := 0 to 3 do Threads[i].Start;
    for i := 0 to 3 do Threads[i].WaitFor;

    // Assemble results
    for i := 0 to halfN - 1 do
    begin
      if (halfN > 0) and (Length(C11[i]) = halfN) then
        Move(C11[i][0], C[i][0], halfN * SizeOf(Double));
      if (restN > 0) and (Length(C12[i]) = restN) then
        Move(C12[i][0], C[i][halfN], restN * SizeOf(Double));
    end;

    for i := 0 to restN - 1 do
    begin
      if (halfN > 0) and (Length(C21[i]) = halfN) then
        Move(C21[i][0], C[i + halfN][0], halfN * SizeOf(Double));
      if (restN > 0) and (Length(C22[i]) = restN) then
        Move(C22[i][0], C[i + halfN][halfN], restN * SizeOf(Double));
    end;

    EndTime := Now;
    Writeln('Parallel: ', MilliSecondsBetween(EndTime, StartTime), ' ms');

    // Validation
    StartTime := Now;
    RMatrixGEMM(N, N, N, 1.0, A, 0, 0, 0, B, 0, 0, 0, 0.0, CBase, 0, 0);
    EndTime := Now;
    WriteLn('Validation: ', MatricesEqual(C, CBase));

    // Cleanup
    for i := 0 to 3 do Threads[i].Free;

  except
    on E: Exception do
      Writeln('Main error: ', E.ClassName, ': ', E.Message);
  end;

  Readln;
end.
Title: Re: efficiency problem
Post by: Paolo on March 01, 2025, 10:22:23 pm
Hello,

please try this instead of the code in reply #89 and tell me if things improve

Code: Pascal  [Select]
const
  gN = 1000;

type
  TVector = array [0..gN-1] of double;
  TMatrix = array [0..gN-1] of TVector;
  PTVector = ^TVector;

...
...

procedure TForm1.IKJ_VCT(var AMat1, AMat2, AMat3: TMatrix);
var
  i, j, k : integer;
  A_ik : double;
  PVct1, PVct2, PVct3 : PTVector;
begin
  for i:=0 to gN-1 do begin
    PVct1:=@AMat1[i];
    PVct3:=@AMat3[i];
    for k:=0 to gN-1 do begin
      A_ik:=PVct1^[k];
      PVct2:=@AMat2[k];
      for j:=0 to gN-1 do
        PVct3^[j]:=PVct3^[j]+A_ik*PVct2^[j]
    end
  end
end;

Title: Re: efficiency problem
Post by: LV on March 02, 2025, 08:46:41 am
Hi @Paolo! Your suggestion improves performance by approximately 30% compared to the code in reply #89 (see attached code).

Code: Pascal  [Select]
program MatrixMultiplication;

{$mode objfpc}{$H+}

uses
  SysUtils, Math, DateUtils;

const
  gN = 1000;

type
  TVector = array[0..gN-1] of Double;
  TMatrix = array[0..gN-1] of TVector;
  PTVector = ^TVector;

// Matrix initialization with random values
procedure RandomizeMatrix(var M: TMatrix);
var
  i, j: Integer;
begin
  Randomize;
  for i := 0 to gN - 1 do
    for j := 0 to gN - 1 do
      M[i][j] := Random * 1000;
end;

// Optimized multiplication (IKJ order)
procedure IKJ_VCT(var AMat1, AMat2, AMat3: TMatrix);
var
  i, j, k: Integer;
  A_ik: Double;
  PVct1, PVct2, PVct3: PTVector;
begin
  for i := 0 to gN - 1 do
  begin
    PVct1 := @AMat1[i];
    PVct3 := @AMat3[i];
    for k := 0 to gN - 1 do
    begin
      A_ik := PVct1^[k];
      PVct2 := @AMat2[k];
      for j := 0 to gN - 1 do
        PVct3^[j] := PVct3^[j] + A_ik * PVct2^[j];
    end;
  end;
end;

// Matrix multiplication (modified version)
procedure ModifiedMultiply(const mat1, mat2: TMatrix; var res: TMatrix);
var
  i, j, k: Integer;
  sum: Double;
  v: TVector;
begin
  FillChar(res, SizeOf(res), 0);
  for j := 0 to gN - 1 do
  begin
    for k := 0 to gN - 1 do
      v[k] := mat2[k][j];
    for i := 0 to gN - 1 do
    begin
      sum := 0;
      for k := 0 to gN - 1 do
        sum := sum + mat1[i][k] * v[k];
      res[i][j] := sum;
    end;
  end;
end;

// Matrix comparison
function IsSame(const mat1, mat2: TMatrix): Boolean;
var
  i, k: NativeUInt;
begin
  for i := 0 to gN - 1 do
    for k := 0 to gN - 1 do
      if not SameValue(mat1[i][k], mat2[i][k]) then
        Exit(False);
  Result := True;
end;

var
  Mat1, Mat2, MatRes, MatExpected: TMatrix;
  startTime: TDateTime;
  elapsedOpt, elapsedModified: Int64;
  i, j: Integer;
begin
  // Initialize matrices
  RandomizeMatrix(Mat1);
  RandomizeMatrix(Mat2);

  // Clear result matrices
  FillChar(MatRes, SizeOf(MatRes), 0);
  FillChar(MatExpected, SizeOf(MatExpected), 0);

  // Test optimized version
  startTime := Now;
  IKJ_VCT(Mat1, Mat2, MatRes);
  elapsedOpt := MilliSecondsBetween(startTime, Now);

  // Test modified version
  startTime := Now;
  ModifiedMultiply(Mat1, Mat2, MatExpected);
  elapsedModified := MilliSecondsBetween(startTime, Now);

  // Verify results
  WriteLn('Verification: ', IsSame(MatRes, MatExpected));
  WriteLn('Optimized time: ', elapsedOpt, ' ms');
  WriteLn('Modified time: ', elapsedModified, ' ms');
  WriteLn('Speed ratio: ', (elapsedModified/elapsedOpt):0:1, 'x');
  Readln;
end.

Performance Comparison Table Relative to NumPy:

Code: Text  [Select]
Code I (one-threaded) from Reply #89                  64.4
Code II (one-threaded) from Reply #121                51.0
Code III (one-threaded) from Reply #103               24.2
Code IV (AlgLib one-threaded) from Reply #110          7.1
Code V (Code III four-threaded) from Reply #116        7.0
Code VI (AlgLib four-threaded) from Reply #120         3.5
NumPy (twelve-threaded?)                               1.0
Title: Re: efficiency problem
Post by: photor on March 02, 2025, 09:21:31 am
Quote
NumPy (twelve-threaded?)                               1.0

NumPy is 8-threaded by default (on most modern computers).

Updated: on my computer (rather old, Intel i7 4700), NumPy is 4-threaded by default, using MKL on Windows 11.
Title: Re: efficiency problem
Post by: LV on March 02, 2025, 10:33:22 am
Hi @photor! Note: NumPy on my machine uses the MKL library. I tested performance with one and then four threads using that library. Here are the results:

Code: Text  [Select]
                                                    ms
-------------------------------------------------------
Code I (one-threaded) from Reply #89                966
Code II (one-threaded) from Reply #121              765
Code III (one-threaded) from Reply #103             363
Code IV (AlgLib one-threaded) from Reply #110       107
Code V (Code III four-threaded) from Reply #116     104
Code VI (AlgLib four-threaded) from Reply #120       53
NumPy (one-threaded MKL)                             37
NumPy (four-threaded MKL)                            14
-------------------------------------------------------
Title: Re: efficiency problem
Post by: photor on March 02, 2025, 12:52:39 pm
Quote
NumPy (one-threaded MKL)                             37
NumPy (four-threaded MKL)                            14

Thanks. Typically NumPy uses MKL or OpenBLAS as the backend. OpenBLAS is a completely free library written in C, so can Free Pascal call it directly instead of AlgLib?
Title: Re: efficiency problem
Post by: LV on March 02, 2025, 01:28:25 pm
Thanks. Typically NumPy uses MKL or OpenBLAS as the backend. OpenBLAS is a completely free library written in C, so can Free Pascal call it directly instead of AlgLib?

The short answer is yes. There's no need to "reinvent the wheel" for standard algebraic operations. However, implementing non-standard algorithms may require different considerations, and FPC and Lazarus also shine when it's important to develop a graphical interface quickly. There may be other reasons to choose them for your project as well. The issue of efficiency is complex and has many dimensions.  :)
Title: Re: efficiency problem
Post by: Paolo on March 11, 2025, 10:18:08 pm
my last attempt with a single thread; can you check the speed?

I made the array size variable (mode = Delphi, no range checks)

Code: Pascal  [Select]
type
  T0Arr = array [0..0] of double;
  TP0Arr = ^T0Arr;
...
procedure TForm1.IKJ_MEM2BIS(var AN: integer; var AM1, AM2, AM3: TP0Arr);
var
  M1ik : double;
  M2k, M3ij, M1i, M3i : ^double;
  i, j, k : integer;
  C : integer;
begin
  M1i:=@AM1[0];
  for i:=0 to An-1 do begin
    M3i:=@AM3[i*An];
    M2k:=@AM2[0];
    for k:=0 to An-1 do begin
      M1ik:=M1i^;
      M3ij:=M3i;
      for j:=0 to An-1 do begin
        M3ij^:=M3ij^+M1ik*M2k^;
        Inc(M2k);
        Inc(M3ij)
      end;
      Inc(M1i)
    end
  end
end;

usage

Code: Pascal  [Select]
..
var
   M1MEM, M2MEM, M3MEM : TP0Arr;

...
  m:=1000;
  GetMem(M1MEM, m*m*8);
  GetMem(M2MEM, m*m*8);
  GetMem(M3MEM, m*m*8);

  IKJ_MEM2BIS(m, M1MEM, M2MEM, M3MEM);

  FreeMem(M1MEM);
  FreeMem(M2MEM);
  FreeMem(M3MEM);
Title: Re: efficiency problem
Post by: photor on March 13, 2025, 03:28:45 am
Quote
my last attempt with a single thread; can you check the speed?

Tested; it doesn't help much.
Recently I gained a better understanding of high-performance matrix multiplication. I will post something soon, when I have some spare time.
Title: Re: efficiency problem
Post by: Paolo on March 13, 2025, 08:38:46 am
Yes, about 10% better.

In any case, we can say that the difference in speed between the two codes below is large, which makes the solution based on dynamic arrays not really usable!

The same code with static arrays...

Code: Pascal  [Select]
..
  TVector = array [0..gN-1] of double;
  TMatrix = array [0..gN-1] of TVector;
..
procedure TForm1.IJK_STD(var AMat1, AMat2, AMat3: TMatrix);
var
  i, j, k : integer;
  Sum : Double;
begin
  for i:=0 to gN-1 do
    for j:=0 to gN-1 do begin
      Sum:=0;
      for k:=0 to gN-1 do
        Sum:=Sum+AMat1[i,k]*AMat2[k,j];
      AMat3[i,j]:=Sum
    end
end;

...and with dynamic arrays

Code: Pascal  [Select]
..
  TArrayDouble = array of array of double;
..
procedure TForm1.IJK_DynArr(var AMat1, AMat2, AMat3: TArrayDouble);
var
  i, j, k : integer;
  Sum : Double;
begin
  for i:=0 to gN-1 do
    for j:=0 to gN-1 do begin
      Sum:=0;
      for k:=0 to gN-1 do
        Sum:=Sum+AMat1[i,k]*AMat2[k,j];
      AMat3[i,j]:=Sum
    end
end;

Quote
I will post something soon, when I have some spare time

good to know.
Title: Re: efficiency problem
Post by: Thaddy on March 13, 2025, 12:46:29 pm
Once memory for the arrays is allocated there is no speed difference between static and dynamic arrays as long as the code to access the array elements is exactly the same. Dynamic arrays only have a slight speed penalty at allocation time. Then again, a static array may affect the load time of the executable.
Title: Re: efficiency problem
Post by: Khrys on March 13, 2025, 01:30:16 pm
Once memory for the arrays is allocated there is no speed difference between static and dynamic arrays as long as the code to access the array elements is exactly the same. Dynamic arrays only have a slight speed penalty at allocation time. Then again, a static array may affect the load time of the executable.

Did you even read the code?

This has nothing to do with static/dynamic memory allocation - it's about fixed-size, contiguous, true 2D arrays vs. dynamically-sized arrays of arrays, the latter of which requires an extra pointer dereference and may have terrible memory locality.
Title: Re: efficiency problem
Post by: Paolo on March 13, 2025, 01:30:27 pm
Hi Thaddy, copy-paste and run the two codes above and let me know what you see in terms of speed.
Here: Win64, FPC 3.2.2, -O4 optimization.
Title: Re: efficiency problem
Post by: LV on March 14, 2025, 05:51:11 pm
As part of the exercise, I:
1. downloaded the libopenblas.dll file from https://sourceforge.net/projects/openblas/.
2. wrote a program.

Code: Pascal  [Select]
program MatrixTest;

uses
  SysUtils,
  CTypes,
  Math;

const
  N = 1000;
  EPSILON = 1e-6;

var
  A, B, C_Blas, C_Naive: array of double;
  i, j, k: integer;
  start_time, end_time: QWord;
  max_diff: double;
  alpha, beta: double;
  mm, nn, k_val: integer;

procedure dgemm(transa: PChar; transb: PChar; m: PCInt; n: PCInt;
  k: PCInt; alpha: PDouble; A: PDouble; lda: PCInt; B: PDouble;
  ldb: PCInt; beta: PDouble; C: PDouble; ldc: PCInt); cdecl;
  external 'libopenblas' Name 'dgemm_';

  procedure NaiveMatrixMultiplyColMajor;
  var
    sum: double;
  begin
    for i := 0 to N - 1 do
      for j := 0 to N - 1 do
      begin
        sum := 0.0;
        for k := 0 to N - 1 do
          sum := sum + A[i + k * N] * B[k + j * N];
        C_Naive[i + j * N] := sum;
      end;
  end;

begin
  Randomize;

  SetLength(A, N * N);
  SetLength(B, N * N);
  SetLength(C_Blas, N * N);
  SetLength(C_Naive, N * N);

  for i := 0 to N - 1 do
    for j := 0 to N - 1 do
    begin
      A[i + j * N] := Random;
      B[i + j * N] := Random;
    end;

  mm := N;
  nn := N;
  k_val := N;
  alpha := 1.0;
  beta := 0.0;

  start_time := GetTickCount64;
  dgemm('N', 'N', @mm, @nn, @k_val, @alpha, @A[0], @mm, @B[0], @k_val,
        @beta, @C_Blas[0], @mm);
  end_time := GetTickCount64;
  Writeln('OpenBLAS time: ', (end_time - start_time), ' ms');

  start_time := GetTickCount64;
  NaiveMatrixMultiplyColMajor;
  end_time := GetTickCount64;
  Writeln('Naive time:    ', (end_time - start_time), ' ms');

  max_diff := 0.0;
  for i := 0 to N * N - 1 do
    max_diff := Max(max_diff, Abs(C_Naive[i] - C_Blas[i]));

  Writeln('Max difference: ', max_diff: 0: 10);
  if max_diff < EPSILON then
    Writeln('Results are consistent!')
  else
    Writeln('Results differ!');

  readln;
end.

Output (i7 8700; Windows 11; FPC 3.2.2):

Code: [Select]
OpenBLAS time: 16 ms
Naive time:    2140 ms
Max difference: 0.0000000000
Results are consistent!

It appears that I am approaching the performance limit (16 ms), at least on my machine.  ;)
Title: Re: efficiency problem
Post by: ALLIGATOR on March 14, 2025, 06:13:44 pm
Code: [Select]
OpenBLAS time: 16 ms
Naive time:    2140 ms
Max difference: 0.0000000000
Results are consistent!
It appears that I am approaching the performance limit (16 ms), at least on my machine.  ;)

🤖🆒🎉
Title: Re: efficiency problem
Post by: Paolo on March 14, 2025, 07:46:05 pm
@LV, is it multithreaded?
Title: Re: efficiency problem
Post by: Thaddy on March 14, 2025, 07:52:08 pm
Did you even read the code?

This has nothing to do with static/dynamic memory allocation - it's about fixed-size, contiguous, true 2D arrays vs. dynamically-sized arrays of arrays,
Yes it does. It even implies what I wrote.
Quote
the latter of which requires an extra pointer dereference and may have terrible memory locality.
NO. There is no difference once the memory is allocated. None whatsoever. There are only indirections by way of the programmer, not the compiler.
Title: Re: efficiency problem
Post by: Thaddy on March 14, 2025, 07:55:05 pm
@LV, is it multithreaded?
Multithreading is not the holy grail. If your number of threads exceeds the CPU count, you are likely to end up with less efficient code whenever the code relies on a combined result.
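For example, instead of a hard-coded ThreadCount, the worker count could be clamped to the number of logical CPUs. A minimal sketch (assuming TThread.ProcessorCount is available in your FPC version):

Code: Pascal  [Select]
program ThreadCountDemo;
{$mode objfpc}
uses
  Classes, Math;

var
  ThreadCount: integer;
begin
  // Never spawn more CPU-bound workers than there are logical processors;
  // oversubscription only adds scheduling overhead.
  ThreadCount := Min(6, TThread.ProcessorCount);
  Writeln('Using ', ThreadCount, ' threads');
end.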
Title: Re: efficiency problem
Post by: PascalDragon on March 14, 2025, 07:57:49 pm
Quote
the latter of which requires an extra pointer dereference and may have terrible memory locality.
NO. There is no difference once the memory is allocated. None whatsoever. There are only indirections by way of the programmer, not the compiler.

You are wrong. A static multidimensional array is one consecutive block in memory, so only a single memory access needs to be done. In a multidimensional dynamic array, each sub-array is again a pointer, so to access a value in an N-dimensional dynamic array, N memory accesses need to be done.
Title: Re: efficiency problem
Post by: Thaddy on March 14, 2025, 08:16:06 pm
That should only matter on relocation (ragged array as result) but never on initialization. The memory layout should be the same.
Title: Re: efficiency problem
Post by: PascalDragon on March 14, 2025, 08:24:38 pm
It has nothing to do with relocation. It's how dynamic arrays work.

If you initialize a 2-dimensional array, let's say to dimensions of length 5 and 10, then the RTL will do 6 memory allocations (1 for the first dimension and 5 for the arrays of the second dimension), and all of these might reside in different parts of the heap. Thus you will always have two dereferences (in this example).
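A minimal sketch that makes the layout difference visible (illustrative only; the printed addresses will vary from run to run):

Code: Pascal  [Select]
program RowLayout;
{$mode objfpc}
var
  D: array of array of double;     // array of pointers to separately allocated rows
  S: array[0..4, 0..9] of double;  // one contiguous block
  i: integer;
begin
  SetLength(D, 5, 10);
  // Each row of D is its own heap allocation, so consecutive rows
  // need not be adjacent in memory:
  for i := 0 to 4 do
    Writeln('D[', i, '] starts at ', PtrUInt(@D[i][0]));
  // The static array has a fixed row stride of 10 * SizeOf(Double) = 80 bytes:
  Writeln('Row stride of S: ', PtrUInt(@S[1][0]) - PtrUInt(@S[0][0]));
end.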
Title: Re: efficiency problem
Post by: ALLIGATOR on March 15, 2025, 02:50:47 am
If you initialize a 2-dimensional array, let's say to dimensions of length 5 and 10, then the RTL will do 6 memory allocations (1 for the first dimension and 5 for the arrays of the second dimension), and all of these might reside in different parts of the heap.
I don't use n-dimensional (n > 1) dynamic arrays, but I also thought that memory for them was allocated once, in one contiguous chunk. Thank you, now I know about this behavior 🙂
(✍️ memorized)
Title: Re: efficiency problem
Post by: photor on March 15, 2025, 07:08:45 am
It appears that I am approaching the performance limit (16 ms), at least on my machine.  ;)

OpenBLAS is automatically multi-threaded. Please set the environment variable
Quote
OPENBLAS_NUM_THREADS=1
to disable it.
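For example, in a Windows command prompt, before starting the test program (the executable name is from the listing above):

Code: Text  [Select]
set OPENBLAS_NUM_THREADS=1
MatrixTest.exe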
Title: Re: efficiency problem
Post by: Thaddy on March 15, 2025, 07:15:44 am
Amazed that I was wrong.
Title: Re: efficiency problem
Post by: LV on March 15, 2025, 07:00:15 pm
OpenBLAS is automatically multi-threaded. Please set the environment variable
Quote
OPENBLAS_NUM_THREADS=1
to disable it.

Yes, the OpenBLAS library does support multithreading. A procedure has been added to set the number of threads during execution, and the matrix sizes have been expanded to 3000x3000.

Code: Pascal  [Select]
...
const
  N = 3000;

...
  procedure openblas_set_num_threads(num: cint); cdecl; external 'libopenblas.dll';

begin
...
  openblas_set_num_threads(1);
...
  openblas_set_num_threads(2);
...
  openblas_set_num_threads(3);
...
end.

output:  :-[

Code: Text  [Select]
OpenBLAS NUM_THREADS=1: 890 ms
OpenBLAS NUM_THREADS=2: 469 ms
OpenBLAS NUM_THREADS=3: 312 ms
OpenBLAS NUM_THREADS=4: 250 ms
OpenBLAS NUM_THREADS=5: 235 ms
OpenBLAS NUM_THREADS=6: 234 ms
Naive Time : 111141 ms
Max difference: 0.0000000000
Results are consistent!
Title: Re: efficiency problem
Post by: photor on March 16, 2025, 05:26:02 am
Yes, the OpenBLAS library does support multithreading. A procedure has been added to set the number of threads during execution, and the matrix sizes have been expanded to 3000x3000.

What's the result for N=1000, single-threaded?
Title: Re: efficiency problem
Post by: LV on March 16, 2025, 08:57:13 am
What's the result for N=1000, single-threaded?

For matrices with dimensions of N = 1000:

OpenBLAS NUM_THREADS=1: 31 ms

About two years ago, I chose FPC and Lazarus for various reasons. For symbolic calculations I used Maxima, and for algebraic computations the DMath, LMath, and AlgLib libraries. I now recognize the value of integrating OpenBLAS for tasks where performance is critical.
Title: Re: efficiency problem
Post by: jamie on March 16, 2025, 12:33:38 pm
Amazed that I was wrong.
Maybe you were.
Jamie
Title: Re: efficiency problem
Post by: PascalDragon on March 18, 2025, 08:48:11 pm
Amazed that I was wrong.

Then you might be the only one that is amazed :P