program matmultiply;
Uses SysUtils;
Const
n=1000;
Type
Vector=Array[0..n-1] of Extended;
Matrix=Array[0..n-1] of Vector;
Var
StartMS:double;
d,dd:Matrix;
Function Multiply(Const mat1,mat2:Matrix):Matrix;
Var
i,j,k:Word;
sum:Extended;
v:Vector;
Begin
For i:=0 to n-1 do
For j:=0 to n-1 do
Begin
sum:=0;v:=mat1[i];
For k:=0 to n-1 do sum+=v[k]*mat2[k,j];
Multiply[i,j]:=sum;
End;
End;
Function ChebyPts:Vector;
Var
i:Word;
Begin
For i:=0 to n-1 do ChebyPts[i]:=Cos(i*Pi/(n-1));
End;
Function ChebyDiff:Matrix;
Var
i,j:Word;
c,CP:Vector;
Begin
For i:=1 to n-2 do c[i]:=1;
c[0]:=2;c[n-1]:=2;
CP:=ChebyPts;
For i:=0 to n-1 do
For j:=0 to n-1 do
if i<>j then ChebyDiff[i,j]:=(c[i]/c[j])*Cos((i+j)*Pi)/(CP[i]-CP[j])
else if i=0 then ChebyDiff[i,j]:=(2*(n-1)*(n-1)+1)/6
else if i=n-1 then ChebyDiff[i,j]:=-(2*(n-1)*(n-1)+1)/6
else ChebyDiff[i,j]:=-0.5*CP[j]/(1-CP[j]*CP[j]);
End;
begin
d:=ChebyDiff;
StartMS:=timestamptomsecs(datetimetotimestamp(now));
dd:=Multiply(d,d);
Writeln('Serially processed in '+floattostr(timestamptomsecs(datetimetotimestamp(now))-StartMS)+'ms.');
end.
Are you using the exact same looping code in Mathematica with the most precise but slow Extended type? Or do you use its built-in optimized vector and matrix routines?
Anyway, a simple thing to try: store mat2[k,j] in a variable before the k loop. A pointer construct for the innermost loop might also help.

Function Multiply(Const mat1,mat2:Matrix):Matrix;
Var
i,j,k:Word;
sum:Extended;
v:Vector;
mat2value:extended;
pv : PExtended; // define as ^extended if not known
pvend:pextended;
Begin
For i:=0 to n-1 do
begin
v:=mat1[i]; // invariant in j loop.
For j:=0 to n-1 do
Begin
sum:=0; mat2value:=mat2[k,j];
pv:=@v[0]; pvend:=@v[n-1];
while (pv<=pvend) do
begin
sum:=sum+pv^*mat2value;
inc(pv);
end;
Multiply[i,j]:=sum;
end;
End;
End;
Are you sure that the Vector elements must be of type Extended? If you use Single, the FP coprocessor will still process at 80 bit, but the data (heap) will be physically stored as 32 bit, which will reduce the processing time by 50%.
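To illustrate the storage point, here is a minimal sketch (mine, not from the thread) printing the in-memory size of each float type. It assumes 32-bit x86 FPC, where Extended is the 80-bit x87 type; on Win64, Extended is an alias for Double.

```pascal
program FloatSizes;
{ Prints the storage size of each floating-point type.
  The x87 FPU computes at 80-bit precision internally regardless
  of how the values are stored in memory. }
begin
  WriteLn('Single:   ', SizeOf(Single), ' bytes');
  WriteLn('Double:   ', SizeOf(Double), ' bytes');
  WriteLn('Extended: ', SizeOf(Extended), ' bytes');
end.
```

Smaller storage means less memory traffic per element, which is where the speedup comes from.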
The time delta could be explained by the way Matlab processes it (e.g. SSE-optimized matrix multiplication, maybe even open compute?). You won't be able to get such results without using an optimized assembly routine working with the SSE instruction set (or a special math library). FP is slow.
Pascal is not a domain-specific language, and writing such code is not necessarily easy for a scientist, while with Matlab you don't have to mess with these technical details since everything is under the hood.
"The documentation of Mathematica says it works at extended precision by default, which I understand as using the FPU by default."

On a platform with FPU support, FPC will use the FPU. The x86(_64) FPU precisions are single (32-bit), double (64-bit) and extended (80-bit). Your program, when compiled on my machine using:
"And, by the way, I am getting a stack overflow here. You should consider using dynamic arrays instead of static ones."

Same here, I need to convert the code to dynamic arrays first.
This extreme difference in efficiency is not credible.
I reckon Mathematica uses the Strassen algorithm, which reduces the runtime of square matrix multiplication from N^3 to N^2.807. With N=1000 this would be a runtime reduction to 26% by the algorithm alone.
There are even better algorithms (Coppersmith–Winograd, N^2.37), but Wikipedia says these are not yet practical.
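As a quick sanity check of the 26% figure (my sketch, not part of the original post), the ratio N^2.807 / N^3 at N = 1000 is N^(-0.193):

```pascal
program StrassenRatio;
{ Computes 1000^2.807 / 1000^3 = 1000^(-0.193), roughly 0.26:
  Strassen alone would cut the operation count to about 26%. }
uses Math;
begin
  WriteLn(Power(1000.0, 2.807 - 3.0):0:3);
end.
```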
The major efficiency problem in your code is the inner loop and the access to mat2[k,j], where k is the loop variable. Array access is significantly faster if you just increment the index. Maybe you should consider transposing one of the matrices before multiplication.
And, by the way, I am getting a stack overflow here. You should consider using dynamic arrays instead of static ones.
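A minimal sketch (mine) of the dynamic-array alternative: the data lives on the heap, so the default stack limit no longer matters, and SetLength also zero-initialises the elements:

```pascal
program DynMatrix;
{ Heap-allocated matrix via dynamic arrays: avoids the stack
  overflow caused by large static arrays. }
type
  TDynMatrix = array of array of Double;
var
  m: TDynMatrix;
begin
  SetLength(m, 1000, 1000);   { allocated on the heap, zero-filled }
  m[999, 999] := 1.0;
  WriteLn(Length(m), ' x ', Length(m[0]));
end.
```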
On the coprocessor: FPC normally uses the x87, but you may try SSE using the switch {$FPUTYPE SSE3}. Sometimes this produces faster code, but not always.
Also, try to make sure that Matlab does not just cache previously calculated results.
http://de.mathworks.com/company/newsletters/articles/matlab-incorporates-lapack.html
@Marcov: your code produces an entirely different result matrix than the original one.
Function Mul(Const mat1,mat2:Matrix):Matrix;
Var
i,j,k:Word;
sum:Extended;
u,v:PExtended;
mat:Matrix;
Begin
For i:=0 to n-1 do
For j:=0 to n-1 do mat[j,i]:=mat2[i,j];
For i:=0 to n-1 do
For j:=0 to n-1 do
Begin
sum:=0;u:=@mat1[i];v:=@mat[j];
For k:=0 to n-1 do
Begin
sum+=u^*v^;
Inc(u);Inc(v);
End;
Mul[i,j]:=sum;
End;
End;
{$R-}{$Q-}
type valuetype = single;
var M1,M2,m3:array[1..1000,1..1000] of valuetype;
procedure TForm1.Button1Click(Sender: TObject);
var i,j,k:integer;
T:TDateTime;
sum:valuetype;
begin
for i:=1 to 1000 do
for j:=1 to 1000 do begin
M1[i,j]:={round(}random*50{)};
M2[i,j]:={round(}random*50{)};
end;
T:=now;
for i:=1 to 1000 do
for j:=1 to 1000 do begin
sum:=0;
for k:=1 to 1000 do
sum:=sum + M1[i,k]*M2[k,j];
M3[i,j]:=sum;
end;
Button1.caption:=inttostr(round((now-T)*24*60*60*1000))+' ms';
end;
 24 827 ms | Single
 37 832 ms | Double
116 832 ms | Extended
 51 894 ms | Integer
 38 047 ms | Word
Just write it with C++ either with ICC or GCC and then statically link it. Simple solution. 8-)
"I was really astonished to see that Double performed better than Word..."

I think that's due to the fact that you add the "round" function to integer types. This conversion adds additional computation compared to plain floating-point multiplications.
"I think that's due to the fact that you add the "round" function to integer types."

I start counting 'time' after the matrices have been prepared (where T:=now), so round doesn't affect it.
"I don't think C++ does that work better than Pascal."

Language-wise: no. Compiler-wise: yes. GCC does have this indexing optimization, as does ICC, and AFAIK Clang too. Perhaps other big compilers as well.
Now the same matrix multiplication only takes less than 2 seconds on my machine using -O3 in Free Pascal 2.6.4. That's rather significant progress!
For i:=0 to n-1 do for j:=0 to n-1 do mat[j,i]:=mat2[i,j];
For i:=0 to n-1 do for j:=0 to n-1 do mat[j,i]:= 1.0; //wrong of course
filldword (mat, sizeof (mat) div 4, 0); //wrong of course
Function Mul(Const mat1,mat2:Matrix):Matrix;
Var
i,j,k:Word;
sum:Extended;
u,v:PExtended;
mat:Matrix;
rowstride:integer;
Begin
// check for smaller matrix here, must have at least 2 rows.
rowstride:=ptrint(@mat[1,0])-ptrint(@mat[0,0]);
For i:=0 to n-1 do
For j:=0 to n-1 do
Begin
sum:=0;u:=@mat1[i];v:=@mat[i,j];
For k:=0 to n-1 do
Begin
sum+=u^*v^;
Inc(u);Inc(pbyte(v),rowstride);
End;
Mul[i,j]:=sum;
End;
End;
Type
Float = Double;
PFloat = ^Float;
Vector=Array[0..n-1] of Float;
Matrix=Array[0..n-1] of Vector;
PMatrix = ^Matrix;
procedure Multiply(Const mat1, mat2: PMatrix; var Result: PMatrix);
Var
i, j, k: Word;
u, v, x: PFloat;
Begin
For i:=0 to n-1 do Begin
u := @mat1^[i];
For k:=0 to n-1 do begin
v := @mat2^[k];
x := @Result^[i];
For j:=0 to n-1 do begin
x^ += u^*v^;
inc(v);
inc(x);
end;
inc(u);
End;
end;
End;
is slightly faster than the latest version when Float = Extended, and ~3 times faster when Float = Double; of course, compared to the version with the same element type (Extended - Extended, Double - Double).

This is not correct: in matrix multiplication you have to multiply the row values of the first matrix by the column values of the second matrix. In this method, however, you always multiply rows (or columns, depending on how you look at the matrix).
program MatMulSSE2;
uses
sysutils;
type
valuetype = double; { should not be changed }
{$R-}{$Q-}
const
n = 1000;
var
M1,M2,TM2,M3: array[1..n,1..n] of valuetype;
row,col,k: integer;
T: TDateTime;
sum: valuetype;
sums:array[1..2] of valuetype;
gsum: valuetype;
pRow, pCol: pointer;
label
again;
begin
Randomize;
for row := 1 to n do
for col := 1 to n do begin
{ example 1 } //{
M1[row,col] := row; { or col }
M2[row,col] := 1;//}
{ example 2 } {
M1[row,col] := Random*50;
M2[row,col] := Random*50;//}
{ example 3 } {
if row = col then
begin
M1[row,col] := 1;
M2[row,col] := 1;
end
else
begin
M1[row,col] := 0;
M2[row,col] := 0;
end;//}
{ Transposition of the second matrix }
TM2[col, row] := M2[row,col];
end;
T := now;
gsum := 0;
for row := 1 to n do //row from 1st matrix
for col := 1 to n do //col from 2nd matrix
begin
{ original code in Pascal --> to use remove } {
sum := 0;
for k:=1 to n do //col from 1st / row from 2nd
{ using matrix M2 } {
sum:=sum + M1[row,k]*M2[k,col]; //}
{ using transposed matrix TM2 }
sum:=sum + M1[row,k]*TM2[col,k];
M3[row, col] := sum; //}
{ SSE2 code } //{
{ SSE2 handles two double values at the same time }
sums[1] := 0; sums[2] := 0;
pRow := @M1[row, 1];
pCol := @TM2[col, 1];
{$ASMMODE intel}
asm
{ loop counter }
mov ecx, n/2
{ sum := 0 }
movupd xmm2, sums
{ @M1[Row, 1] }
mov eax, pRow
{ @M2[1, Col] }
mov ebx, pCol
again:
{ xmm0 := M1[Row, k .. k+1] }
movapd xmm0, [eax]
//movupd xmm0, [eax]
{ xmm1 := TM2[Col, k .. k+1] }
movapd xmm1, [ebx]
{ xmm0 := M1[Row, k] * M2[k, Col] , M1[Row, k+1] * M2[k+1, Col] }
mulpd xmm0, xmm1
{ xmm2 := xmm2 + xmm0 }
addpd xmm2, xmm0
{ point at next pair of columns from M1[Row] }
add eax, 16 { 2 * size of valuetype }
{ point at next pair of rows from TM2[Col] }
add ebx, 16
dec ecx
jnz again
{ the following steps are equivalent to sums[1] := sums[1] + sums[2] }
{ xmm3 := xmm2 }
movupd xmm3, xmm2
{ copy double value from higher position to lower position in xmm3 }
unpckhpd xmm3, xmm3
{ add the two numbers: sums[1] := sums[1] + sums[2] }
addpd xmm2, xmm3
movupd sums, xmm2
end ['eax', 'ebx', 'ecx', 'xmm0', 'xmm1', 'xmm2', 'xmm3'];
M3[row, col] := sums[1];// + sum[2];//}
gsum := gsum + M3[row, col];
end;
WriteLn('Grand sum of all the numbers in the result matrix: ', gsum);
WriteLn('Time: ', round((now-T)*24*60*60*1000),' ms');
WriteLn('');
WriteLn('First 10 rows and 3 columns of the result matrix:');
for row := 1 to 10 do
begin
for col := 1 to 3 do
Write(M3[row, col], ' ');
WriteLn('');
end;
end.
"It's sad that the compiler doesn't do the job with pointer arithmetic for array indexing and makes us do it by hand."

It is. I'd love to get my hands dirty, but I'm lacking the required skills. I'll retry when I get some time; maybe I was just not experienced enough.
Pointer arithmetic is bad practice.
"I wonder if SSE2 speed on AMD CPUs is not as good as on Intel's."

My trusted AMD CPU is 10 years old... don't blame it ;)
without transposition the results are wrong
For i:=0 to n-1 do
for k:=0 to n-1 do
for j:=0 to n-1 do result[i,j] += mat1[i,k]*mat2[k,j];
wp, engkin: it's not my idea, it's the usual "ikj" algorithm, and it gives a result identical to the "ijk" one without any transpositions. The speedup is due to a more cache-friendly memory access pattern.
There is almost no speedup when Float = Extended or Float = Int64, for some reason. I didn't test in 64-bit mode.
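A small self-contained check (my sketch, with a 4x4 matrix for brevity) that the "ikj" order produces the same matrix as the classic "ijk" order:

```pascal
program LoopOrderCheck;
{ Verifies that the cache-friendly "ikj" loop order gives the same
  result as the textbook "ijk" order, without any transposition. }
const m = 4;
type TMat = array[0..m-1, 0..m-1] of Double;
var
  a, b, c1, c2: TMat;
  i, j, k: Integer;
  ok: Boolean;
begin
  ok := True;
  for i := 0 to m-1 do
    for j := 0 to m-1 do
    begin
      a[i,j] := Random; b[i,j] := Random;
      c1[i,j] := 0; c2[i,j] := 0;
    end;
  { ijk: one dot product per output element }
  for i := 0 to m-1 do
    for j := 0 to m-1 do
      for k := 0 to m-1 do
        c1[i,j] := c1[i,j] + a[i,k] * b[k,j];
  { ikj: the inner loop walks rows of b and c linearly }
  for i := 0 to m-1 do
    for k := 0 to m-1 do
      for j := 0 to m-1 do
        c2[i,j] := c2[i,j] + a[i,k] * b[k,j];
  for i := 0 to m-1 do
    for j := 0 to m-1 do
      if Abs(c1[i,j] - c2[i,j]) > 1e-12 then ok := False;
  if ok then WriteLn('ikj matches ijk');
end.
```

For each output element the additions even happen in the same k order, so the two results agree exactly, not just within tolerance.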
My implementation was missing the initial zeroing out (but the test passed :o ). Here is the fixed one (still 3 times faster for Double):

Type
Float = Double;
PFloat = ^Float;
Vector=Array[0..n-1] of Float;
Matrix=Array[0..n-1] of Vector;
PMatrix = ^Matrix;
procedure Multiply(Const mat1, mat2: PMatrix; var Result: PMatrix);
Var
i, j, k: Word;
u, v, x: PFloat;
Begin
For i:=0 to n-1 do Begin
u := @mat1^[i];
x := @Result^[i];
For k:=0 to n-1 do begin
x^ := 0;
inc(x);
end;
For k:=0 to n-1 do begin
v := @mat2^[k];
x := @Result^[i];
For j:=0 to n-1 do begin
x^ += u^*v^;
inc(v);
inc(x);
end;
inc(u);
End;
end;
End;
It's sad that the compiler doesn't do the job with pointer arithmetic for array indexing and makes us do it by hand.
Pointer arithmetic is bad practice.
{$FPUTYPE SSE3}
Function Mul(Const mat1,mat2:Matrix):Matrix;
Var
i,j,k:Word;
u,v,p:pextended;
Begin
For i:=0 to n-1 do
begin
u:=@mat1[i];
For j:=0 to n-1 do
Begin
v:=@mat2[j];p:=@Mul[i];
For k:=0 to (n div 2)-1 do
begin
p^+=u^*v^;inc(v);inc(p);
p^+=u^*v^;inc(v);inc(p);
end;
inc(u);
End;
end;
End;
Not bad. On my AMD Athlon, it is 3000ms (vs. 4400ms using plain Pascal with SSE2 support). Still, I would think the compiler is not doing a bad job.
Wait... 3700ms without using assembler :)

{$FPUTYPE SSE3}
Function Mulx(Const mat1,mat2:Matrix):Matrix;
Var
i,j,k,m: longint;
sum: float;
u,v: Pfloat;
mat: Matrix;
Begin
For i:=0 to n-1 do for j:=0 to n-1 do mat[j,i]:=mat2[i,j];
For i:=0 to n-1 do
begin
For j:=0 to n-1 do begin
sum:=0;
u:=@mat1[i];
v:=@mat [j];
For k:= 0 to n div 2 -1 do
Begin
sum+=u^*v^; Inc(u);Inc(v);
sum+=u^*v^; Inc(u);Inc(v);
End;
result[i,j]:=sum;
end;
end;
End;
"Maybe Free Pascal automatically initializes a variable to be zero?"

No, only dynamic arrays are zeroed on creation with SetLength(). But I think argb32 does in fact zero his tmp variable by
For k:=0 to n-1 do begin
x^ := 0;
inc(x);
"And I forgot to mention that my Windows 8.1 is 64bit."

If you want to use 64 bit, you can download the x64 version of FPC. It is a command-line cross compiler and comes without an IDE; to use it with an IDE you would have to build it yourself. At least this is what I understood; I do not know how to do this, however.
"It is sad that Free Pascal can't do SSE2 optimization in such a simple case."

Yes, well, I would love FPC to be the best optimising compiler in the world, but I guess that would need a lot of resources, which are tied up elsewhere, on issues such as Unicode support and platform independence. We should also consider that the vast majority of programmers use C or one of its clones, and accordingly huge effort has been put into those compilers. It should be no surprise that they may perform better.
"Added note: my Free Pascal version is 2.6.4. I don't know if a higher version will do the correct SSE2 optimization."

I have tried it with 3.1.1 (as of 29.1.) and the answer is no.
On the other hand, algebraic algorithms like matrix multiplication have been optimised manually to buggery over the past 30 years, with each CPU manufacturer offering its own optimised BLAS library, and no compiler will reach the performance of these highly tuned libraries anyway. The best solution in your case is probably to link one of these libraries into your code. Why reinvent the wheel?

"But how to find and link these libraries, if they are free (I don't want to use commercial things in my program), for example for an Intel 64bit CPU, in my Free Pascal program?"
Google its name and read the license yourself?
{$mode objfpc}
Uses SysUtils;
Const Dim=1000;
Type
Matrix=Array[0..Dim-1, 0..Dim-1] of double;
int = longint;
Var
T0: TDateTime;
A0, B0, C0: Matrix;
A, B, C: pdouble;
procedure DGEMM (TRANSA,TRANSB: char;
m,n,k: int;
alpha: double;
A : pdouble;
LDA : int;
B : pdouble;
LDB : int;
beta : double;
C : pdouble;
LDC : int); cdecl; external 'libopenblas';
procedure Init (var M: Matrix);
Var i,j: dword;
Begin
For i:=0 to Dim-1 do for j:=0 to Dim-1 do M[i,j] := 1.0* (i+j);
End;
begin
Init (A0); B0:=A0; C0 := B0;
A := @A0; B := @B0; C := @C0;
T0 := Now;
DGEMM ('n','n', 1000, 1000, 1000, 1.0, A, 1000, B, 1000, 0.0, C, 1000);
Writeln('Time elapsed: ',(Now-T0)*3600*24*1000:0:0,'ms.');
end.
Alright, we don't even need the pointers if the matrix is passed by variable. Attached simplified code.
@photor - instruction:
Download BLAS and mingw pack from http://sourceforge.net/projects/openblas/files/v0.2.13/. Unpack the dlls into a directory of your choice and set this directory under Options / Directories / Libraries in the FPC IDE. Then the code should already compile.
You might consider using the 64bit version, and the multithreaded options proposed by Steindorff and JorgeAldo. On my single core CPU, I do not expect a benefit from multiple threads.

{$mode objfpc}
Uses SysUtils;
Const Dim=1000;
Type
Matrix=Array[0..Dim-1, 0..Dim-1] of double;
int = longint;
Var
T0: TDateTime;
A0, B0, C0: Matrix;
procedure DGEMM (constref TRANSA,TRANSB: char;
constref m,n,k: int;
constref alpha: double;
var A : Matrix;
constref LDA : int;
var B : Matrix;
constref LDB : int;
constref beta : double;
var C : matrix;
constref LDC : int); cdecl; external 'libopenblas';
procedure Init (var M: Matrix);
Var i,j: dword;
Begin
For i:=0 to Dim-1 do for j:=0 to Dim-1 do M[i,j] := 1.0* (i+j);
End;
begin
Init (A0); B0:=A0;
T0 := Now;
DGEMM ('n','n', Dim, Dim, Dim, 1.0, A0, Dim, B0, Dim, 0.0, C0, Dim);
Writeln('Time elapsed: ',(Now-T0)*3600*24*1000:0:0,'ms.');
end.
"I have not followed him in a while, but you should check out the stuff "aminer" offers. I believe I saw he offered a multi-threaded matrix multiply a while ago where the times were almost 50 times faster than a single thread."

I have checked aminer's stuff, but it has gone.
I have tested your code. Compiling is OK, but when running it gives the error message "libgfortran-3.dll not found". Why?
Of course I have mingw64.dll, otherwise there would be an error message "mingw64.dll not found" before the libgfortran-3.dll one.
You need to download mingw32_dll.zip (http://sourceforge.net/projects/openblas/files/v0.2.13/mingw32_dll.zip/download) (or mingw64_dll.zip (http://sourceforge.net/projects/openblas/files/v0.2.13/mingw64_dll.zip/download)) from that same link.
Of course I have mingw64.dll, otherwise there would be an error message "mingw64.dll not found"
Here is a post for his matrix demo back in 2013:
Thanks. Following your instruction, I downloaded mingw64_dll.zip and obtained all the dll files needed. This time the test program compiled and ran, but finally raised an exception class "External: SIGSEGV". Has anybody successfully tested this in the 64bit case?
Yes, sure. You *might* have mingw64.dll alone. The link provides libgfortran-3.dll, libgcc_s_seh-1.dll, and libquadmath-0.dll which you need to put in the same directory (or path) with your exe file. So yes, you need to download mingw64_dll.zip and expand it and put the files visible to your application.
I have checked all of these. The download links on his website are all redirected to a docs.google.com site, and eventually return a "not found" message.
http://forum.lazarus.freepascal.org/index.php/topic,23009.msg136762.html#msg136762
he still has the code on his web site at
http://pages.videotron.com/aminer/
were he says he worked off a math library from "mrsoft" @
http://www.mrsoft.org/home/index.html
hope that helps
Yes, I had the same issue with aminer's files. But having only a single core CPU, his major approach of splitting up tasks into threads ("workers") won't benefit me anyway.

But where can I download the 64bit FPC cross compiler?
I have not tried the 64bit version because I already thought this might be a step too far and would cause problems, most likely with passing the parameters. Even in a 32bit environment, it took me a while to translate the C expectations into Pascal syntax, using constref and var as required, and I received many SIGSEGVs before I got it right.
In the 64bit version, the parameters will be passed differently, in different registers and whatnot. It may be easier with the 64bit FPC cross compiler, which is available but without an IDE; to work with an IDE, you would have to make your own build first. My advice: the 32bit version spares you a lot of hassle, and you will not get a factor-2 improvement from using 64bit anyway.
http://www.freepascal.org/download.var

Thanks. But all the download links on that page will be redirected to
"Instead of sourceforge, go to Austria."

Ok, I see.
"Thanks. Following your instruction, I downloaded mingw64_dll.zip and obtained all the dll files needed. This time the test program compiled and ran, but finally raised an exception class "External: SIGSEGV". Has anybody successfully tested this in the 64bit case?"

Sorry I missed your post. I am on a 32bit OS so I can't be sure. But I see that the 64bit OpenBLAS comes in two flavors:
Win64-int32 for 64-bit Windows with 32-bit int (default on 64-bit Windows)
Win64-int64 for 64-bit Windows with 64-bit int
You might want to try the first one, as Nitorami's code is suitable for 32bit int:
int = longint;
OR (and this is just a guess) make:
int = int64;
After I replace "longint" with "int64", the 64bit version works!
Result comparison on my 2.7GHz i7 cpu with 64bit Win 8.1:
OpenBLAS 64bit: 140~150ms
Mathematica: 70~80ms
Although Mathematica is still two times faster than OpenBLAS, they are comparable.
If someone could look at the code that I attached and give me some guidance, I would be very grateful.
I do not know if it is possible to make all threads share the same array.
"I did not do a thorough investigation, but if you want ultimate performance, it will be hard to compete with Lapack/Blas/Atlas using your own program. To the best of my knowledge, memory access is the bottleneck when dealing with large amounts of data. Specifically, you need to take into account the memory cache, in particular with many data and few operations. I once had an example where setting 4 arrays to zero was the bottleneck. When trying to use 2 or more cores for each array, the program became SLOWER, because the cores were swapping the arrays in and out of the cache."

The exchange of memory in the cache is inevitable, with or without threads.
This can become quite involved and CPU/cache specific. When compiling Atlas (http://math-atlas.sourceforge.net), several methods are actually tested and the best is taken. At least that is what they claim.
MiSchi.
"Calling C and Fortran routines is not so difficult. You need to translate the function calls in a Pascal header and link the library. I could help you with a simple example. With 2-dimensional arrays, you need to take into account that C has the same row/column layout as Pascal, whereas Fortran has it reversed. I would expect that you may have to make a decision up front, whether you want a pure Pascal solution and do everything by yourself or go for ultimate performance by calling Lapack/Blas."

I do not expect to match LAPACK, but at least something could be learned and used as a guide to know where to improve my algorithms. So I told myself that, if necessary, I'm not afraid to explore the source code and take notes.
All above is only to the best of my knowledge and might be utterly wrong ;-)
"Sorry I cannot delve into your code now. Just a few hints:"

I know. That is what I assumed. I cannot compete against geniuses who have spent years perfecting that huge library.
- You cannot compete against LAPACK or similar, this will be at least factor 3-5 better than your best code
- Array of array of float is very convenient but not too efficient as the matrix may be spread across the memory and the compiler has to calculate the memory location on each access. As you already noted, a linear structure in memory is better, i.e. a vector.
- Don't get yourself into the trouble of manual memory allocation. Use dynamic arrays. You can still use pointer access to them if that is better; just assign a pdouble to @Matrix[0]. Only if you repeatedly have to allocate and deallocate the structure should you keep in mind that a dynamic array is automatically initialised with zeroes when created, which is not the case for a structure generated with New(). The initialisation takes a bit of time, but in general this is negligible.

That proposal was my third attempt, which I called "mixed" since it combines the declaration of a dynamic array with access by pointers.
- The major bottleneck with matrix multiplication, in my experience, is that one matrix can be read linearly, i.e. as it is arranged in memory, but the other cannot, which may give you extreme cache penalties. Always consider transposing this matrix before multiplication so it can be accessed linearly.

I am aware that the traditional algorithm suffers from cache penalties.
Edit: Splitting a task to several threads is not necessarily complicated, see the example for the Mandelbrot calculation here http://benchmarksgame.alioth.debian.org/u64q/program.php?test=mandelbrot&lang=fpascal&id=5
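Along those lines, a rough sketch (class and variable names are mine, not from the linked Mandelbrot example) of splitting the "ikj" multiply across two TThread workers; each worker owns a disjoint range of output rows, so no locking is needed:

```pascal
program ThreadedMul;
{$mode objfpc}
uses
  {$ifdef unix}cthreads,{$endif} Classes, SysUtils;
const
  n = 64;
type
  TMat = array[0..n-1, 0..n-1] of Double;
var
  a, b, c: TMat;
type
  { Each worker multiplies rows FLo..FHi of a into c. }
  TRowWorker = class(TThread)
  private
    FLo, FHi: Integer;
  public
    constructor Create(lo, hi: Integer);
    procedure Execute; override;
  end;

constructor TRowWorker.Create(lo, hi: Integer);
begin
  FLo := lo;
  FHi := hi;
  inherited Create(False);   { start immediately }
end;

procedure TRowWorker.Execute;
var
  i, j, k: Integer;
begin
  { "ikj" order, restricted to this worker's rows }
  for i := FLo to FHi do
    for k := 0 to n-1 do
      for j := 0 to n-1 do
        c[i,j] := c[i,j] + a[i,k] * b[k,j];
end;

var
  w1, w2: TRowWorker;
  i, j: Integer;
begin
  for i := 0 to n-1 do
    for j := 0 to n-1 do
    begin
      a[i,j] := 1.0; b[i,j] := 1.0; c[i,j] := 0.0;
    end;
  w1 := TRowWorker.Create(0, n div 2 - 1);
  w2 := TRowWorker.Create(n div 2, n - 1);
  w1.WaitFor; w2.WaitFor;
  w1.Free; w2.Free;
  { with all-ones inputs every entry of c should equal n = 64 }
  WriteLn('c[0,0] = ', c[0,0]:0:0, ', c[n-1,n-1] = ', c[n-1,n-1]:0:0);
end.
```

Since the row ranges are disjoint, the workers never write to the same memory, which is what makes the lock-free split safe.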
"If you can come up with a simple example which uses Lapack/Blas routines, I could try to help out, like a product of a vector and a matrix or similar."

I think I did not explain myself very well.
After so many years, I just realized that one of the main reasons is that matrices in Pascal (like C) are row-major instead of column-major, so the Multiply function in the code should be modified as follows. Now running the code takes about 1.5 seconds instead of 11 seconds.
Function Multiply(Const mat1,mat2:Matrix):Matrix;
Var
  i,j,k:Word;
  sum:Float;
  v:Vector;
Begin
  For j:=0 to n-1 do
  Begin
    For k:=0 to n-1 do v[k]:=mat2[k,j];
    For i:=0 to n-1 do
    Begin
      sum:=0;
      For k:=0 to n-1 do sum+=mat1[i,k]*v[k];
      Multiply[i,j]:=sum;
    End;
  End;
End;
Execution Time: 164 ms ;)
Well done, Photor! But why didn't you use I-K-J or K-I-J looping? They are row-oriented, and I see they are much faster. Later I'll post my results.
I may have done something wrong, I can't figure it out, my head is busy with other things, but this code speeds up calculations significantly (multiple times). But I repeat - it requires verification.
"You are doing unrolling; yes, it can help."

The compiler does that for you, and often more efficiently.
program MatrixMultiplication;
{$mode objfpc}
{$optimization on}
uses
  SysUtils, DateUtils;
const
  N = 1000;
type
  TMatrix = array [0..N - 1, 0..N - 1] of double;
  TVector = array [0..N - 1] of double;
var
  A, B: TMatrix;
  StartTime: TDateTime;

procedure InitializeMatrix(out M: TMatrix);
var
  i, j: integer;
begin
  for i := 0 to N - 1 do
    for j := 0 to N - 1 do
      M[i, j] := Random;
end;

function Multiply(const mat1, mat2: TMatrix): TMatrix;
var
  i, j, k: NativeUInt;
  sum: double;
  v: TVector;
begin
  for j := 0 to N - 1 do
  begin
    for k := 0 to N - 1 do
      v[k] := mat2[k, j];
    for i := 0 to N - 1 do
    begin
      sum := 0;
      k := 0;
      while k < N div 8 do
      begin
        sum += mat1[i, k+0] * v[k+0] + mat1[i, k+1] * v[k+1] +
               mat1[i, k+2] * v[k+2] + mat1[i, k+3] * v[k+3] +
               mat1[i, k+4] * v[k+4] + mat1[i, k+5] * v[k+5] +
               mat1[i, k+6] * v[k+6] + mat1[i, k+7] * v[k+7];
        inc(k, 8);
      end;
      Result[i, j] := sum;
    end;
  end;
end;

begin
  InitializeMatrix(A);
  InitializeMatrix(B);
  StartTime := Now;
  Multiply(A, B);
  Writeln('Execution Time: ', MilliSecondsBetween(Now, StartTime), ' ms');
  ReadLn;
end.
"while k < N div 8 do" is wrong.
@Alligator, of course it is! I believe this is sufficient to show the acceleration effect itself.
but it is still wrong. You are assuming N is a multiple of the unroll factor, but that is not the general case.
UPD: Corrected the code so that it is correct for N not divisible by the unroll factor.
The loop bound seems to be wrong, but it doesn't change the essence: unrolling the loop body gives a speedup that depends on the hardware; on mine, about 2x.
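The corrected code itself isn't quoted above; below is a minimal self-checking sketch (my own illustrative code, not the actual fix from the thread) of how an unroll-by-8 dot-product loop can stay correct when N is not a multiple of 8: the unrolled loop runs while at least 8 elements remain, and a scalar loop mops up the remainder.

```pascal
program UnrollSketch;
{$mode objfpc}
const
  N = 1003;  // deliberately NOT a multiple of the unroll factor
var
  a, v: array[0..N - 1] of double;
  i, k: integer;
  naive, unrolled: double;
begin
  for i := 0 to N - 1 do
  begin
    a[i] := i * 0.5;
    v[i] := 1.0 / (i + 1);
  end;

  { reference dot product }
  naive := 0;
  for k := 0 to N - 1 do
    naive += a[k] * v[k];

  { unroll by 8 while at least 8 elements remain }
  unrolled := 0;
  k := 0;
  while k <= N - 8 do
  begin
    unrolled += a[k] * v[k] + a[k + 1] * v[k + 1] +
                a[k + 2] * v[k + 2] + a[k + 3] * v[k + 3] +
                a[k + 4] * v[k + 4] + a[k + 5] * v[k + 5] +
                a[k + 6] * v[k + 6] + a[k + 7] * v[k + 7];
    inc(k, 8);
  end;
  { scalar loop mops up the remaining N mod 8 elements }
  while k < N do
  begin
    unrolled += a[k] * v[k];
    inc(k);
  end;

  { the two summation orders differ, so allow a small rounding tolerance }
  if Abs(naive - unrolled) < 1e-6 then
    Writeln('OK')
  else
    Writeln('MISMATCH');
end.
```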
Does that mean the fp compiler is not smart enough to do unrolling automatically?
At the moment, unfortunately, it is.
(https://wiki.freepascal.org/Optimization#Optimization_Switches)
Even at the -O3 optimization level, where the LOOPUNROLL optimization is activated:
-O3: O2 + CONSTPROP + DFA + USELOADMODIFYSTORE + LOOPUNROLL
BUT! A lot of work is being done, and sometime in the future it will be possible, I hope. Also, besides the native compiler/optimizer, you can use the LLVM backend; the situation will probably be better there (but I haven't tried it, because the LLVM backend is not yet available on Windows).
In fact, as you were told somewhere above: if you need to squeeze everything out of your hardware, it is better to compile this function with GCC/Clang/ICC, use ready-made specialized libraries, or switch to assembler inserts/assembler functions inside Pascal.
But you may well be satisfied with this performance, which can be achieved with pure Pascal at its current level.
Is the above speedup by unrolling related to SIMD?
Loop unrolling and SIMD are independent things.
And how can we know whether the compiled code uses SIMD or not?
You can find out very simply by looking at the assembly code. Place a breakpoint on the code section (line) of interest and run the program; when the breakpoint is triggered, open the assembler window: Ctrl+Shift+D (or menu "View" -> "Debug Windows" -> "Assembler").
I tried to use -CfAVX2 in the command line options, but it seems to have no effect on the speed.
Yes, for the most part this switch provides little to no speedup, since FPC will still use scalar versions of instructions rather than vector versions. Although... once I saw that it was able to use FMA instructions, thus speeding up the code.
In the example above, the speedup came from the processor being able to use more of its available ALUs: it realized it could do the work in parallel at the micro-op level, calculating twice as much per cycle.
By the way, adding -CfAVX2 replaced several instructions with a single SIMD FMA instruction
vfmadd231d xmm4,xmm8,[r11*8+rax+$18]
but, specifically on my hardware, it didn't give any speedup whatsoever
https://discourse.julialang.org/t/julia-matrix-multiplication-performance/55175/11
which works very well in Julia. I tried to do similar things in FPC, but have not yet succeeded (maybe it needs more checking).
If we are not focused on educational goals while multiplying two matrices, it's advisable to use libraries instead. For example, I just spent five minutes using the AlgLib Free Edition library, which has low-level optimizations inside. To do this, I placed the wrapper file xalglib.pas and the DLL file alglib402_64free.dll in the program folder, then compiled and ran the program.
You can also try these options:
https://github.com/mikerabat/mrmath
https://github.com/clairvoyant/cblas
But where can I get the AlgLib Free Edition library?
lmddgtfy (https://lmddgtfy.net/?q=AlgLib%20Free%20Edition)
Have you tried those packages?
But where can I get the AlgLib Free Edition library?
I returned from work and saw that I received help with the answer. Thanks, @TRon :)
@photor. You can download the zip file from https://www.alglib.net/download.php. Once downloaded, unzip it and copy both the wrapper and the DLL files to the program folder. That's all you need to do.
Also, @ALLIGATOR, you wrote some excellent code above. Would you mind if I attempted to parallelize it?
Execution Time 1: 987 ms
Execution Time 2 (Threaded): 71 ms
IsSame: TRUE
I think it's a pretty good result. :)
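The threaded source itself isn't shown above; below is a minimal sketch (my own illustrative code, not the actual parallelized program from the thread) of one common approach: split the rows of the result matrix between plain TThread workers. Since each thread writes a disjoint block of rows, no locking is needed, and the result is verified against a serial multiplication in the same spirit as the IsSame check above.

```pascal
program ThreadedMultiply;
{$mode objfpc}
uses
  {$ifdef unix}cthreads,{$endif} Classes, SysUtils;

const
  N = 200;            // small size so the demo finishes quickly
  NumThreads = 4;

type
  TMatrix = array[0..N - 1, 0..N - 1] of double;

var
  A, B, C, D: TMatrix;

type
  TMulThread = class(TThread)
  private
    FLo, FHi: integer;           // rows [FLo..FHi] computed by this thread
  protected
    procedure Execute; override;
  public
    constructor Create(lo, hi: integer);
  end;

constructor TMulThread.Create(lo, hi: integer);
begin
  FLo := lo;
  FHi := hi;
  inherited Create(False);       // start immediately
end;

procedure TMulThread.Execute;
var
  i, j, k: integer;
  sum: double;
begin
  for i := FLo to FHi do
    for j := 0 to N - 1 do
    begin
      sum := 0;
      for k := 0 to N - 1 do
        sum += A[i, k] * B[k, j];
      C[i, j] := sum;            // disjoint rows per thread: no lock needed
    end;
end;

var
  threads: array[0..NumThreads - 1] of TMulThread;
  t, i, j, k, rows, lo: integer;
  sum: double;
  same: boolean;
begin
  for i := 0 to N - 1 do
    for j := 0 to N - 1 do
    begin
      A[i, j] := Random;
      B[i, j] := Random;
    end;

  { split the rows of the result between the threads }
  rows := N div NumThreads;
  lo := 0;
  for t := 0 to NumThreads - 1 do
  begin
    if t = NumThreads - 1 then
      threads[t] := TMulThread.Create(lo, N - 1)  // last thread takes the remainder
    else
      threads[t] := TMulThread.Create(lo, lo + rows - 1);
    inc(lo, rows);
  end;
  for t := 0 to NumThreads - 1 do
  begin
    threads[t].WaitFor;
    threads[t].Free;
  end;

  { verify against a plain serial multiplication }
  for i := 0 to N - 1 do
    for j := 0 to N - 1 do
    begin
      sum := 0;
      for k := 0 to N - 1 do
        sum += A[i, k] * B[k, j];
      D[i, j] := sum;
    end;
  same := True;
  for i := 0 to N - 1 do
    for j := 0 to N - 1 do
      if C[i, j] <> D[i, j] then same := False;
  if same then Writeln('IsSame: TRUE') else Writeln('IsSame: FALSE');
end.
```

Because both versions perform the additions in the same order, the comparison here is exact rather than tolerance-based.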
Possible to parallelize the previous version using AlgLib Free Edition library?
I attempted to parallelize the AlgLib Free Edition library. It compiles and executes with a runtime of 50 to 60 ms. However, the results fail the check: IsSame: FALSE. It is suspected that the library is not thread-safe.
Out of curiosity, my colleague ran the same task on MATLAB R2015b (64-bit) and reported an execution time of 90 to 110 ms.
The execution time of the program (Reply #116), which was written in pure Object Pascal, is between 70 and 80 ms.
I must admit that a couple of years ago, I was initially skeptical about using FPC and Lazarus. Now I am convinced that Free Pascal is quite suitable for this type of task. ;)
P.S. All tests were conducted on a desktop with an Intel i7 8700 processor running Windows 11.
Updated. Matrix multiplication in NumPy takes about 10-15 ms. Does this library perhaps solve the problem on the GPU with cuBLAS (NVIDIA)?
So, anyway, AlgLib's 108 ms with a single thread is not bad. But the multi-threaded version of AlgLib is not free, unfortunately.
Performance Comparison Table Relative to NumPy:
Code I   (one-threaded)            from Reply #89    64.4
Code II  (one-threaded)            from Reply #121   51.0
Code III (one-threaded)            from Reply #103   24.2
Code IV  (AlgLib one-threaded)     from Reply #110    7.1
Code V   (Code III four-threaded)  from Reply #116    7.0
Code VI  (AlgLib four-threaded)    from Reply #120    3.5
NumPy    (twelve-threaded?)                           1.0
Hi @photor! Note: NumPy on my machine uses the MKL library. I tested performance with one and then four threads using that library. Here are the results:
                                                      ms
--------------------------------------------------------
Code I   (one-threaded)            from Reply #89    966
Code II  (one-threaded)            from Reply #121   765
Code III (one-threaded)            from Reply #103   363
Code IV  (AlgLib one-threaded)     from Reply #110   107
Code V   (Code III four-threaded)  from Reply #116   104
Code VI  (AlgLib four-threaded)    from Reply #120    53
NumPy    (one-threaded MKL)                           37
NumPy    (four-threaded MKL)                          14
--------------------------------------------------------
Thanks. Typically NumPy uses MKL or OpenBLAS as the backend. OpenBLAS is actually a completely free library written in C, so can Free Pascal call it directly instead of AlgLib?
My last attempt with a single thread; can you check the speed?
I made the array size variable (mode = Delphi, no range checks).
I will post something soon, when I have some spare time
Once memory for the arrays is allocated there is no speed difference between static and dynamic arrays as long as the code to access the array elements is exactly the same. Dynamic arrays only have a slight speed penalty at allocation time. Then again, a static array may affect the load time of the executable.
OpenBLAS time: 16 ms
Naive time: 2140 ms
Max difference: 0.0000000000
Results are consistent!
It appears that I am approaching the performance limit (16 ms), at least on my machine. ;)
Did you even read the code?
Yes, it does. It even implies what I wrote.
This has nothing to do with static/dynamic memory allocation; it's about fixed-size, contiguous, true 2D arrays vs. dynamically sized arrays of arrays, the latter of which requires an extra pointer dereference and may have terrible memory locality.
NO. There is no difference once the memory is allocated. None whatsoever. There are only indirections by way of the programmer, not the compiler.
@LV is it multithreaded?
Multithreading is not the holy grail. If your number of threads exceeds the CPU count, you are likely to end up with less efficient code if the code relies on a combined result.
If you initialize a 2-dimensional array, let's say to dimensions of length 5 and 10, then the RTL will do 6 memory allocations (1 for the first dimension and 5 for the arrays of the second dimension), and all of these might reside in different parts of the heap.
I don't use n-dimensional dynamic arrays (n>1), but I also thought that memory for them was allocated once, in one contiguous chunk. Thank you, now I know about this behavior 🙂
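To illustrate the allocation point, here is a small sketch (illustrative names, my own code) contrasting a jagged 2D dynamic array with a single contiguous buffer indexed manually in row-major order; the flat buffer is one allocation and one cache-friendly block, which is why it is preferred for numeric kernels.

```pascal
program FlatArraySketch;
{$mode objfpc}
const
  Rows = 5;
  Cols = 10;
var
  jagged: array of array of double;  // array of row pointers: 1 + Rows allocations
  flat: array of double;             // one contiguous allocation
  i, j: integer;
begin
  SetLength(jagged, Rows, Cols);
  SetLength(flat, Rows * Cols);
  for i := 0 to Rows - 1 do
    for j := 0 to Cols - 1 do
    begin
      jagged[i, j] := i * Cols + j;
      flat[i * Cols + j] := i * Cols + j;  // manual row-major indexing
    end;
  { both layouts hold the same values, but flat is a single block }
  Writeln(jagged[3, 7]:0:0, ' ', flat[3 * Cols + 7]:0:0);
end.
```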
As part of the exercise, I:
1. downloaded the libopenblas.dll file from https://sourceforge.net/projects/openblas/.
2. wrote a program.
Output (i7 8700; Windows 11; FPC 3.2.2):
OpenBLAS time: 16 ms
Naive time: 2140 ms
Max difference: 0.0000000000
Results are consistent!
OpenBLAS is automatically multi-threaded. Please set the environment variable OPENBLAS_NUM_THREADS=1 to disable it.
Yes, the OpenBLAS library does support multithreading. A procedure has been added to set the number of threads during execution, and the matrix sizes have been expanded to 3000x3000.
What's the result for N=1000, single-threaded?
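The OpenBLAS-calling program itself wasn't posted; below is a hedged sketch of how Free Pascal can call cblas_dgemm directly. The DLL name (libopenblas.dll, as downloaded from SourceForge) and the cdecl calling convention are assumptions; cblas_dgemm and openblas_set_num_threads are standard exports of OpenBLAS. Flat row-major buffers are used so that the data can be handed to the C API without any conversion.

```pascal
program DgemmSketch;
{$mode objfpc}
uses
  SysUtils;

const
  { CBLAS enum values from cblas.h }
  CblasRowMajor = 101;
  CblasNoTrans  = 111;

type
  PDouble = ^double;

{ assumed DLL name; on Linux it would be e.g. 'openblas' }
procedure cblas_dgemm(order, transA, transB, m, n, k: longint;
                      alpha: double; a: PDouble; lda: longint;
                      b: PDouble; ldb: longint;
                      beta: double; c: PDouble; ldc: longint);
  cdecl; external 'libopenblas.dll';

procedure openblas_set_num_threads(num: longint);
  cdecl; external 'libopenblas.dll';

const
  N = 1000;
var
  A, B, C: array of double;  // flat row-major buffers: element [i,j] at i*N+j
  i: integer;
  t0: QWord;
begin
  SetLength(A, N * N);
  SetLength(B, N * N);
  SetLength(C, N * N);
  for i := 0 to N * N - 1 do
  begin
    A[i] := Random;
    B[i] := Random;
  end;

  openblas_set_num_threads(1);  // force single-threaded, as asked above

  t0 := GetTickCount64;
  { C := 1.0 * A * B + 0.0 * C }
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              N, N, N, 1.0, @A[0], N, @B[0], N, 0.0, @C[0], N);
  Writeln('OpenBLAS single-thread time: ', GetTickCount64 - t0, ' ms');
end.
```

Note that the program requires the OpenBLAS DLL to be present next to the executable (or on the library search path) at run time.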
Amazed that I was wrong.
Maybe you were.