
Author Topic: efficiency problem  (Read 56987 times)

Nitorami

  • Hero Member
  • *****
  • Posts: 605
Re: efficiency problem
« Reply #45 on: March 20, 2015, 06:48:35 pm »
Yes, and the wikipedia article on BLAS also contains a lot of links to BLAS implementations. The BLAS interface is de facto a standard, rather old, and many implementations have been created, most of them open source. CPU manufacturers may offer their own implementation for download, particularly tuned to their specific processor. It might be better to get a generic implementation which is less optimised but more portable.

The problem for a noob like me is to get an overview of the jungle. I downloaded a Windows DLL from OpenBLAS just to try it out, and managed to link it as an external library, but then got an error message that this DLL needs yet another fortranwhatever.dll. Oh heavens, I do not know where to get that, and I do not like having my programs depend on a pile of external DLLs in the first place, so I gave it up. Can't be bothered to fuss around with that stuff any longer, but if you find out how to do it I would be interested to know...

Edit - I should also be grateful if someone could explain why an external library is not sufficient to get this thing running, and why I need to install MinGW and MSys and what not.
« Last Edit: March 20, 2015, 06:56:00 pm by Nitorami »

JorgeAldo

  • New Member
  • *
  • Posts: 11
Re: efficiency problem
« Reply #46 on: March 20, 2015, 09:46:46 pm »
I've not delved into the whole topic, but if your problem is easily parallelizable and the setup time isn't too big, you might try to split it into threads.

Search Google for "pascal actor model". It's a library of mine that makes it easier to parallelize problems in the background across an arbitrary number of actors. It's fully OOP and allows easy memory passing between actors and the foreground threads.
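Not that library specifically, but the basic idea of splitting a matrix multiply into threads can be sketched with plain TThread from the FPC RTL. This is a minimal example of my own (matrix size, thread count, and the row-band split are arbitrary choices, not anything from the thread): each worker computes a contiguous band of output rows.

```pascal
{$mode objfpc}
program threadmul;
uses
  {$ifdef unix}cthreads,{$endif} Classes, SysUtils;

const
  Dim = 200;
  NumThreads = 4;

type
  TMatrix = array[0..Dim-1, 0..Dim-1] of double;

var
  A, B, C: TMatrix;

type
  { Each worker computes output rows FFirst..FLast of C := A * B. }
  TMulThread = class(TThread)
  private
    FFirst, FLast: integer;
  protected
    procedure Execute; override;
  public
    constructor Create(AFirst, ALast: integer);
  end;

constructor TMulThread.Create(AFirst, ALast: integer);
begin
  FFirst := AFirst;
  FLast := ALast;
  inherited Create(False); { start running immediately }
end;

procedure TMulThread.Execute;
var
  i, j, k: integer;
  s: double;
begin
  for i := FFirst to FLast do
    for j := 0 to Dim-1 do
    begin
      s := 0.0;
      for k := 0 to Dim-1 do
        s := s + A[i,k] * B[k,j];
      C[i,j] := s;
    end;
end;

var
  Threads: array[0..NumThreads-1] of TMulThread;
  t, i, j: integer;
begin
  for i := 0 to Dim-1 do
    for j := 0 to Dim-1 do
    begin
      A[i,j] := 1.0;
      B[i,j] := 1.0;
    end;
  { Hand each thread an equal band of rows; rows are disjoint,
    so no locking is needed on C. }
  for t := 0 to NumThreads-1 do
    Threads[t] := TMulThread.Create(t * Dim div NumThreads,
                                    (t+1) * Dim div NumThreads - 1);
  for t := 0 to NumThreads-1 do
  begin
    Threads[t].WaitFor;
    Threads[t].Free;
  end;
  Writeln('C[0,0] = ', C[0,0]:0:1); { all-ones input: Dim = 200.0 }
end.
```

Because each thread writes a disjoint row band of C, the workers need no synchronisation beyond the final WaitFor.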

mas steindorff

  • Hero Member
  • *****
  • Posts: 565
Re: efficiency problem
« Reply #47 on: March 21, 2015, 01:57:02 am »
I have not followed him in a while, but you should check into the stuff "aminer" offers. I believe I saw he offered a multi-threaded matrix multiply a while ago where the times were almost 50 times faster than a single thread.
windows 10 &11, Ubuntu 21+ IDE 3.4 general releases

mnar53

  • Newbie
  • Posts: 3
Re: efficiency problem
« Reply #48 on: March 21, 2015, 07:44:11 am »
Just download mingw32_dll.zip or mingw64_dll.zip from the OpenBLAS SourceForge page (depending on whether you are after a 32 or 64 bit environment). They both contain three DLLs, which are needed at runtime because of how the OpenBLAS DLL was compiled: put them in the same folder as the openblas DLL. One final note: all the functions in OpenBLAS follow the cdecl convention. As an example:

function  dasum  (constref N: integer; X : pDoubleArray; constref incX : integer) : double; cdecl; external 'openblas.dll';

OpenBLAS is quite performant; actually I think it is the best choice for CPU computing, unless you are moving to GPU computing (e.g., CUDA), assuming you have a good graphics card.

By the way: OpenBLAS actually contains all the LAPACK routines, some of them optimized, but I think they all benefit from the use of the OpenBLAS routines.

Best regards

Nitorami

  • Hero Member
  • *****
  • Posts: 605
Re: efficiency problem
« Reply #49 on: March 21, 2015, 10:58:42 am »
Thank you all. I copied libopenblas.dll and the other required DLLs, and it compiles. I set up a small example for a square matrix multiply, but it throws an access violation. Probably an error in passing the parameters in C style. I tried constref and other variations, but it does not help. Can someone give me a hint how it should be done?

Code:
{$mode objfpc}
Uses SysUtils;

Const Dim=1000;

Type
 Matrix=Array[0..Dim-1, 0..Dim-1] of double;
 int = longint;

Var
 T0: TDateTime;
 A0, B0, C0: Matrix;
 A, B, C: pdouble;


procedure DGEMM (TRANSA,TRANSB: char;
                  m,n,k: int;
                  alpha: double;
                  A    : pdouble;
                  LDA  : int;
                  B    : pdouble;
                  LDB  : int;
                  beta : double;
                  C    : pdouble;
                  LDC  : int); cdecl; external 'libopenblas';



procedure Init (var M: Matrix);
Var i,j: dword;
Begin
 For i:=0 to Dim-1 do for j:=0 to Dim-1 do M[i,j] := 1.0* (i+j);
End;

begin
  Init (A0);  B0:=A0; C0 := B0;
  A := @A0; B := @B0; C := @C0;
  T0 := Now;

  DGEMM ('n','n', 1000, 1000, 1000, 1.0, A, 1000, B, 1000, 0.0, C, 1000);

  Writeln('Time elapsed: ',(Now-T0)*3600*24*1000:0:0,'ms.');
end.


EDIT: Wait. It works with the constref prefix for ALL parameters, including the chars, but not for the pdouble ones.

Performance: single core AMD Athlon, 32 bit BLAS: 7 times faster than the best Pascal code in this thread (excluding engkin's assembler version).
« Last Edit: March 21, 2015, 11:06:15 am by Nitorami »

Nitorami

  • Hero Member
  • *****
  • Posts: 605
Re: efficiency problem
« Reply #50 on: March 21, 2015, 11:33:27 am »
Alright, we don't even need the pointers if the matrix is passed by variable.  Attached simplified code.

@photor - instruction:
Download BLAS and mingw pack from http://sourceforge.net/projects/openblas/files/v0.2.13/. Unpack the dlls into a directory of your choice and set this directory under Options / Directories / Libraries in the FPC IDE. Then the code should already compile.

You might consider using the 64bit version, and the multithreaded options proposed by Steindorff and JorgeAldo. On my single core CPU, I do not expect a benefit from multiple threads.

Code:
{$mode objfpc}
Uses SysUtils;

Const Dim=1000;

Type
 Matrix=Array[0..Dim-1, 0..Dim-1] of double;
 int = longint;

Var
 T0: TDateTime;
 A0, B0, C0: Matrix;

procedure DGEMM (constref TRANSA,TRANSB: char;
                  constref m,n,k: int;
                  constref alpha: double;
                  var A    : Matrix;
                  constref LDA  : int;
                  var B    : Matrix;
                  constref LDB  : int;
                  constref beta : double;
                  var C    : matrix;
                  constref LDC  : int); cdecl; external 'libopenblas';

procedure Init (var M: Matrix);
Var i,j: dword;
Begin
 For i:=0 to Dim-1 do for j:=0 to Dim-1 do M[i,j] := 1.0* (i+j);
End;

begin
  Init (A0);  B0:=A0;
  T0 := Now;
  DGEMM ('n','n', Dim, Dim, Dim, 1.0, A0, Dim, B0, Dim, 0.0, C0, Dim);
  Writeln('Time elapsed: ',(Now-T0)*3600*24*1000:0:0,'ms.');
end.
« Last Edit: March 21, 2015, 11:41:29 am by Nitorami »

mnar53

  • Newbie
  • Posts: 3
Re: efficiency problem
« Reply #51 on: March 21, 2015, 12:47:08 pm »
I think it's better to use a pointer for your data; it's more general. As an example, I use a FreePascal DLL and pass it a VBA matrix (called a pSafeArray in FreePascal), and from this pSafeArray I can recover a pointer to the data. So the data never move from VBA and are never copied: FreePascal and OpenBLAS operate directly on them.

photor

  • Jr. Member
  • **
  • Posts: 80
Re: efficiency problem
« Reply #52 on: March 21, 2015, 02:58:08 pm »
Thanks.

Alright, we don't even need the pointers if the matrix is passed by variable.  Attached simplified code. [...]

finalpatch

  • New Member
  • *
  • Posts: 12
Re: efficiency problem
« Reply #53 on: March 23, 2015, 02:40:46 pm »
I don't believe Mathematica actually performs the matrix multiplication.  It is most likely that it merely returns a lazily evaluated object.  This can work because you only see a small portion of the result at any given time, so it only needs to compute the elements that are visible on screen as you scroll through them.

Nitorami

  • Hero Member
  • *****
  • Posts: 605
Re: efficiency problem
« Reply #54 on: March 23, 2015, 07:11:16 pm »
Quite possible, it would sound reasonable. But in any case it should be possible to force calculation of the whole matrix by requesting its determinant or similar.

Matlab definitely uses LAPACK to speed up calculations, and I was somewhat surprised to see what optimisation can achieve. Summary for my old machine (square matrix 1000 x 1000, double precision):

Photor's original code: 23500 ms
More cache friendly layout, by different loop arrangement: 16800 ms
More cache friendly layout, by matrix transposition: 8000 ms
Either of the two above, but using pointer arithmetic: 4000 ms
Further optimisation by loop unrolling: 3000 ms - this is, I reckon, about the best result you can get in Pascal without using assembler programming
OpenBLAS: 560 ms

Another factor of 3 should be possible by multithreading, and maybe a factor of 1.5...2 by going 64 bit. Altogether, the 70 ms cited by Photor at the start of this thread no longer sounds unrealistic.
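For reference, the "different loop arrangement" in the summary above is typically the ikj order: the inner loop then walks rows of B and C sequentially instead of striding down a column of B, so consecutive accesses stay within the same cache lines. A minimal self-contained sketch (matrix size and values are arbitrary choices of mine):

```pascal
{$mode objfpc}
program looporder;

const
  Dim = 300;

type
  TMatrix = array[0..Dim-1, 0..Dim-1] of double;

var
  A, B, C: TMatrix;
  i, j, k: integer;
  r: double;

begin
  for i := 0 to Dim-1 do
    for j := 0 to Dim-1 do
    begin
      A[i,j] := 1.0;
      B[i,j] := 1.0;
      C[i,j] := 0.0;
    end;

  { ikj order: for fixed i and k, the inner loop over j scans
    B[k,*] and C[i,*] row-wise, i.e. in memory order for Pascal's
    row-major layout. The naive ijk order would instead read B[k,j]
    with k varying, jumping a whole row's width on every step. }
  for i := 0 to Dim-1 do
    for k := 0 to Dim-1 do
    begin
      r := A[i,k];
      for j := 0 to Dim-1 do
        C[i,j] := C[i,j] + r * B[k,j];
    end;

  Writeln('C[0,0] = ', C[0,0]:0:1); { Dim ones summed: 300.0 }
end.
```

The result is identical to the naive loop; only the memory access pattern changes, which is where the 23500 ms to 16800 ms step in the summary comes from.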



argb32

  • Jr. Member
  • **
  • Posts: 89
    • Pascal IDE based on IntelliJ platform
Re: efficiency problem
« Reply #55 on: March 23, 2015, 09:22:37 pm »
Most likely Mathematica and other specialized software use different algorithms. For example, the Strassen algorithm for matrix multiplication has better complexity than O(n^3). Therefore the larger n is, the bigger the difference between such an algorithm and a naive ijk (or even ikj) implementation.
Also, Strassen seems to be even more cache friendly, as it processes the matrix in MxM blocks. Matrices can be stored as a set of such blocks to avoid conversions.
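As a rough illustration (not code from this thread), here is a minimal recursive Strassen sketch in FreePascal. It handles power-of-two sizes only and falls back to the plain triple loop below a small threshold; all names and the cutoff value are my own choices, and a real implementation would avoid the temporary allocations.

```pascal
{$mode objfpc}
program strassendemo;

type
  TMat = array of array of double;

{ Allocate an n x n matrix; SetLength zero-initialises it. }
function NewMat(n: integer): TMat;
begin
  SetLength(Result, n, n);
end;

function AddM(const A, B: TMat; n: integer): TMat;
var i, j: integer;
begin
  Result := NewMat(n);
  for i := 0 to n-1 do
    for j := 0 to n-1 do
      Result[i][j] := A[i][j] + B[i][j];
end;

function SubM(const A, B: TMat; n: integer): TMat;
var i, j: integer;
begin
  Result := NewMat(n);
  for i := 0 to n-1 do
    for j := 0 to n-1 do
      Result[i][j] := A[i][j] - B[i][j];
end;

{ Strassen recursion: 7 half-size multiplies instead of 8,
  giving O(n^log2(7)) ~ O(n^2.81) overall. }
function MulM(const A, B: TMat; n: integer): TMat;
var
  h, i, j, k: integer;
  A11, A12, A21, A22, B11, B12, B21, B22: TMat;
  M1, M2, M3, M4, M5, M6, M7: TMat;
begin
  Result := NewMat(n);
  if n <= 64 then
  begin
    { Base case: plain triple loop. }
    for i := 0 to n-1 do
      for k := 0 to n-1 do
        for j := 0 to n-1 do
          Result[i][j] := Result[i][j] + A[i][k] * B[k][j];
    Exit;
  end;
  h := n div 2;
  A11 := NewMat(h); A12 := NewMat(h); A21 := NewMat(h); A22 := NewMat(h);
  B11 := NewMat(h); B12 := NewMat(h); B21 := NewMat(h); B22 := NewMat(h);
  for i := 0 to h-1 do
    for j := 0 to h-1 do
    begin
      A11[i][j] := A[i][j];     A12[i][j] := A[i][j+h];
      A21[i][j] := A[i+h][j];   A22[i][j] := A[i+h][j+h];
      B11[i][j] := B[i][j];     B12[i][j] := B[i][j+h];
      B21[i][j] := B[i+h][j];   B22[i][j] := B[i+h][j+h];
    end;
  M1 := MulM(AddM(A11, A22, h), AddM(B11, B22, h), h);
  M2 := MulM(AddM(A21, A22, h), B11, h);
  M3 := MulM(A11, SubM(B12, B22, h), h);
  M4 := MulM(A22, SubM(B21, B11, h), h);
  M5 := MulM(AddM(A11, A12, h), B22, h);
  M6 := MulM(SubM(A21, A11, h), AddM(B11, B12, h), h);
  M7 := MulM(SubM(A12, A22, h), AddM(B21, B22, h), h);
  for i := 0 to h-1 do
    for j := 0 to h-1 do
    begin
      Result[i][j]     := M1[i][j] + M4[i][j] - M5[i][j] + M7[i][j];
      Result[i][j+h]   := M3[i][j] + M5[i][j];
      Result[i+h][j]   := M2[i][j] + M4[i][j];
      Result[i+h][j+h] := M1[i][j] - M2[i][j] + M3[i][j] + M6[i][j];
    end;
end;

var
  A, B, C: TMat;
  i, j, n: integer;
begin
  n := 128;
  A := NewMat(n); B := NewMat(n);
  for i := 0 to n-1 do
    for j := 0 to n-1 do
    begin
      A[i][j] := 1.0;
      B[i][j] := 1.0;
    end;
  C := MulM(A, B, n);
  Writeln('C[0,0] = ', C[0][0]:0:1); { all-ones input: n = 128.0 }
end.
```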

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: efficiency problem
« Reply #56 on: March 23, 2015, 10:20:26 pm »
Mathematica uses ATLAS, which during installation tests and selects the fastest kernel. Kernels are optimized to benefit from the L1 cache.

It turned out that the bottleneck is moving data from main memory to the registers, not the multiplication itself. Making sure data is already in the cache before it is needed increases the speed a lot by itself, maybe by issuing prefetch instructions to prepare the next data while doing the multiplication on a cache line's worth of numbers.
« Last Edit: March 23, 2015, 10:31:34 pm by engkin »

photor

  • Jr. Member
  • **
  • Posts: 80
Re: efficiency problem
« Reply #57 on: March 31, 2015, 03:37:44 pm »
I have tested your code. Compiling is OK, but when running it gives the error message "libgfortran-3.dll not found". Why?

Alright, we don't even need the pointers if the matrix is passed by variable.  Attached simplified code. [...]

photor

  • Jr. Member
  • **
  • Posts: 80
Re: efficiency problem
« Reply #58 on: March 31, 2015, 03:58:15 pm »
I have not followed him in a while, but you should check into the stuff "aminer" offers. I believe I saw he offered a multi-threaded matrix multiply a while ago where the times were almost 50 times faster than a single thread.
I have checked aminer's stuff, but it is gone.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: efficiency problem
« Reply #59 on: March 31, 2015, 04:07:03 pm »
I have tested your code. Compiling is ok, but when running it gives an error message "libgfortran-3.dll not found", why?

You need to download mingw32_dll.zip (or mingw64_dll.zip) from that same link.

 
