
Author Topic: efficiency problem  (Read 56628 times)

photor

  • Jr. Member
  • **
  • Posts: 80
efficiency problem
« on: March 15, 2015, 11:57:18 am »
In my work I frequently use big matrix multiplications, so their efficiency is vital. In the following simple code I multiply two 1000*1000 square matrices, which takes about 11 seconds on my MacBook Pro with a 2.7 GHz dual-core i7 CPU. In comparison, the same matrix multiplication takes only 0.07 seconds with a simple Mathematica program (I use version 8.0.4) on the same machine. What's wrong with FPC? Any suggestions? BTW, in both cases the platform I use is Windows 8.1. Sorry for my poor English.


Code: [Select]
program matmultiply;

Uses SysUtils;

Const
 n=1000;

Type
 Vector=Array[0..n-1] of Extended;
 Matrix=Array[0..n-1] of Vector;

Var
 StartMS:double;
 d,dd:Matrix;

Function Multiply(Const mat1,mat2:Matrix):Matrix;
Var
 i,j,k:Word;
 sum:Extended;
 v:Vector;
Begin
 For i:=0 to n-1 do
  For j:=0 to n-1 do
  Begin
   sum:=0;v:=mat1[i];
   For k:=0 to n-1 do sum+=v[k]*mat2[k,j];
   Multiply[i,j]:=sum;
  End;
End;

Function ChebyPts:Vector;
Var
 i:Word;
Begin
 For i:=0 to n-1 do ChebyPts[i]:=Cos(i*Pi/(n-1));
End;

Function ChebyDiff:Matrix;
Var
 i,j:Word;
 c,CP:Vector;
Begin
 For i:=1 to n-2 do c[i]:=1;
 c[0]:=2;c[n-1]:=2;
 CP:=ChebyPts;
 For i:=0 to n-1 do
  For j:=0 to n-1 do
   if i<>j then ChebyDiff[i,j]:=(c[i]/c[j])*Cos((i+j)*Pi)/(CP[i]-CP[j])
   else if i=0 then ChebyDiff[i,j]:=(2*(n-1)*(n-1)+1)/6
   else if i=n-1 then ChebyDiff[i,j]:=-(2*(n-1)*(n-1)+1)/6
   else ChebyDiff[i,j]:=-0.5*CP[j]/(1-CP[j]*CP[j]);
End;

begin
  d:=ChebyDiff;
  StartMS:=timestamptomsecs(datetimetotimestamp(now));
  dd:=Multiply(d,d);
  Writeln('Serially processed in '+floattostr(timestamptomsecs(datetimetotimestamp(now))-StartMS)+'ms.');
end.


marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12645
  • FPC developer.
Re: efficiency problem
« Reply #1 on: March 15, 2015, 12:26:28 pm »
Are you using the exact same looping code in Mathematica, with the most precise but slow Extended type? Or do you use its built-in optimized vector and matrix routines?

Anyway, a simple thing to try: store mat2[k,j] in a variable before the for-k loop. A pointer construct for the innermost loop might also help.

Code: [Select]
Function Multiply(Const mat1,mat2:Matrix):Matrix;
Var
 i,j,k:Word;
 sum:Extended;
 v:Vector;
 mat2value:extended;
 pv : PExtended;  // define as ^extended if not known
 pvend:pextended;
Begin
 For i:=0 to n-1 do
   begin
     v:=mat1[i];   // invariant in j loop.
  For j:=0 to n-1 do
  Begin
   sum:=0; mat2value:=mat2[k,j];
   pv:=@v[0]; pvend:=@v[n-1];
   while (pv<=pvend) do
     begin
       sum:=sum+pv^*mat2value;
       inc(pv);
     end;
   Multiply[i,j]:=sum;
  end;
  End;
End;

Basile B.

  • Guest
Re: efficiency problem
« Reply #2 on: March 15, 2015, 12:36:03 pm »
Are you sure that the Vector elements must be of type Extended? If you use Single, the FP coprocessor will still compute at 80 bits, but the data will be physically stored as 32 bits, which will reduce the processing time by 50%.

The time delta could be explained by the way Mathematica processes (e.g. SSE-optimized matrix multiplication, maybe even open compute?). You won't be able to get such results without an optimized assembly routine using the SSE instruction set (or a special math library). Plain FP is slow.

Pascal is not a domain-specific language, and writing such code is not necessarily easy for a scientist, while with Mathematica you don't have to mess with these technical details since everything is under the hood.
« Last Edit: March 15, 2015, 12:38:00 pm by Basile B. »

Leledumbo

  • Hero Member
  • *****
  • Posts: 8835
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: efficiency problem
« Reply #3 on: March 15, 2015, 12:37:18 pm »
I believe Mathematica (like MATLAB and Octave) uses parallelization whenever the engine sees a multicore environment. FPC doesn't do that yet. Btw, the FPC version and the optimization options play a big role in floating-point-intensive programs like this. Your program runs for 20 seconds on my Core i5-4200U when compiled plain, but only 11 seconds when I compile with full optimizations (on FPC 3.1.1; 2.6.4 doesn't have most of them), such as: -CX -XXs -O4 -CpCOREAVX2 -OpCOREAVX2 -CfAVX2. Depending on your needs, -OoFASTMATH could help too, by trading some precision for speed.

Blaazen

  • Hero Member
  • *****
  • Posts: 3241
  • POKE 54296,15
    • Eye-Candy Controls
Re: efficiency problem
« Reply #4 on: March 15, 2015, 12:52:08 pm »
Also, division by 6 may be slower than multiplying by 0.166666667. I don't know whether FPC already optimizes that.
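For illustration only (my sketch, not from the thread; the constant name InvSix is mine), the reciprocal trick looks like this:

```pascal
program divvsmul;
{ Sketch of the reciprocal trick: replace a repeated division
  by a multiplication with a precomputed reciprocal. }
const
  InvSix = 1.0 / 6.0;    { evaluated once, at compile time }
var
  x, byDiv, byMul: Double;
begin
  x := 2.0 * (1000 - 1) * (1000 - 1) + 1;
  byDiv := x / 6.0;      { a division on every evaluation }
  byMul := x * InvSix;   { a (usually cheaper) multiplication instead }
  WriteLn(byDiv:0:6);
  WriteLn(byMul:0:6);
end.
```

Whether the compiler already does this rewrite itself depends on the optimization settings, so measuring both variants is the only reliable check.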
Lazarus 2.3.0 (rev main-2_3-2863...) FPC 3.3.1 x86_64-linux-qt Chakra, Qt 4.8.7/5.13.2, Plasma 5.17.3
Lazarus 1.8.2 r57369 FPC 3.0.4 i386-win32-win32/win64 Wine 3.21

Try Eye-Candy Controls: https://sourceforge.net/projects/eccontrols/files/

photor

  • Jr. Member
  • **
  • Posts: 80
Re: efficiency problem
« Reply #5 on: March 15, 2015, 01:21:48 pm »
The documentation of Mathematica says it works at extended precision by default, which I understand as using the FPU by default. I am not sure if my understanding is correct.
Your code may help a little, but ... Mathematica still works incredibly much faster than FPC ... I don't know if the situation is similar for C, C++ or Fortran.

Are you using the exact same looping code in Mathematica, with the most precise but slow Extended type? Or do you use its built-in optimized vector and matrix routines?

Anyway, a simple thing to try: store mat2[k,j] in a variable before the for-k loop. A pointer construct for the innermost loop might also help.

Code: [Select]
Function Multiply(Const mat1,mat2:Matrix):Matrix;
Var
 i,j,k:Word;
 sum:Extended;
 v:Vector;
 mat2value:extended;
 pv : PExtended;  // define as ^extended if not known
 pvend:pextended;
Begin
 For i:=0 to n-1 do
   begin
     v:=mat1[i];   // invariant in j loop.
  For j:=0 to n-1 do
  Begin
   sum:=0; mat2value:=mat2[k,j];
   pv:=@v[0]; pvend:=@v[n-1];
   while (pv<=pvend) do
     begin
       sum:=sum+pv^*mat2value;
       inc(pv);
     end;
   Multiply[i,j]:=sum;
  end;
  End;
End;

Nitorami

  • Hero Member
  • *****
  • Posts: 605
Re: efficiency problem
« Reply #6 on: March 15, 2015, 01:51:21 pm »
This extreme difference in efficiency is not credible.

I reckon Mathematica uses the Strassen algorithm, which reduces the runtime for square matrix multiplication from N^3 to N^2.807. With N=1000 this would be a runtime reduction to 26% from the algorithm alone.
There are even better algorithms (Coppersmith–Winograd, N^2.37), but Wikipedia says these are not yet practical.

The major efficiency problem in your code is the inner loop and the access to mat2[k,j], where k is the loop variable: array access is significantly faster if you just increment the index. Maybe you should consider transposing one of the matrices before the multiplication.

And, by the way, I am getting a stack overflow here. You should consider using dynamic arrays instead of static ones.

On the coprocessor: FPC normally uses the x87, but you may try SSE with the switch {$FPUTYPE SSE3}. Sometimes this produces faster code, but not always.

Also, try to make sure that Mathematica does not just cache previously calculated results.
http://de.mathworks.com/company/newsletters/articles/matlab-incorporates-lapack.html
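As an illustration of the transpose idea (a sketch I am adding, not code from this post; it assumes the n, Vector and Matrix declarations and compiler settings of the original program):

```pascal
{ Sketch: cache-friendly variant of Multiply. Transposing mat2 first
  lets the inner loop scan both operands sequentially in memory,
  instead of striding down a column of mat2. The transpose is put on
  the heap to avoid yet another large stack object. }
Function MultiplyT(Const mat1, mat2: Matrix): Matrix;
Var
  i, j, k: Integer;
  sum: Extended;
  t: ^Matrix;
Begin
  New(t);
  For i := 0 to n-1 do
    For j := 0 to n-1 do
      t^[i,j] := mat2[j,i];            { t holds mat2 transposed }
  For i := 0 to n-1 do
    For j := 0 to n-1 do
    Begin
      sum := 0;
      For k := 0 to n-1 do
        sum += mat1[i,k] * t^[j,k];    { both accesses are row-wise now }
      MultiplyT[i,j] := sum;
    End;
  Dispose(t);
End;
```

The O(n^2) cost of building the transpose is negligible next to the O(n^3) multiplication, so the sequential inner loop usually pays for it many times over.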


« Last Edit: March 15, 2015, 01:57:35 pm by Nitorami »

photor

  • Jr. Member
  • **
  • Posts: 80
Re: efficiency problem
« Reply #7 on: March 15, 2015, 02:10:21 pm »
This is for scientific calculation, so at least double precision is needed. I have tried recompiling my program with double precision instead of extended, but the effect is negligible.
If Mathematica is really using the FPU to achieve extended precision, then there is no SSE or SSE2 optimization. I don't know what open compute is.

Are you sure that the Vector elements must be of type Extended? If you use Single, the FP coprocessor will still compute at 80 bits, but the data will be physically stored as 32 bits, which will reduce the processing time by 50%.

The time delta could be explained by the way Mathematica processes (e.g. SSE-optimized matrix multiplication, maybe even open compute?). You won't be able to get such results without an optimized assembly routine using the SSE instruction set (or a special math library). Plain FP is slow.

Pascal is not a domain-specific language, and writing such code is not necessarily easy for a scientist, while with Mathematica you don't have to mess with these technical details since everything is under the hood.

Leledumbo

  • Hero Member
  • *****
  • Posts: 8835
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: efficiency problem
« Reply #8 on: March 15, 2015, 02:15:13 pm »
The documentation of Mathematica says it works at extended precision by default, which I understand as using the FPU by default.
On a platform with FPU support, FPC will use the FPU. The x86(_64) FPU precisions are single (32-bit), double (64-bit) and extended (80-bit). Your program, compiled on my machine, runs with:
  • Single precision: 4928 ms
  • Double precision: 10043 ms
  • Extended precision: 11465 ms
Rounded, the measured ratio is 5 : 10 : 11.5. The ratio of the bit sizes is 2 : 4 : 5; multiplied by 2.5 to compare with the timings, that gives 5 : 10 : 12.5. So the speed difference between these types is roughly proportional to their bit size.
And, by the way, I am getting a stack overflow here. You should consider using dynamic arrays instead of static ones.
Same here, I needed to convert the code to dynamic arrays first.
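For reference, a sketch of such a conversion (the type names TVector and TMatrix are mine): dynamic arrays live on the heap, so the big matrices no longer overflow the stack.

```pascal
program dynmat;
{$mode objfpc}
const
  n = 1000;
type
  TVector = array of Extended;   { dynamic array: heap-allocated }
  TMatrix = array of TVector;
var
  d: TMatrix;
  i: Integer;
begin
  SetLength(d, n);               { allocate n rows }
  for i := 0 to n - 1 do
    SetLength(d[i], n);          { allocate n columns per row }
  d[0][0] := 1.0;                { indexed like a static array afterwards }
  WriteLn(Length(d), ' x ', Length(d[0]));
end.
```

One caveat: a dynamic TMatrix is an array of separately allocated rows, so the rows are not guaranteed to be contiguous in memory as they are with the static type.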

photor

  • Jr. Member
  • **
  • Posts: 80
Re: efficiency problem
« Reply #9 on: March 15, 2015, 02:16:38 pm »
Thanks for so many warm replies  :)
I will consider all these points, try to optimize my code, and let you know the result.

This extreme difference in efficiency is not credible.

I reckon mathematica uses the Strassen Algorithmus which reduces the runtime for quadratic matrix multiplications from N^3 to N^2.807. With N=1000 this would be a runtime reduction to 26% just by the algorithm.
There are even better algorithms (Coppersmith–Winograd) with N^2.37 but wikipedia says these are not practicable yet.

The major efficiency problem in your code is the inner loop and the access to mat2[k,j], where k is the loop variable. Array access is significantly faster if you just increment the index. Maybe you should consider to conjugate one of the matrices before multiplicaction.

And, by the way, I am getting a stack overflow here. You should consider to use dynamic arrays instead of static ones.

On the coprocessor: FPC normally uses th x87 but you may try SSE using the switch {$FPUTYPE SSE3}. Sometimes this produces faster code, but not always.

Also, try to make sure that matlab does not just cache previously calculated results.
http://de.mathworks.com/company/newsletters/articles/matlab-incorporates-lapack.html

Nitorami

  • Hero Member
  • *****
  • Posts: 605
Re: efficiency problem
« Reply #10 on: March 15, 2015, 02:24:50 pm »
You may wish to try an optimised library which is available for Delphi, so it should also work with FPC.

http://www.optivec.com/

wp

  • Hero Member
  • *****
  • Posts: 13353
Re: efficiency problem
« Reply #11 on: March 15, 2015, 03:09:40 pm »
To calculate 1 element of the 1000x1000 result matrix we need 1000 multiplications and additions, so for the full matrix we need 1000x1000x1000 = 1E9 operations. In the era of gigahertz processors, and assuming a few clock cycles per math operation, the execution time per multiplication/addition is on the order of 1 ns = 1E-9 s. Therefore the overall calculation time (in one thread) must be on the order of seconds - your FPC result is very reasonable.

I tried your program on Delphi XE2 and it runs for 11 seconds on my machine, exactly the same time as with Lazarus.

Bart

  • Hero Member
  • *****
  • Posts: 5675
    • Bart en Mariska's Webstek
Re: efficiency problem
« Reply #12 on: March 15, 2015, 03:27:29 pm »
@Marcov: your code produces an entirely different result matrix than the original one.

Bart


marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12645
  • FPC developer.
Re: efficiency problem
« Reply #14 on: March 15, 2015, 04:13:13 pm »
@Marcov: your code produces an entirely different result matrix than the original one.

Yeah, mat2 also uses index k; I misread that as [i,j].


 
