Author Topic: efficiency problem  (Read 43677 times)

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11452
  • FPC developer.
Re: efficiency problem
« Reply #30 on: March 17, 2015, 10:29:19 am »
I fixed the code from my original post, and it was barely faster, despite the main (k) loop having only one asm instruction different (the rowstride addition from a stack variable instead of just inc(<register>)). Somehow I couldn't convince the compiler to keep that (stride) variable in a register.

For me that is a strange result, since in such cases Delphi is faster. Maybe there is still some inflexibility in the register allocator.

Strange that something like that can have a 3x effect. The times for me (i7-3770, FPC 3.1.1, -O4): ptrcode 3.3s, original code 10s, my code 9s (so barely faster than the original array approach). 64-bit was 0.9s, but there extended = double = 8 bytes, it seems, matching the observations argb32 already made.

I saw that the mat2 copy is done using stos/stosl, and the compiler seems to do some loop modifications itself now.

I'll update this post with the code when I'm on my home machine again. (am home but on work laptop)

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: efficiency problem
« Reply #31 on: March 17, 2015, 06:18:37 pm »
Here is my experiment using SSE2 code:
Code: [Select]
program MatMulSSE2;
uses
  sysutils;

type
  valuetype = double; { should not be changed }

  {$R-}{$Q-}
const
  n = 1000;

var
  M1,M2,TM2,M3: array[1..n,1..n] of valuetype;
  row,col,k: integer;
  T: TDateTime;
  sum: valuetype;
  sums:array[1..2] of valuetype;
  gsum: valuetype;
  pRow, pCol: pointer;

label
  again;
begin
  Randomize;
 for row := 1 to n do
  for col := 1 to n do begin
    { example 1 } //{
    M1[row,col] := row; { or col }
    M2[row,col] := 1;//}

    { example 2 } {
    M1[row,col] := Random*50;
    M2[row,col] := Random*50;//}

    { example 3 } {
    if row = col then
    begin
      M1[row,col] := 1;
      M2[row,col] := 1;
    end
    else
    begin
      M1[row,col] := 0;
      M2[row,col] := 0;
    end;//}

    { Transposition of the second matrix }
    TM2[col, row] := M2[row,col];
  end;
 T := now;
 gsum := 0;
 for row := 1 to n do  //row from 1st matrix
  for col := 1 to n do  //col from 2nd matrix
  begin
    {  original code in Pascal --> to use remove } {
    sum := 0;
    for k:=1 to n do  //col from 1st / row from 2nd
      { using matrix M2 } {
      sum:=sum + M1[row,k]*M2[k,col]; //}

      { using transposed matrix TM2 }
      sum:=sum + M1[row,k]*TM2[col,k];
    M3[row, col] := sum;  //}

    {  SSE2 code } //{
    { SSE2 handles two double values at the same time }
    sums[1] := 0; sums[2] := 0;
    pRow := @M1[row, 1];
    pCol := @TM2[col, 1];

    {$ASMMODE intel}
    asm
      { loop counter }
      mov ecx, n/2

      { sum := 0 }
      movupd xmm2, sums

      { @M1[Row, 1] }
      mov eax, pRow

      { @M2[1, Col] }
      mov ebx, pCol

      again:

      { xmm0 := M1[Row, k .. k+1] }
      movapd xmm0, [eax]
      //movupd xmm0, [eax]

      { xmm1 := TM2[Col, k .. k+1] }
      movapd xmm1, [ebx]

      { xmm0 := M1[Row, k] * M2[k, Col] , M1[Row, k+1] * M2[k+1, Col] }
      mulpd xmm0, xmm1

      { xmm2 := xmm2 + xmm0 }
      addpd xmm2, xmm0

      { point at next pair of columns from M1[Row] }
      add eax, 16 { 2 * size of valuetype }

      { point at next pair of rows from TM2[Col] }
      add ebx, 16

      dec ecx
      jnz again

      { the following steps are equivalent to sums[1] := sums[1] + sums[2] }

      { xmm3 := xmm2 }
      movupd xmm3, xmm2

      { copy double value from higher position to lower position in xmm3 }
      unpckhpd xmm3, xmm3

      { add the two numbers: sums[1] := sums[1] + sums[2] }
      addpd xmm2, xmm3
      movupd sums, xmm2
    end ['eax', 'ebx', 'ecx', 'xmm0', 'xmm1', 'xmm2', 'xmm3'];
    M3[row, col] := sums[1];// + sums[2];//}
    gsum := gsum + M3[row, col];
  end;

  WriteLn('Grand sum of all the numbers in the result matrix: ', gsum);
  WriteLn('Time: ', round((now-T)*24*60*60*1000),' ms');
  WriteLn('');
  WriteLn('First 10 rows and 3 columns of the result matrix:');
  for row := 1 to 10 do
  begin
    for col := 1 to 3 do
      Write(M3[row, col], ' ');
    WriteLn('');
  end;
 end.

I adjusted one of the programs posted above. It doubled the speed on my old 32-bit system. SSE2 can do two double-precision multiplications at the same time.

Nitorami

  • Sr. Member
  • ****
  • Posts: 496
Re: efficiency problem
« Reply #32 on: March 17, 2015, 06:49:06 pm »
Not bad. On my AMD Athlon, it is 3000ms (vs. 4400ms using plain Pascal with SSE2 support). Still, I would think the compiler is not doing a bad job.

Wait... 3700ms without using assembler  :)

Code: [Select]
{$FPUTYPE SSE3}
Function Mulx(Const mat1,mat2:Matrix):Matrix;
Var
  i,j,k,m: longint;
  sum: float;
  u,v: Pfloat;
  mat: Matrix;
Begin
  For i:=0 to n-1 do for j:=0 to n-1 do mat[j,i]:=mat2[i,j];
  For i:=0 to n-1 do
  begin
    For j:=0 to n-1 do begin
      sum:=0;
      u:=@mat1[i];
      v:=@mat [j];
      For k:= 0 to n div 2 -1 do
      Begin
        sum+=u^*v^; Inc(u);Inc(v);
        sum+=u^*v^; Inc(u);Inc(v);
      End;
      result[i,j]:=sum;
    end;
  end;
End;
« Last Edit: March 17, 2015, 07:06:35 pm by Nitorami »

argb32

  • Jr. Member
  • **
  • Posts: 89
    • Pascal IDE based on IntelliJ platform
Re: efficiency problem
« Reply #33 on: March 17, 2015, 09:42:17 pm »
wp, engkin: it's not my idea, it's the usual "ikj" algorithm, and it gives results identical to the "ijk" one without any transposition. The speedup is due to a more cache-friendly memory access pattern.
There is almost no speedup when Float = Extended or Float = Int64, for some reason. I didn't test in 64-bit mode.
My implementation was missing the initial zeroing of the result (but the test passed  :o ). Here is the fixed one (still 3 times faster for double):
Code: [Select]
Type
  Float = Double;
  PFloat = ^Float;
  Vector=Array[0..n-1] of Float;
  Matrix=Array[0..n-1] of Vector;
  PMatrix = ^Matrix;

procedure Multiply(Const mat1, mat2: PMatrix; var Result: PMatrix);
Var
  i, j, k: Word;
  u, v, x: PFloat;
Begin
  For i:=0 to n-1 do Begin
    u := @mat1^[i];
    x := @Result^[i];
    For k:=0 to n-1 do begin
      x^ := 0;
      inc(x);
    end;
    For k:=0 to n-1 do begin
      v := @mat2^[k];
      x := @Result^[i];
      For j:=0 to n-1 do begin
        x^ += u^*v^;
        inc(v);
        inc(x);
      end;
      inc(u);
    End;
  end;
End;

It's sad that the compiler doesn't do the pointer arithmetic for array indexing itself and makes us do it by hand.
Pointer arithmetic is bad practice.
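For readers following along, here is a minimal, self-contained sketch (the program and data are illustrative, not from this thread) of the hand strength-reduction being discussed: the same row dot product written once with 2-D indexing and once with raw PDouble walks. The pointer form is what the compiler was not deriving on its own.

```pascal
program PtrWalk;
{ Illustrative demo only: dot product of two matrix rows, with 2-D
  indexing vs. the hand-done pointer walk discussed above. }
const
  n = 8;
type
  Matrix = array[0..n-1, 0..n-1] of Double;
var
  a, b: Matrix;
  i, k: Integer;
  s1, s2: Double;
  u, v: PDouble;
begin
  for i := 0 to n-1 do
    for k := 0 to n-1 do
    begin
      a[i, k] := i + k;
      b[i, k] := i * 0.5 - k;
    end;

  { indexed form: the address of a[3,k] is recomputed each pass }
  s1 := 0;
  for k := 0 to n-1 do
    s1 := s1 + a[3, k] * b[5, k];

  { pointer form: the address computation is strength-reduced by hand }
  s2 := 0;
  u := @a[3, 0];
  v := @b[5, 0];
  for k := 0 to n-1 do
  begin
    s2 := s2 + u^ * v^;
    Inc(u);  { Inc on a typed pointer advances by SizeOf(Double) }
    Inc(v);
  end;

  WriteLn(s1 = s2);  { TRUE: same operations in the same order }
end.
```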

Leledumbo

  • Hero Member
  • *****
  • Posts: 8757
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: efficiency problem
« Reply #34 on: March 18, 2015, 01:57:57 am »
Quote from: argb32
It's sad that the compiler doesn't do the pointer arithmetic for array indexing itself and makes us do it by hand.
Pointer arithmetic is bad practice.
It is. I'd love to get my hands dirty, but I lack the required skills. I'll retry when I get some time; maybe I was just not experienced enough.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: efficiency problem
« Reply #35 on: March 18, 2015, 02:50:22 pm »
@Nitorami, impressive. I wonder if SSE2 speed on AMD CPUs is not as good as on Intel's.

@argb32, without transposition the results are wrong.

Nitorami

  • Sr. Member
  • ****
  • Posts: 496
Re: efficiency problem
« Reply #36 on: March 18, 2015, 06:17:57 pm »
Quote
I wonder if SSE2 speed on AMD CPUs is not as good as on Intel's.
My trusted AMD CPU is 10 years old... don't blame it  ;)

Quote
without transposition the results are wrong

I think not. The transposition is there to avoid cache thrashing from accessing elements in steps of 1000. The same can be achieved by restructuring the loops like

Code: [Select]
  For i:=0 to n-1 do
  for k:=0 to n-1 do
  for j:=0 to n-1 do result[i,j] += mat1[i,k]*mat2[k,j];

The result looks ok to me, and the performance is approximately the same as with the transposition method.
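A quick way to convince yourself of the equivalence (a toy check of my own, not part of the benchmark): run both loop orders on a small matrix and compare element by element. Each c[i,j] still accumulates over k in ascending order, so the results are bit-identical, not merely close.

```pascal
program IkjCheck;
{ Toy verification, not from the benchmark: the reordered i-k-j loop
  produces bit-identical results to the classic i-j-k loop, because
  each c[i,j] accumulates over k in the same order in both versions. }
const
  n = 4;
type
  Matrix = array[0..n-1, 0..n-1] of Double;
var
  a, b, c1, c2: Matrix;
  i, j, k: Integer;
begin
  for i := 0 to n-1 do
    for j := 0 to n-1 do
    begin
      a[i, j] := i * n + j;   { arbitrary non-symmetric data }
      b[i, j] := i - 2 * j;
    end;

  { classic i-j-k order }
  for i := 0 to n-1 do
    for j := 0 to n-1 do
    begin
      c1[i, j] := 0;
      for k := 0 to n-1 do
        c1[i, j] := c1[i, j] + a[i, k] * b[k, j];
    end;

  { cache-friendly i-k-j order; the result must be zeroed first }
  FillChar(c2, SizeOf(c2), 0);
  for i := 0 to n-1 do
    for k := 0 to n-1 do
      for j := 0 to n-1 do
        c2[i, j] := c2[i, j] + a[i, k] * b[k, j];

  for i := 0 to n-1 do
    for j := 0 to n-1 do
      if c1[i, j] <> c2[i, j] then
      begin
        WriteLn('mismatch');
        Halt(1);
      end;
  WriteLn('identical');
end.
```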

Unfortunately, as argb32 said, we really need pointer arithmetic, as the plain Pascal code is 3x slower. I found that for simple loops over one-dimensional arrays FPC seems to produce quite efficient code, but for two-dimensional arrays there is a lot of room for optimisation.


« Last Edit: March 18, 2015, 06:22:06 pm by Nitorami »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: efficiency problem
« Reply #37 on: March 18, 2015, 06:58:43 pm »
Ah, that explains it. Thanks Nitorami.

photor

  • New Member
  • *
  • Posts: 49
Re: efficiency problem
« Reply #38 on: March 19, 2015, 10:51:20 am »
Maybe Free Pascal automatically initializes a variable to zero?
And I forgot to mention that my Windows 8.1 is 64-bit.

Quote from: argb32 on March 17, 2015, 09:42:17 pm
wp, engkin: it's not my idea, it's usual "ikj" algorithm and gives identical result to "ijk" one without any transpositions. [...]

photor

  • New Member
  • *
  • Posts: 49
Re: efficiency problem
« Reply #39 on: March 19, 2015, 12:23:11 pm »
Using argb32's idea and your SSE3 optimization, it now takes less than 1s (about 950ms) on my machine:
Code: [Select]
{$FPUTYPE SSE3}
Function Mul(Const mat1,mat2:Matrix):Matrix;
Var
 i,j,k:Word;
 u,v,p:pextended;
Begin
 For i:=0 to n-1 do
 begin
  u:=@mat1[i];
  For j:=0 to n-1 do
  Begin
   v:=@mat2[j];p:=@Mul[i];
   For k:=0 to (n div 2)-1 do
   begin
    p^+=u^*v^;inc(v);inc(p);
    p^+=u^*v^;inc(v);inc(p);
   end;
   inc(u);
  End;
 end;
End;

Quote from: Nitorami on March 17, 2015, 06:49:06 pm
Not bad. On my AMD Athlon, it is 3000ms (vs. 4400ms using plain Pascal with SSE2 support). [...]

photor

  • New Member
  • *
  • Posts: 49
Re: efficiency problem
« Reply #40 on: March 19, 2015, 02:00:20 pm »
No. Looking at the assembler output, I found that Free Pascal still compiles it as two mulsd instructions instead of one mulpd instruction. The speedup is due to reduced loop overhead (half as many jumps), I think. It is sad that Free Pascal can't do SSE2 vectorisation in such a simple case  :'(

Quote from: Nitorami on March 17, 2015, 06:49:06 pm
Not bad. On my AMD Athlon, it is 3000ms (vs. 4400ms using plain Pascal with SSE2 support). [...]

photor

  • New Member
  • *
  • Posts: 49
Re: efficiency problem
« Reply #41 on: March 19, 2015, 04:33:15 pm »
Added note: my Free Pascal version is 2.6.4. I don't know whether a newer version will do the correct SSE2 optimization.

Nitorami

  • Sr. Member
  • ****
  • Posts: 496
Re: efficiency problem
« Reply #42 on: March 19, 2015, 06:50:20 pm »
@photor

Quote
Maybe free pascal automatically initializes a variable to be zero?
No, only dynamic arrays are zeroed on creation, with SetLength(). But I think argb32 does in fact zero his temp variable with
Code: [Select]
    For k:=0 to n-1 do begin
      x^ := 0;
      inc(x);

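As a side note, the zero-initialisation of dynamic arrays is easy to demonstrate with a throwaway program (mine, for illustration):

```pascal
program ZeroInit;
{$mode objfpc}
{ Throwaway demonstration: elements of a dynamic array are zero-filled
  when allocated with SetLength, so they need no manual zeroing loop. }
var
  d: array of Double;
  i: Integer;
  allZero: Boolean;
begin
  SetLength(d, 1000);   { freshly allocated elements are zeroed }
  allZero := True;
  for i := 0 to High(d) do
    if d[i] <> 0 then
      allZero := False;
  WriteLn(allZero);
end.
```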
Quote
And I forgot to mention that my Windows 8.1 is 64bit.
If you want to use 64 bit, you can download the x64 version of FPC. It is a command-line cross compiler and comes without an IDE; to use it with an IDE you would have to build that yourself. At least this is what I understood; I do not know how to do it, however.
Quote
It is sad that free pascal can't do sse2 optimization in such simple case
Ja, well, I would love FPC to be the best optimising compiler in the world, but I guess that would need a lot of resources, which are tied up elsewhere, in issues such as Unicode support and platform independence. We should also consider that the vast majority of programmers use C or one of its clones, and accordingly huge effort has been put into those compilers. It should be no surprise that they may perform better.
On the other hand, algebraic algorithms like matrix multiplication have been hand-optimised to buggery over the past 30 years, with each CPU manufacturer offering its own optimised BLAS library, and no compiler will reach the performance of these highly tuned libraries anyway. The best solution in your case is probably to link one of these libraries into your code. Why reinvent the wheel?

Quote
Added note: My free pascal version is 2.64. I don't know if higher version will do the correct sse2 optimization.
I have tried it with 3.1.1 (as of 29.1.) and the answer is no.



photor

  • New Member
  • *
  • Posts: 49
Re: efficiency problem
« Reply #43 on: March 20, 2015, 03:38:12 pm »
On the other hand, algebraic algorithms like matrix multiplication have been optimised manually to buggery over the past 30 years, each CPU manufacturer offering his own optimised BLAS library, and no compiler will reach the performance of these highly tuned libraries anyway. The best solution in your case is probably to link one of these libraries into your code. Why re-inventing the wheel ?
But how do I find and link these libraries (free ones; I don't want to use commercial components), for example for a 64-bit Intel CPU, in my Free Pascal program?

Leledumbo

  • Hero Member
  • *****
  • Posts: 8757
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: efficiency problem
« Reply #44 on: March 20, 2015, 06:37:31 pm »
Quote from: photor
But how do I find and link these libraries (free ones; I don't want to use commercial components), for example for a 64-bit Intel CPU, in my Free Pascal program?
Google its name and read the license yourself?
http://www.netlib.org/lapack/
http://www.netlib.org/blas/
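To make the "just link BLAS" suggestion concrete, here is a hedged sketch of calling CBLAS dgemm from FPC. The library name in {$linklib} (openblas) is an assumption about a typical Linux install with OpenBLAS; adjust it for your system. The constants are the standard values from cblas.h.

```pascal
program BlasDemo;
{$mode objfpc}
{ Sketch only: multiplies a 2x2 identity by [[1,2],[3,4]] through
  CBLAS dgemm. Assumes OpenBLAS is installed; the linklib name is
  an assumption, not something from this thread. }
{$linklib openblas}

const
  CblasRowMajor = 101;  { values from cblas.h }
  CblasNoTrans  = 111;

procedure cblas_dgemm(order, transA, transB: LongInt;
                      m, n, k: LongInt;
                      alpha: Double; a: PDouble; lda: LongInt;
                      b: PDouble; ldb: LongInt;
                      beta: Double; c: PDouble; ldc: LongInt);
  cdecl; external;

var
  a, b, c: array[0..3] of Double;
begin
  a[0] := 1; a[1] := 0; a[2] := 0; a[3] := 1;  { identity }
  b[0] := 1; b[1] := 2; b[2] := 3; b[3] := 4;
  { c := 1.0 * a * b + 0.0 * c }
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              2, 2, 2, 1.0, @a[0], 2, @b[0], 2, 0.0, @c[0], 2);
  WriteLn(c[0]:0:0, ' ', c[1]:0:0, ' ', c[2]:0:0, ' ', c[3]:0:0);
end.
```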

 
