
Author Topic: efficiency problem  (Read 51490 times)

ALLIGATOR

  • Full Member
  • ***
  • Posts: 155
Re: efficiency problem
« Reply #105 on: February 19, 2025, 07:16:27 am »
Does that mean the fp compiler is not smart enough to do unrolling automatically?

At the moment, unfortunately, that is the case.

Even at the -O3 optimization level, where the "LOOPUNROLL" optimization is activated:
-O3: O2 + CONSTPROP + DFA + USELOADMODIFYSTORE + LOOPUNROLL

BUT! A lot of work is being done, and sometime in the future it will be possible, I hope. Also, besides the native compiler/optimizer you can use the LLVM backend - the situation will probably be better there (but I haven't tried it, because the LLVM backend is not available on Windows yet).

In fact, you were told correctly somewhere above: if you need to squeeze everything out of your hardware, it is better to compile this function with GCC/Clang/ICC, use ready-made specialized libraries, or switch to assembler inserts/assembler functions inside Pascal.

Still, you may be satisfied with the performance that pure Pascal can achieve at its current level.
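Since the compiler does not unroll automatically yet, the unrolling can be done by hand. Here is a minimal sketch (illustrative code, not from this thread; the factor 4 and all names are arbitrary) of a dot-product loop unrolled with independent accumulators:

```pascal
program UnrollDemo;

{$mode objfpc}

const
  N = 1003; // deliberately not a multiple of 4, to show the tail loop

function DotPlain(const a, b: array of double): double;
var
  k: integer;
begin
  Result := 0.0;
  for k := 0 to High(a) do
    Result := Result + a[k] * b[k]; // one long dependency chain
end;

function DotUnrolled(const a, b: array of double): double;
var
  k: integer;
  s1, s2, s3, s4: double;
begin
  s1 := 0.0; s2 := 0.0; s3 := 0.0; s4 := 0.0;
  k := 0;
  // four independent accumulators break the dependency chain,
  // so the CPU can keep several multiply-adds in flight at once
  while k + 3 <= High(a) do
  begin
    s1 := s1 + a[k] * b[k];
    s2 := s2 + a[k + 1] * b[k + 1];
    s3 := s3 + a[k + 2] * b[k + 2];
    s4 := s4 + a[k + 3] * b[k + 3];
    Inc(k, 4);
  end;
  Result := s1 + s2 + s3 + s4;
  while k <= High(a) do // scalar tail for the leftover elements
  begin
    Result := Result + a[k] * b[k];
    Inc(k);
  end;
end;

var
  x, y: array of double;
  k: integer;
begin
  SetLength(x, N);
  SetLength(y, N);
  for k := 0 to N - 1 do
  begin
    x[k] := k;
    y[k] := 1.0;
  end;
  Writeln(DotPlain(x, y):0:0, ' = ', DotUnrolled(x, y):0:0);
end.
```

The independent sums matter as much as the unrolling itself: with a single accumulator every addition depends on the previous one, so the CPU cannot overlap the multiply-adds.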

photor

  • Jr. Member
  • **
  • Posts: 80
Re: efficiency problem
« Reply #106 on: February 19, 2025, 09:45:22 am »
Does that mean the fp compiler is not smart enough to do unrolling automatically?

At the moment, unfortunately, that is the case.

Even at the -O3 optimization level, where the "LOOPUNROLL" optimization is activated:
-O3: O2 + CONSTPROP + DFA + USELOADMODIFYSTORE + LOOPUNROLL

BUT! A lot of work is being done, and sometime in the future it will be possible, I hope. Also, besides the native compiler/optimizer you can use the LLVM backend - the situation will probably be better there (but I haven't tried it, because the LLVM backend is not available on Windows yet).

In fact, you were told correctly somewhere above: if you need to squeeze everything out of your hardware, it is better to compile this function with GCC/Clang/ICC, use ready-made specialized libraries, or switch to assembler inserts/assembler functions inside Pascal.

Still, you may be satisfied with the performance that pure Pascal can achieve at its current level.

Another question: Is the above speed up by unrolling related to SIMD? And how can we know whether the compiled code uses SIMD or not? I tried to use -CfAVX2 in the command line options, but it seems to have no effect on the speed.
« Last Edit: February 23, 2025, 11:59:24 am by photor »

ALLIGATOR

  • Full Member
  • ***
  • Posts: 155
Re: efficiency problem
« Reply #107 on: February 19, 2025, 10:23:54 am »
Is the above speed up by unrolling related to SIMD?
Loop unrolling and SIMD are independent things.
In the example above, the speedup came from the processor being able to use more of its available ALUs: it realized it could do the work in parallel at the micro-operation level, i.e. calculate twice as much per cycle.

SIMD can give even more acceleration, because it performs 2-4 times more operations per instruction.

And how can we know whether the compiled code uses SIMD or not?
You can find out very simply by looking at the assembly code. Place a breakpoint on the line of interest and run the program; when the breakpoint is hit, open the assembler window with Ctrl+Shift+D (or menu "View" -> "Debug Windows" -> "Assembler").

But here, of course, you will need some understanding of the assembly language of your processor architecture. At the moment, as far as I know, FPC is not able to take advantage of SIMD when compiling user code. Various RTL routines, e.g. FillChar, are hand-optimized to take full advantage of SIMD, but that is manual optimization; automatic vectorization is not available yet, and neither are intrinsics that would let you use SIMD instructions in a convenient way.

I tried to use -CfAVX2 in the command line options, but it seems to have no effect on the speed.
Yes, for the most part this switch will provide little to no speedup, since FPC will still use scalar versions of instructions rather than vector versions. Although... I once saw that it was able to use FMA instructions, thus speeding up the code.

By the way, adding -CfAVX2 replaced several instructions with a single FMA (fused multiply-add, dst := dst + src1 * src2) instruction:
vfmadd231sd xmm4,xmm8,[r11*8+rax+$18]
but, specifically on my hardware, it didn't give any speedup whatsoever.

------
Please note that I am not a professional developer, only an amateur; besides, I am not an FPC developer, so my words should be treated with a healthy dose of skepticism. Perhaps, if I am wrong somewhere, more experienced forum members will correct me (ideally with concrete examples).

photor

  • Jr. Member
  • **
  • Posts: 80
Re: efficiency problem
« Reply #108 on: February 19, 2025, 11:22:00 am »
Is the above speed up by unrolling related to SIMD?
Loop unrolling and SIMD are independent things.
In the example above, the speedup came from the processor being able to use more of its available ALUs: it realized it could do the work in parallel at the micro-operation level, i.e. calculate twice as much per cycle.

And how can we know whether the compiled code uses SIMD or not?
You can find out very simply by looking at the assembly code. Place a breakpoint on the line of interest and run the program; when the breakpoint is hit, open the assembler window with Ctrl+Shift+D (or menu "View" -> "Debug Windows" -> "Assembler").

I tried to use -CfAVX2 in the command line options, but it seems to have no effect on the speed.
Yes, for the most part this switch will provide little to no speedup, since FPC will still use scalar versions of instructions rather than vector versions. Although... I once saw that it was able to use FMA instructions, thus speeding up the code.

By the way, adding -CfAVX2 replaced several instructions with a single FMA (fused multiply-add, dst := dst + src1 * src2) instruction:
vfmadd231sd xmm4,xmm8,[r11*8+rax+$18]
but, specifically on my hardware, it didn't give any speedup whatsoever.

Thanks, your words are very helpful. An important way to speed up matrix multiplication is recursive partition, see for example
https://discourse.julialang.org/t/julia-matrix-multiplication-performance/55175/11
which works very well in Julia. I tried to do similar things in FPC, but haven't succeeded so far (it may need more checking).
« Last Edit: February 23, 2025, 11:59:03 am by photor »

LV

  • Sr. Member
  • ****
  • Posts: 266
Re: efficiency problem
« Reply #109 on: February 19, 2025, 07:47:09 pm »
https://discourse.julialang.org/t/julia-matrix-multiplication-performance/55175/11
which works very well in Julia. I tried to do similar things in FPC, but haven't succeeded so far (it may need more checking).

I was a little interested in the Julia programming language.  :)
I will try to transfer this recursive algorithm to Pascal. If I am not mistaken, then:

Code: Pascal
program RecursiveMatrixMultiplication;

uses
  SysUtils,
  Math,
  DateUtils;

const
  N = 1000;
  RECURSION_LIMIT = 256;

type
  TMatrix = array [0..N - 1, 0..N - 1] of double;

var
  A, B, C: TMatrix;
  StartTime, EndTime: TDateTime;

  procedure InitializeMatrix(var M: TMatrix);
  var
    i, j: integer;
  begin
    for i := 0 to N - 1 do
      for j := 0 to N - 1 do
        M[i, j] := Random;
  end;

  procedure BaseCaseMultiply(x1, y1, x2, y2, size: integer);
  var
    i, j, k: integer;
    sum: double;
  begin
    for i := 0 to size - 1 do
      for j := 0 to size - 1 do
      begin
        sum := 0.0;
        for k := 0 to size - 1 do
          sum := sum + A[x1 + i, y1 + k] * B[x2 + k, y2 + j];
        C[x1 + i, y2 + j] := C[x1 + i, y2 + j] + sum;
      end;
  end;

  procedure RecursiveMultiply(x1, y1, x2, y2, size: integer);
  var
    half: integer;
  begin
    if size <= RECURSION_LIMIT then
      BaseCaseMultiply(x1, y1, x2, y2, size)
    else
    begin
      half := size div 2;

      RecursiveMultiply(x1, y1, x2, y2, half);                            // A11*B11
      RecursiveMultiply(x1, y1, x2, y2 + half, half);                     // A11*B12
      RecursiveMultiply(x1, y1 + half, x2 + half, y2, half);              // A12*B21
      RecursiveMultiply(x1, y1 + half, x2 + half, y2 + half, half);       // A12*B22

      RecursiveMultiply(x1 + half, y1, x2, y2, half);                     // A21*B11
      RecursiveMultiply(x1 + half, y1, x2, y2 + half, half);              // A21*B12
      RecursiveMultiply(x1 + half, y1 + half, x2 + half, y2, half);       // A22*B21
      RecursiveMultiply(x1 + half, y1 + half, x2 + half, y2 + half, half); // A22*B22
    end;
  end;

begin
  Randomize;
  InitializeMatrix(A);
  InitializeMatrix(B);
  FillChar(C, SizeOf(C), 0);

  StartTime := Now;
  RecursiveMultiply(0, 0, 0, 0, N);
  EndTime := Now;

  Writeln('Execution Time: ', MilliSecondsBetween(EndTime, StartTime), ' ms');
  Readln;
end.

Code: Text
Execution Time: 1507 ms

Parallel multiplication of matrices according to the program from reply #91.


Code: Text
Number of threads
1    Execution Time: 1037 ms
2    Execution Time: 523 ms
3    Execution Time: 365 ms
4    Execution Time: 298 ms
5    Execution Time: 218 ms
6    Execution Time: 209 ms

Desktop i7-8700. The optimization level is set to -O3 everywhere.

Updated: Graphics processors and the OpenCL framework are excellent choices for high-performance computing tasks. You can find open-source Free Pascal projects for this by searching online.


« Last Edit: February 20, 2025, 07:11:11 am by LV »

LV

  • Sr. Member
  • ****
  • Posts: 266
Re: efficiency problem
« Reply #110 on: February 20, 2025, 09:39:37 pm »
If we are not focused on educational goals while multiplying two matrices, it is advisable to use libraries instead. For example, I just spent five minutes using the AlgLib Free Edition library, which has low-level optimizations inside. To do this, I placed the wrapper file xalglib.pas and the DLL alglib402_64free.dll in the program folder, then compiled and ran the program.

Desktop i7-8700
Code: Text
Execution Time: 108 ms
IsSame: TRUE

Code: Pascal
program MatrixMultiplicationALGLIB;

{$mode objfpc}

uses
  SysUtils,
  Math,
  DateUtils,
  xalglib;

const
  N = 1000;

var
  A, B, C, CBase: xalglib.TMatrix;
  i, j: integer;
  StartTime, EndTime: TDateTime;

  procedure BaseCaseMultiply(x1, y1, x2, y2, size: integer);
  var
    i, j, k: integer;
    sum: double;
  begin
    for i := 0 to size - 1 do
      for j := 0 to size - 1 do
      begin
        sum := 0.0;
        for k := 0 to size - 1 do
          sum := sum + A[x1 + i, y1 + k] * B[x2 + k, y2 + j];
        CBase[x1 + i, y2 + j] := sum;
      end;
  end;

  function IsSame(const mat1, mat2: TMatrix): boolean;
  var
    i, k: nativeuint;
  begin
    Result := True;
    for i := 0 to N - 1 do
      for k := 0 to N - 1 do
        if not SameValue(mat1[i, k], mat2[i, k]) then Exit(False);
  end;

begin
  SetLength(A, N, N);
  SetLength(B, N, N);
  SetLength(C, N, N);
  SetLength(CBase, N, N);

  Randomize;
  for i := 0 to N - 1 do
    for j := 0 to N - 1 do
    begin
      A[i, j] := Random(10);
      B[i, j] := Random(10);
    end;

  StartTime := Now;
  RMatrixGEMM(N, N, N, 1.0, A, 0, 0, 0, B, 0, 0, 0, 0.0, C, 0, 0);
  EndTime := Now;

  Writeln('Execution Time: ', MilliSecondsBetween(EndTime, StartTime), ' ms');

  BaseCaseMultiply(0, 0, 0, 0, N);
  WriteLn('IsSame: ', IsSame(C, CBase));

  readln;
end.

In developing this program, I also tried to create a block decomposition, which gives an execution time of about 50 ms. However, I have not completely debugged it. :(
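For reference, the block-decomposition idea can be sketched in plain Pascal roughly like this (a hypothetical sketch under my own assumptions, not the undebugged code mentioned above; BLOCK = 64 is an assumed tile size):

```pascal
program BlockedMultiplySketch;

{$mode objfpc}

uses
  Math;

const
  N = 1000;
  BLOCK = 64; // assumed tile size; tune so three tiles fit in cache

type
  TMatrix = array of double; // row-major, N*N elements

procedure MultiplyBlocked(const A, B: TMatrix; var C: TMatrix);
var
  ii, kk, jj, i, k, j: integer;
  aik: double;
begin
  for i := 0 to N * N - 1 do
    C[i] := 0.0;
  // process BLOCK x BLOCK tiles so the working set stays cache-resident
  ii := 0;
  while ii < N do
  begin
    kk := 0;
    while kk < N do
    begin
      jj := 0;
      while jj < N do
      begin
        for i := ii to Min(ii + BLOCK, N) - 1 do
          for k := kk to Min(kk + BLOCK, N) - 1 do
          begin
            aik := A[i * N + k];
            // i-k-j order keeps the inner loop on consecutive elements
            for j := jj to Min(jj + BLOCK, N) - 1 do
              C[i * N + j] := C[i * N + j] + aik * B[k * N + j];
          end;
        Inc(jj, BLOCK);
      end;
      Inc(kk, BLOCK);
    end;
    Inc(ii, BLOCK);
  end;
end;

var
  A, B, C: TMatrix;
  i: integer;
begin
  SetLength(A, N * N);
  SetLength(B, N * N);
  SetLength(C, N * N);
  Randomize;
  for i := 0 to N * N - 1 do
  begin
    A[i] := Random;
    B[i] := Random;
  end;
  MultiplyBlocked(A, B, C);
end.
```

Combined with row-range threading of the outer ii loop, this is essentially what optimized BLAS kernels do, minus the SIMD.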
« Last Edit: February 20, 2025, 09:45:16 pm by LV »

photor

  • Jr. Member
  • **
  • Posts: 80
Re: efficiency problem
« Reply #111 on: February 21, 2025, 08:30:08 am »
If we are not focused on educational goals while multiplying two matrices, it is advisable to use libraries instead. For example, I just spent five minutes using the AlgLib Free Edition library, which has low-level optimizations inside. To do this, I placed the wrapper file xalglib.pas and the DLL alglib402_64free.dll in the program folder, then compiled and ran the program.

But where can I get the AlgLib Free Edition library?
« Last Edit: February 23, 2025, 11:58:11 am by photor »

ALLIGATOR

  • Full Member
  • ***
  • Posts: 155
Re: efficiency problem
« Reply #112 on: February 21, 2025, 09:31:51 am »
You can also try these options:

https://github.com/mikerabat/mrmath
https://github.com/clairvoyant/cblas

photor

  • Jr. Member
  • **
  • Posts: 80
Re: efficiency problem
« Reply #113 on: February 21, 2025, 04:15:06 pm »
You can also try these options:

https://github.com/mikerabat/mrmath
https://github.com/clairvoyant/cblas

Have you tried those packages? I tried to compile cblas according to the instructions, but got the error 'Could not find unit directory for dependency package "rtl" required for package "cblas"'.
mrmath doesn't even have installation instructions on its website.
« Last Edit: February 23, 2025, 11:57:46 am by photor »

TRon

  • Hero Member
  • *****
  • Posts: 4359
Re: efficiency problem
« Reply #114 on: February 21, 2025, 04:23:52 pm »
But where can I get the AlgLib Free Edition library?
lmddgtfy
Today is tomorrow's yesterday.

ALLIGATOR

  • Full Member
  • ***
  • Posts: 155
Re: efficiency problem
« Reply #115 on: February 21, 2025, 04:30:37 pm »
Have you tried those packages?

No, I just googled them, and they seemed like good candidates for a more thorough test. (Of course, it is to be expected that the source code may need to be brought up to working condition, since one of the libraries is a bit old and the other seems to be Delphi-only.)

LV

  • Sr. Member
  • ****
  • Posts: 266
Re: efficiency problem
« Reply #116 on: February 21, 2025, 06:03:41 pm »
But where can I get the AlgLib Free Edition library?

I returned from work and saw that I received help with the answer. Thanks, @TRon :)
@photor. You can download the zip file from https://www.alglib.net/download.php. Once downloaded, unzip it and copy both the wrapper and the DLL files to the program folder. That's all you need to do.
Also, @ALLIGATOR, you wrote some excellent code above. Would you mind if I attempted to parallelize it?

Code: Pascal
program MatrixMultiplication;
{$mode objfpc}
{$optimization on}
{$MODESWITCH CLASSICPROCVARS+}
{$LONGSTRINGS ON}

uses
  SysUtils,
  Classes,
  DateUtils,
  Math;

const
  N = 1000;
  unroll = 4;
  ThreadCount = 6;

type
  TMatrix = array of double;
  TVector = array of double;

var
  A, B, R1, R2, R3: TMatrix;
  StartTime: TDateTime;

  procedure InitializeMatrix(var M: TMatrix);
  var
    i, j: integer;
  begin
    for i := 0 to N - 1 do
      for j := 0 to N - 1 do
        M[i * N + j] := Random;
  end;

  function Multiply1(const mat1, mat2: TMatrix): TMatrix;
  var
    i, j, k: nativeuint;
    sum: double;
    v: TVector;
  begin
    SetLength(v, N);
    SetLength(Result, N * N);
    for j := 0 to N - 1 do
    begin
      for k := 0 to N - 1 do
        v[k] := mat2[k * N + j];
      for i := 0 to N - 1 do
      begin
        sum := 0;
        for k := 0 to N - 1 do
          sum := sum + mat1[i * N + k] * v[k];
        Result[i * N + j] := sum;
      end;
    end;
  end;

  procedure Multiply2Part(const mat1, mat2: TMatrix; var Result: TMatrix;
    start_i, end_i: nativeuint);
  var
    i, j, k: nativeuint;
    sum: double;
    sum1, sum2, sum3, sum4: double;
    v: TVector;
    pm, pv: PDouble;
  begin
    SetLength(v, N);
    for j := 0 to N - 1 do
    begin
      for k := 0 to N - 1 do
        v[k] := mat2[k * N + j];
      for i := start_i to end_i do
      begin
        sum := 0;
        sum1 := 0;
        sum2 := 0;
        sum3 := 0;
        sum4 := 0;
        k := 0;
        pm := @mat1[i * N];
        pv := @v[0];
        while k < (N - unroll + 1) do
        begin
          sum1 := sum1 + pm[k] * pv[k];
          sum2 := sum2 + pm[k + 1] * pv[k + 1];
          sum3 := sum3 + pm[k + 2] * pv[k + 2];
          sum4 := sum4 + pm[k + 3] * pv[k + 3];
          Inc(k, unroll);
        end;
        sum := sum1 + sum2 + sum3 + sum4;
        while k < N do
        begin
          sum := sum + pm[k] * pv[k];
          Inc(k);
        end;
        Result[i * N + j] := sum;
      end;
    end;
  end;

type
  TMultThread = class(TThread)
  private
    FStartI, FEndI: nativeuint;
    FMat1, FMat2: TMatrix;
    FResult: TMatrix;
  public
    constructor Create(StartI, EndI: nativeuint; const Mat1, Mat2: TMatrix;
      var ResultMat: TMatrix);
    procedure Execute; override;
  end;

  constructor TMultThread.Create(StartI, EndI: nativeuint;
    const Mat1, Mat2: TMatrix; var ResultMat: TMatrix);
  begin
    inherited Create(True);
    FStartI := StartI;
    FEndI := EndI;
    FMat1 := Mat1;
    FMat2 := Mat2;
    FResult := ResultMat;
    FreeOnTerminate := False;
  end;

  procedure TMultThread.Execute;
  begin
    Multiply2Part(FMat1, FMat2, FResult, FStartI, FEndI);
  end;

  function MultiplyThreaded(const mat1, mat2: TMatrix): TMatrix;
  var
    i: integer;
    start_i, end_i: nativeuint;
    rowsPerThread, remainder: nativeuint;
    Threads: array of TMultThread;
  begin
    SetLength(Result, N * N);
    rowsPerThread := N div ThreadCount;
    remainder := N mod ThreadCount;

    SetLength(Threads, ThreadCount);
    for i := 0 to ThreadCount - 1 do
    begin
      start_i := i * rowsPerThread;
      end_i := start_i + rowsPerThread - 1;
      if i = ThreadCount - 1 then
        end_i := end_i + remainder;
      Threads[i] := TMultThread.Create(start_i, end_i, mat1, mat2, Result);
    end;

    for i := 0 to ThreadCount - 1 do
      Threads[i].Start;

    for i := 0 to ThreadCount - 1 do
      Threads[i].WaitFor;

    for i := 0 to ThreadCount - 1 do
      Threads[i].Free;
  end;

  function IsSame(const mat1, mat2: TMatrix): boolean;
  var
    i, k: nativeuint;
  begin
    Result := True;
    for i := 0 to N - 1 do
      for k := 0 to N - 1 do
        if not SameValue(mat1[i * N + k], mat2[i * N + k]) then
          Exit(False);
  end;

begin
  SetLength(A, N * N);
  SetLength(B, N * N);

  Randomize;
  InitializeMatrix(A);
  InitializeMatrix(B);

  StartTime := Now;
  R1 := Multiply1(A, B);
  Writeln('Execution Time 1: ', MilliSecondsBetween(Now, StartTime), ' ms');

  StartTime := Now;
  R2 := MultiplyThreaded(A, B);
  Writeln('Execution Time 2 (Threaded): ', MilliSecondsBetween(Now, StartTime), ' ms');

  WriteLn('IsSame: ', IsSame(R1, R2));

  ReadLn;
end.

Desktop i7-8700. The optimization level is set to -O3.

Code: Text
Execution Time 1: 987 ms
Execution Time 2 (Threaded): 71 ms
IsSame: TRUE

I think it's a pretty good result. :)

photor

  • Jr. Member
  • **
  • Posts: 80
Re: efficiency problem
« Reply #117 on: February 23, 2025, 11:54:36 am »
But where can I get the AlgLib Free Edition library?

I returned from work and saw that I received help with the answer. Thanks, @TRon :)
@photor. You can download the zip file from https://www.alglib.net/download.php. Once downloaded, unzip it and copy both the wrapper and the DLL files to the program folder. That's all you need to do.
Also, @ALLIGATOR, you wrote some excellent code above. Would you mind if I attempted to parallelize it?

Code: Text
Execution Time 1: 987 ms
Execution Time 2 (Threaded): 71 ms
IsSame: TRUE

I think it's a pretty good result. :)

Good job! Is it possible to parallelize the previous version using the AlgLib Free Edition library?
« Last Edit: February 23, 2025, 11:57:31 am by photor »

LV

  • Sr. Member
  • ****
  • Posts: 266
Re: efficiency problem
« Reply #118 on: February 25, 2025, 07:48:54 pm »
Is it possible to parallelize the previous version using the AlgLib Free Edition library?

I attempted to parallelize the computation using the AlgLib Free Edition library. It compiles and runs in 50 to 60 ms, but the results fail the check: IsSame: FALSE. I suspect the library is not thread-safe.

Out of curiosity, my colleague ran the same task on MATLAB R2015b (64-bit) and reported an execution time of 90 to 110 ms.

The execution time of the program (Reply #116), which was written in pure Object Pascal, is between 70 and 80 ms.

I must admit that a couple of years ago, I was initially skeptical about using FPC and Lazarus. Now I am convinced that Free Pascal is quite suitable for this type of task.  ;)

P.S. All tests were conducted on a desktop with an Intel i7 8700 processor running Windows 11.

Updated: Matrix multiplication in NumPy takes about 10-15 ms. Does this library perhaps solve the problem on the GPU with cuBLAS (NVIDIA)?
« Last Edit: February 25, 2025, 08:54:09 pm by LV »

photor

  • Jr. Member
  • **
  • Posts: 80
Re: efficiency problem
« Reply #119 on: February 28, 2025, 02:14:02 pm »
Is it possible to parallelize the previous version using the AlgLib Free Edition library?

I attempted to parallelize the computation using the AlgLib Free Edition library. It compiles and runs in 50 to 60 ms, but the results fail the check: IsSame: FALSE. I suspect the library is not thread-safe.

Out of curiosity, my colleague ran the same task on MATLAB R2015b (64-bit) and reported an execution time of 90 to 110 ms.

The execution time of the program (Reply #116), which was written in pure Object Pascal, is between 70 and 80 ms.

I must admit that a couple of years ago, I was initially skeptical about using FPC and Lazarus. Now I am convinced that Free Pascal is quite suitable for this type of task.  ;)

P.S. All tests were conducted on a desktop with an Intel i7 8700 processor running Windows 11.

Updated: Matrix multiplication in NumPy takes about 10-15 ms. Does this library perhaps solve the problem on the GPU with cuBLAS (NVIDIA)?

Thanks for the exploration. On my system, an Intel i7-4700 processor running Windows 11, NumPy with OpenBLAS as the backend takes about 45 ms in the single-threaded case; with 4 threads, it takes about 15 ms. So, anyway, AlgLib's 108 ms with a single thread is not bad. But the multi-threaded version of AlgLib is not free, unfortunately.

 
