I did not do a thorough investigation, but if you want ultimate performance, it will be hard to compete against LAPACK/BLAS/ATLAS with your own program. To the best of my knowledge, memory access is the bottleneck when dealing with large amounts of data. Specifically, you need to take the memory cache into account, particularly when there is a lot of data and few operations per element. I once had an example where setting four arrays to zero was the bottleneck. When I tried to use two or more cores, one per array, the program became SLOWER, because the cores were swapping the arrays in and out of the cache.
This can become quite involved and CPU/cache specific. When compiling ATLAS (http://math-atlas.sourceforge.net), several methods are actually benchmarked and the best one is chosen. At least that is what they claim.
MiSchi.
Swapping data in and out of the cache is inevitable, with or without threads.
I do not have much experience with threads, and I'm inclined to think they may be a bit too complicated for me.
I do not see myself tackling a threaded design these days. I'll consider it for a "V3" of my library.
I had not heard of ATLAS. I'll keep that in mind.
Calling C and Fortran routines is not so difficult. You need to translate the function declarations into a Pascal header and link the library. I could help you with a simple example. With 2-dimensional arrays, you need to take into account that C has the same row/column layout as Pascal, whereas Fortran has it reversed. I would expect you have to decide up front whether you want a pure Pascal solution and do everything yourself, or go for ultimate performance by calling LAPACK/BLAS.
All above is only to the best of my knowledge and might be utterly wrong ;-)
I do not expect to match LAPACK, but at least something could be learned and used as a guide to know where to improve my algorithms. So I told myself that, if necessary, I'm not afraid to explore its source code and take notes.
If you say that it is possible to call LAPACK routines, I would be grateful if, in some free moment, you could give me some feedback on that. I am aware that in Fortran matrices are stored in column-major order.
As for my library, I'm preparing it to deal with "logical layouts" both by rows and by columns. This allows me to prepare variants of the algorithms that take advantage of the most favourable situations.
By logical layout I mean that, while physically in Pascal the arrays are row-major, at the logical level, when you create a matrix, you specify a layout for how the data will be organised. Thus, for example, if I want matrix A of dimension (10 x 30) to be column-major, positions 0..9 will correspond to the 1st column, indices 10..19 to the 2nd column, and so on.
Sorry, I cannot delve into your code right now. Just a few hints:
- You cannot compete against LAPACK or similar; it will be at least a factor of 3-5 faster than your best code.
I know; that is what I assumed. I cannot compete against people who have spent years perfecting that huge library.
As I said a few paragraphs before, at least I can hope to learn something.
- Array of array of float is very convenient but not too efficient, as the matrix may be spread across memory and the compiler has to calculate the memory location on each access. As you already noted, a linear structure in memory, i.e. a vector, is better.
- Don't get yourself into the trouble of manual memory allocation. Use dynamic arrays. You can still use pointer access to them if that is faster; just assign a pdouble to @Matrix[0]. Only if you repeatedly have to allocate and deallocate the structure should you keep in mind that a dynamic array is automatically initialised with zeroes when created, which is not the case for a structure generated with New(). The initialisation takes a bit of time, but in general this is negligible.
That proposal was my third attempt, which I called "mixed" since it combines the declaration of a dynamic array with access through pointers.
My tests showed that this way was a bit slower than my previous attempt, which was based on using pointers exclusively.
To be precise, the ranking was always the same, independent of the sizes I tried:
1st place: algorithm using SetLength() and dynamic arrays
2nd place: algorithm using AllocMem() and pointer access
3rd place: mixed algorithm: reserve a dynamic array with SetLength() and access it through pointers
I have no explanation for this.
- The major bottleneck with matrix multiplication, in my experience, is that one matrix can be read linearly, i.e. as it is arranged in memory, but the other cannot, which may give you extreme cache penalties. Always consider transposing that matrix before multiplication so it can be accessed linearly.
Edit: Splitting a task across several threads is not necessarily complicated; see the example for the Mandelbrot calculation here: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=mandelbrot&lang=fpascal&id=5
I am aware that the traditional algorithm suffers from cache penalties.
Using the transpose of matrix "B" is an alternative, although unless you plan from the start to store it that way, the transposition is a cost to take into account in terms of T(n) (even though it disappears in the O() expression, since for square matrices the multiplication itself is already cubic).
My view is that if you can keep T(n) from growing, so much the better. If you do opt for the transpose, there are also cache-oblivious algorithms, which I intend to include.
As I said before, my library is being prepared to accept logical layouts. I'm going to implement overloaded versions that work with the different layouts.
One way to deal with cache penalties while avoiding the transpose, assuming row-major order, is to recursively divide the matrices and multiply in blocks until you reach a reasonable size, then apply loop unrolling. The description of the algorithm can be read in the following paper:
http://supertech.csail.mit.edu/papers/Prokop99.pdf
My goal in this thread is to learn about and evaluate the best proposals and alternatives for allocating and accessing an array structure.
Testing the matrix multiplication algorithm in its traditional form, without improvements such as the ikj loop order instead of ijk, multiplication by blocks, or transpositions, is my way of evaluating how the different techniques behave.
The truth is that I did not expect the simplest approach, using SetLength() and traditional M[] indexing, to give me a better result than using pointers.
It puzzles me even more when I explore, for example, the mrMath code and see that it uses pointers (and yes, a good amount of ASM), while Alglib (in its free version) uses dynamic arrays only.
I expect this is a trend, and that the ranking I showed above will hold for other sizes, operations, and algorithms.
If it already takes first place without improvements, it should stay in front with them. At least that is my understanding.
Either I'm doing something wrong with pointers, or there is "compiler magic" behind dynamic arrays that Lazarus/FreePascal exploits very well.
Regards,