
Author Topic: efficiency problem  (Read 43291 times)

photor

  • New Member
  • *
  • Posts: 49
Re: efficiency problem
« Reply #75 on: April 08, 2015, 03:55:51 pm »
Result comparison on my 2.7GHz i7 cpu with 64bit Win 8.1:
OpenBLAS 64bit: 140~150ms
Mathematica: 70~80ms
Although Mathematica is still two times faster than OpenBLAS, they are comparable.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: efficiency problem
« Reply #76 on: April 08, 2015, 05:13:55 pm »
After I replaced "longint" with "int64", the 64-bit version works!

That's good news, thanks for reporting back!

Result comparison on my 2.7GHz i7 cpu with 64bit Win 8.1:
OpenBLAS 64bit: 140~150ms
Mathematica: 70~80ms
Although Mathematica is still two times faster than OpenBLAS, they are comparable.

I believe Mathematica uses ATLAS, and probably with threads. If you want to reach that speed, consider using threads and/or switching to ATLAS.

mnar53

  • Newbie
  • Posts: 3
Re: efficiency problem
« Reply #77 on: April 15, 2015, 08:11:43 pm »
About the "difficulty" of the 64-bit environment: not really. On x64 there is only ONE CALLING CONVENTION,
so you do not care whether a 64-bit DLL was written in FreePascal, C, or Fortran: it is all the same. And you can
forget about 'cdecl' or 'stdcall'.

By the way, this opens up many interesting possibilities: ANY 64-bit DLL can be accessed. For example,
CUDA or CUBLAS, if we are after numerical computations on the GPU.

DelphiLazus

  • New Member
  • *
  • Posts: 34
Re: efficiency problem
« Reply #78 on: April 26, 2017, 02:49:45 am »
Good Morning,
Firstly, I apologize for reviving this old thread, but I think what has been debated here is the closest to my doubts and related to my topic.
If it is necessary to start a new thread, I would be grateful if you would let me know.
Secondly, apologies for writing so much: I have tried to be as precise and detailed as possible, and I hope to be understood in spite of my lousy English.

I will explain my case.
I have implemented a "mini framework" dedicated to linear algebra operations and related topics. I have been using it for some time in my projects, and it is time to improve it for other projects that require a more robust design.
My "mini framework" is based on dynamic two-dimensional arrays (in the case of matrices), and now that I have become familiar with more advanced topics I am moving to a one-dimensional scheme; in other words, treating the matrix as a vector.
From what I understand, when one does SetLength(array, rows, cols), memory is reserved in which each row is internally an "independent" block, and this leads to requiring double pointers (pointers to pointers) to access a position (i, j) within the array.
With the one-dimensional scheme we have the advantage that, everything being contiguous, we can implement algorithms that are friendlier to the cache.
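A minimal sketch of the difference, with illustrative names (not from the attached project):

program LayoutSketch;
{$mode objfpc}
const
  rows = 4; cols = 5;
var
  A2: array of array of Double;  // SetLength(A2, rows, cols): each row its own heap block
  A1: array of Double;           // flat vector, row-major, one contiguous block
  i, j: Integer;
begin
  SetLength(A2, rows, cols);
  SetLength(A1, rows * cols);
  for i := 0 to rows - 1 do
    for j := 0 to cols - 1 do
    begin
      A2[i, j] := i + 0.1 * j;          // double indirection: fetch A2[i], then index j
      A1[i * cols + j] := i + 0.1 * j;  // single block, manual index arithmetic
    end;
  WriteLn(A2[1, 2]:0:1, ' = ', A1[1 * cols + 2]:0:1);  // both print 1.2
end.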

As far as my knowledge goes, it should be possible to improve the performance of my algorithms with the help of pointers and a bit of "pointer arithmetic magic". So instead of the classic SetLength(array, dim), I was reserving memory directly, along the lines of: ptrArray := AllocMem(rows * cols * SizeOf(Double)).

I did a couple of tests with the classic matrix multiplication to see how it behaves, and I got a couple of surprises. It turns out that my algorithms based on SetLength and dynamic arrays are noticeably faster than my implementations based exclusively on pointers! Could someone explain to me why that might be?
I then tried a "mixed" approach that combines both: 1) declare a dynamic array and 2) access it through pointers. Still, the results favor the SetLength/dynamic-array versions; the mixed approach turned out to be even slower than the pure pointer approach.

In turn, I tried two alternatives: traversing the structure old-school with 3 loops, as well as a variation with 2 loops. It turns out that the 3-loop version is twice as fast. For example, for a multiplication (600 x 1000) x (1000 x 800) I obtain average times of 5 seconds for the 3-loop case, versus 11 seconds for the 2-loop algorithms. My explanation, or rather hypothesis, is that with the 2-loop design I must calculate the position (i, j) within A and B, while the 3-loop version does not require this extra calculation.
My PC is an Intel i7-2670QM CPU 2.20GHz 64bit, Windows 8.1 64bit, 14GB RAM. I use CodeTyphon 5.9, which, if I am not mistaken, comes with FPC 3.1.1.
Is there nothing more I can do to improve my algorithms?

Searching on pointers, matrix multiplication, and optimization techniques, I came to this thread. Seeing the proposals has left me in a conceptual mess: the codes I have seen posted differ from mine, and I do not fully understand them.
As I see it, this thread has debated the multiplication of square matrices using fixed-size arrays, and one of the propositions even assumes an even dimension in order to halve the operations. My head tells me the proposals discussed can be generalized to an (m x n) x (n x p) multiplication. But I am not entirely sure whether the use of fixed or dynamic arrays has any implications or could affect the final results. My questions in this regard: why use a fixed size? Does it matter if I use dynamic arrays? Does pointer access change from one to the other?

For my part, I have started researching cache-oblivious algorithms, and it is my intention to go precisely in that direction. A block-based, cache-oblivious matrix multiplication (not necessarily for square matrices) can recursively divide the matrices along their larger dimension. My goal is to combine the cache-oblivious algorithm with the best access technique (pointers, or the traditional SetLength with dynamic arrays). But first I need to clear up whether I am doing reasonably well at working with matrices through pointers as I currently do, because if I go by the timing results, I should lean towards SetLength() with dynamic arrays and not complicate myself with pointers.

I have even tested with larger matrices (e.g., (30000 x 5000) x (5000 x 4000)), and still the algorithm with SetLength and a dynamic array is the winner. I had hypothesized that the larger the structures, the smaller the difference between the times would become, and that possibly there is a size at which the pointer-based version wins.

I know there are third-party libraries that do a good job. I have looked at part of the mrMath code, and the truth is that among the whole sea of function calls it is a mess to follow the trail. I have not fully assimilated how it works; it seems to use something similar to what I tried in my "mixed" design, but it ends up delegating to ASM instructions, and I am not familiar with ASM. I got stuck there.

I also know that there are LAPACK and BLAS, that today they are a standard, and that OpenBLAS has been proposed in this thread. Now, it is not clear to me how LAPACK, BLAS, or OpenBLAS is used from Lazarus and/or Delphi. Is it possible? How? I do not find anything in their documentation that gives me a definite YES. As far as I can see, LAPACK is written in Fortran. It would not be the first time I venture to port Fortran code to Object Pascal; it is a route I do not discard if it proves necessary to improve my developments. In the long run I hope to expand my library, and studying LAPACK could be productive: viewing the LAPACK code could help improve the code I have.

I have even used TPMath (I had to consult it to study Jacobi's algorithm), and now I am exploring Alglib.
For licensing reasons I am not entirely convinced about using third-party libraries, even when they are free. Alglib, for instance, is GPL; and although with TPMath and mrMath there would apparently be no problem using them in both proprietary and free projects, I am of the idea of not depending much on third-party libraries.

It is not my intention to get down to ASM (at least not in the medium term, perhaps sometime in the future); after all, I have no knowledge of it. But I do intend at least to improve my initial work, since in my new projects I expect to handle larger structures than I was accustomed to.

If someone could review the code that I attached and give me some direction, I would be very grateful.
The test application consists of 3 SpinEdits, a few buttons and a Memo for logging.
The SpinEdits define the m, n, and p dimensions of the matrices.
Button1 contains initialization test code using the pointer method.
Button2 clears the contents of the memo.
Button3: 3-loop multiplication algorithm using pointers.
Button4: 3-loop multiplication algorithm using dynamic arrays.
Button5: 3-loop multiplication algorithm using the mixed version.
Button6: test button that displays the size of a PDouble in a dialog box.
Button7: 2-loop multiplication algorithm using pointers.
Button8: 2-loop multiplication algorithm using dynamic arrays.
Button9: 2-loop multiplication algorithm using the mixed version.
Button10: no code.

Thank you very much for reading me and, again, apologies for my lousy English.

Regards,

Phil

  • Hero Member
  • *****
  • Posts: 2737
Re: efficiency problem
« Reply #79 on: April 26, 2017, 03:20:40 am »
If someone could review the code that I attached and give me some direction, I would be very grateful.

I haven't read the earlier posts in the thread and don't really have much interest (or time) to look at your code, but here are a couple of things to think about and try:

- You probably have at least 4 cores on your computer, so figure out how to use threads in your code. That means breaking it down so that a part of the computations can be run in a thread, independent of the other parts. A threaded app on a 4-core computer can realize up to a 4x speed improvement. That alone may be more than all of your ideas will ever realize. And then your code will also be able to scale up if the need arises. For example, an 8-core computer - no changes needed in the code to get even more speed improvement.

Note that this will mean separating your UI from your non-UI (computational) code.
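As a hedged sketch of that idea, assuming the flat row-major arrays discussed earlier in the thread (all identifiers here are illustrative; on Unix the cthreads unit must come first in the uses clause): each worker computes a disjoint band of rows of C, so the threads can share the arrays without any locking.

program ThreadedMul;
{$mode objfpc}
uses
  Classes;  // TThread; on Unix, put cthreads before Classes

var
  A, B, C: array of Double;  // A: m*n, B: n*p, C: m*p, all row-major
  m, n, p: Integer;

type
  TRowWorker = class(TThread)
  private
    FLo, FHi: Integer;       // inclusive range of rows of C to compute
  public
    constructor Create(Lo, Hi: Integer);
    procedure Execute; override;
  end;

constructor TRowWorker.Create(Lo, Hi: Integer);
begin
  FLo := Lo;
  FHi := Hi;
  inherited Create(False);   // start running immediately
end;

procedure TRowWorker.Execute;
var
  i, j, k: Integer;
  s: Double;
begin
  for i := FLo to FHi do     // this worker writes only rows FLo..FHi of C
    for j := 0 to p - 1 do
    begin
      s := 0.0;
      for k := 0 to n - 1 do
        s := s + A[i * n + k] * B[k * p + j];
      C[i * p + j] := s;
    end;
end;

procedure MultiplyThreaded(NumThreads: Integer);
var
  Workers: array of TRowWorker;
  t, lo, hi, band: Integer;
begin
  SetLength(Workers, NumThreads);
  band := (m + NumThreads - 1) div NumThreads;  // rows per worker, rounded up
  for t := 0 to NumThreads - 1 do
  begin
    lo := t * band;
    hi := lo + band - 1;
    if hi > m - 1 then hi := m - 1;
    Workers[t] := TRowWorker.Create(lo, hi);
  end;
  for t := 0 to NumThreads - 1 do
  begin
    Workers[t].WaitFor;      // block until this band is finished
    Workers[t].Free;
  end;
end;

var
  x: Integer;
begin
  m := 600; n := 1000; p := 800;
  SetLength(A, m * n); SetLength(B, n * p); SetLength(C, m * p);
  for x := 0 to High(A) do A[x] := 1.0;
  for x := 0 to High(B) do B[x] := 1.0;
  MultiplyThreaded(4);
  WriteLn(C[0]:0:1);         // expect n * 1.0 = 1000.0
end.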

- Try running your app with the -O2 optimization button selected instead of -O1 in Project Options. In my experience, -O2 can give a nice speed bump when a lot of array accesses are done - but only if range-checking is turned off.

You do have range checking turned off (Lazarus default), although I would turn it on during development and only turn it off for production use. I think of compiler-generated range checks as "assertions for free" since you don't have to do anything except flip on a switch to save a lot of headaches tracking down crashes.
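For reference, the same two settings can also be written as standard FPC source directives instead of IDE options:

{$OPTIMIZATION LEVEL2}  // the source form of the -O2 switch
{$RANGECHECKS ON}       // same as {$R+}: the "assertions for free"
// {$RANGECHECKS OFF}   // same as {$R-}: for the production build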


DelphiLazus

  • New Member
  • *
  • Posts: 34
Re: efficiency problem
« Reply #80 on: April 26, 2017, 03:40:49 am »
I did not expect such a quick response.
Thanks Phil for your advice.

I have to confess I had not considered the use of threads. I know there are third-party (and paid) libraries that use threads; I do not know how they do it.
I would have to research the subject, and I think it would add too much complexity, more than I can afford.
I can get a rough idea of how a matrix multiplication with threads could be designed: using block multiplication, each block is assigned to a thread, and at the end the results of all of them are copied into the resulting matrix. This technique would involve doubling the amount of memory.
I do not know if it is possible to make all threads share the same array.

But the point is that it is not just about multiplying matrices. My library contains many more things, and there may be algorithms for which this approach is not justified. For example, I think it is not possible for QR, but Jacobi's method for symmetric matrices does lend itself to it.
It would be a lot of work.

For now I will try these days changing the configuration as you suggest.

Regards,
« Last Edit: April 26, 2017, 03:58:47 am by DelphiLazus »

Phil

  • Hero Member
  • *****
  • Posts: 2737
Re: efficiency problem
« Reply #81 on: April 26, 2017, 04:00:01 am »
I do not know if it is possible to make all threads share the same array.

http://wiki.freepascal.org/Multithreaded_Application_Tutorial

I generally don't obsess too much over performance. A lot of optimizing of the pure computational parts of an app is lost once you go into production and actually have to work with large files, Internet connection, etc. A program that does a lot of computations often works with large data sets that have to be read in from somewhere - that's probably the biggest bottleneck right there, not the computation itself.

It's also probably cheaper to get more hardware than to try to coax the same speed improvement out of your code. For example, suppose we both write the same program, one multi-threaded but without obsessing over computational performance, the other single-threaded but with highly optimized computations. Then the boss says we need it to be 2x faster. The first program can achieve that for x, the cost of more hardware. The second program may never achieve it, regardless of how cheaply the programmer can work.

DelphiLazus

  • New Member
  • *
  • Posts: 34
Re: efficiency problem
« Reply #82 on: April 26, 2017, 04:39:37 am »
Thanks for the link.

I tend to worry about algorithms and performance. I try as much as possible to improve them and to use the data structures as well as possible.
The current status of my "mini framework" is not optimal. It has its practical utility, but it is time to give it a "little" push. I am about to embark on a project in which my matrices will not be small, and for that I need to polish it. They will not be 1 million x 1 million either, but of an appreciable size.

I could plan the work in a more scalable way and set milestones for myself:
1. Improve data structures and algorithms, where possible and allowed.
2. Add multi-thread support

I see step 2 as a long-term goal. First I would like to see how much can be improved without a change as radical as threads.

I understand your point of view, although my intention is to introduce improvements in steps better matched to my knowledge and my time.

If I can improve my design in some respect, I will already have fulfilled my partial objective.
I do not expect, for example, to get the computation for a 1000x1000 matrix to finish in 1 second. More than anything, I am trying to bring my algorithms up to something more decent.
For example, if with some tips I can reduce a calculation from 11s to 8s, that is already progress for me.

Regards,

mischi

  • Full Member
  • ***
  • Posts: 170
Re: efficiency problem
« Reply #83 on: April 26, 2017, 07:56:13 am »
I did not do a thorough investigation, but if you want ultimate performance, it will be hard to compete with Lapack/Blas/Atlas using your own program. To the best of my knowledge, memory access is the bottleneck when dealing with large amounts of data. Specifically, you need to take the memory cache into account, in particular with much data and few operations. I once had an example where setting 4 arrays to zero was the bottleneck. When trying to use 2 or more cores, one for each array, the program became SLOWER, because the cores were swapping the arrays in and out of the cache.
This can become quite involved and CPU/cache specific. When compiling Atlas (http://math-atlas.sourceforge.net), several methods are actually tested and the best one is taken. At least that is what they claim.

Calling C and Fortran routines is not so difficult: you need to translate the function calls into a Pascal header and link the library. I could help you with a simple example. With 2-dimensional arrays, you need to take into account that C has the same row/column layout as Pascal, whereas Fortran has it reversed. I would expect that you have to make a decision up front: whether you want a pure Pascal solution and do everything yourself, or go for ultimate performance by calling Lapack/Blas.

All above is only to the best of my knowledge and might be utterly wrong ;-)

MiSchi.

Nitorami

  • Sr. Member
  • ****
  • Posts: 481
Re: efficiency problem
« Reply #84 on: April 26, 2017, 10:22:18 am »
Sorry I cannot delve into your code now. Just a few hints:

- You cannot compete against LAPACK or similar; it will be at least a factor of 3-5 better than your best code

- Array of array of float is very convenient but not too efficient, as the matrix may be spread across memory and the compiler has to calculate the memory location on each access. As you already noted, a linear structure in memory, i.e. a vector, is better.

- Don't get yourself into the trouble of manual memory allocation. Use dynamic arrays. You can still use pointer access to them if that is better; just assign a PDouble to @Matrix[0] (see the sketch below). Only if you repeatedly have to allocate and deallocate the structure should you keep in mind that a dynamic array is automatically initialised with zeroes when created, which is not the case for a structure generated with New(). The initialisation takes a bit of time, but in general this is negligible.

- The major bottleneck with matrix multiplication, in my experience, is that one matrix can be read linearly, i.e. as it is arranged in memory, but the other cannot, which may give you extreme cache penalties. Always consider transposing that matrix before multiplication so it can be accessed linearly; the sketch below combines this with pointer access.
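As a hedged sketch of those last two hints combined (illustrative names, assuming FPC, where Inc on a typed pointer advances by the element size): take a PDouble into the dynamic arrays, transpose B once, then multiply with purely linear accesses on both operands.

procedure MultiplyTransposed(const A, B: array of Double;
                             var C: array of Double; m, n, p: Integer);
var
  Bt: array of Double;       // transpose of B, stored p x n row-major
  pa, pb: PDouble;
  i, j, k: Integer;
  s: Double;
begin
  SetLength(Bt, p * n);
  for k := 0 to n - 1 do     // one-time transposition, O(n*p)
    for j := 0 to p - 1 do
      Bt[j * n + k] := B[k * p + j];
  for i := 0 to m - 1 do
    for j := 0 to p - 1 do
    begin
      pa := @A[i * n];       // row i of A, contiguous in memory
      pb := @Bt[j * n];      // row j of Bt = column j of B, now contiguous
      s := 0.0;
      for k := 0 to n - 1 do
      begin
        s := s + pa^ * pb^;
        Inc(pa);             // step to the next Double
        Inc(pb);
      end;
      C[i * p + j] := s;
    end;
end;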

Edit: Splitting a task into several threads is not necessarily complicated; see the example for the Mandelbrot calculation here: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=mandelbrot&lang=fpascal&id=5

« Last Edit: April 26, 2017, 01:28:00 pm by Nitorami »

DelphiLazus

  • New Member
  • *
  • Posts: 34
Re: efficiency problem
« Reply #85 on: April 26, 2017, 08:01:03 pm »
I did not do a thorough investigation, but if you want ultimate performance, it will be hard to compete with Lapack/Blas/Atlas using your own program. To the best of my knowledge, memory access is the bottleneck when dealing with large amounts of data. Specifically, you need to take the memory cache into account, in particular with much data and few operations. I once had an example where setting 4 arrays to zero was the bottleneck. When trying to use 2 or more cores, one for each array, the program became SLOWER, because the cores were swapping the arrays in and out of the cache.
This can become quite involved and CPU/cache specific. When compiling Atlas (http://math-atlas.sourceforge.net), several methods are actually tested and the best one is taken. At least that is what they claim.
MiSchi.
Swapping memory in and out of the cache is inevitable, with or without threads.
I do not have much experience with threads, and I am inclined to think it may be a little too complicated for me.
I do not see myself adopting a threaded design these days; I will consider it for a "V3" of my library.

I had not heard of ATLAS. I will keep it in mind.


Calling C and Fortran routines is not so difficult: you need to translate the function calls into a Pascal header and link the library. I could help you with a simple example. With 2-dimensional arrays, you need to take into account that C has the same row/column layout as Pascal, whereas Fortran has it reversed. I would expect that you have to make a decision up front: whether you want a pure Pascal solution and do everything yourself, or go for ultimate performance by calling Lapack/Blas.

All above is only to the best of my knowledge and might be utterly wrong ;-)
I do not expect to match LAPACK, but at least something could be learned from it and used as a guide to where to improve my algorithms. So I told myself that, if necessary, I am not afraid to explore the source code and take notes.
If you say it is possible to call LAPACK routines, I would be grateful if in some free moment you could give me some guidance. I am aware that in Fortran matrices are in column-major order.

As for my library, I am preparing it to deal with "logical layouts" both by rows and by columns. This allows me to prepare variants of the algorithms that take advantage of the best situations.
By logical layout I mean that, while physically in Pascal the arrays are row-major, at the logical level, when the matrices are created, you choose a layout for how the data will be organized. Thus, for example, if I want matrix A of dimension (10 x 30) to be column-major, positions 0..9 will correspond to the 1st column, indices 10..19 to the 2nd column, and so on.
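As a minimal illustration of that mapping (the function names are mine), the physical storage is always one flat array and only the index formula changes:

function IdxRowMajor(i, j, cols: Integer): Integer;
begin
  Result := i * cols + j;   // consecutive elements walk along a row
end;

function IdxColMajor(i, j, rows: Integer): Integer;
begin
  Result := j * rows + i;   // consecutive elements walk down a column (Fortran style)
end;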

Sorry I cannot delve into your code now. Just a few hints:

- You cannot compete against LAPACK or similar; it will be at least a factor of 3-5 better than your best code
I know; that is what I assumed. I cannot compete against geniuses who have spent years perfecting that huge library.
As I said a few paragraphs above, at least I can hope to learn something.

- Array of array of float is very convenient but not too efficient, as the matrix may be spread across memory and the compiler has to calculate the memory location on each access. As you already noted, a linear structure in memory, i.e. a vector, is better.

- Don't get yourself into the trouble of manual memory allocation. Use dynamic arrays. You can still use pointer access to them if that is better; just assign a PDouble to @Matrix[0]. Only if you repeatedly have to allocate and deallocate the structure should you keep in mind that a dynamic array is automatically initialised with zeroes when created, which is not the case for a structure generated with New(). The initialisation takes a bit of time, but in general this is negligible.
That proposal was my third attempt, which I called "mixed", since it combines the declaration of a dynamic array with access through pointers.
My tests showed it to be a bit slower this way than my previous attempt, which is based exclusively on pointers.
To be precise, the ranking, regardless of the sizes I have tried, was always this:

1st place: algorithm using SetLength() and dynamic arrays
2nd place: algorithm using AllocMem() and pointers
3rd place: mixed algorithm: reserve a dynamic array with SetLength() and access it through pointers

I have no explanation for this.

- The major bottleneck with matrix multiplication, in my experience, is that one matrix can be read linearly, i.e. as it is arranged in memory, but the other cannot, which may give you extreme cache penalties. Always consider transposing that matrix before multiplication so it can be accessed linearly.

Edit: Splitting a task into several threads is not necessarily complicated; see the example for the Mandelbrot calculation here: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=mandelbrot&lang=fpascal&id=5
I am aware that the traditional algorithm suffers from cache penalties.
Using the transpose of matrix "B" is an alternative, although unless it was planned from the start to store it that way, the transposition is an extra cost in the actual running time T(n), even if the asymptotic O(n^3) bound (for square matrices) does not change.
I am of the idea that if you can keep T(n) from growing, so much the better. If one prefers not to pay for the transpose, there are also cache-oblivious algorithms, which I intend to include.

As I said before, I am preparing my library to accept logical layouts. I am going to implement overloaded versions that work with different layouts.

One way to deal with the cache penalties while avoiding the transpose, assuming row-major order, is to recursively divide the matrices and multiply in blocks until a reasonable size is reached, then apply loop unrolling. The algorithm is described in the following paper:
http://supertech.csail.mit.edu/papers/Prokop99.pdf
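A hedged sketch of that recursion, assuming flat row-major storage with explicit row strides (sa, sb, sc) and {$POINTERMATH ON} for the typed-pointer arithmetic; the names and the cut-off value are illustrative, and the base-case kernel must accumulate into C because the n-split passes over the same output twice:

{$POINTERMATH ON}
const
  CutOff = 64;  // base-case dimension; worth tuning per CPU

procedure MultiplyBlock(A, B, C: PDouble; m, n, p, sa, sb, sc: Integer);
var
  i, j, k: Integer;
begin
  for i := 0 to m - 1 do        // plain triple loop: C += A*B
    for j := 0 to p - 1 do
      for k := 0 to n - 1 do
        C[i * sc + j] := C[i * sc + j] + A[i * sa + k] * B[k * sb + j];
end;

procedure RecMul(A, B, C: PDouble; m, n, p, sa, sb, sc: Integer);
begin
  if (m <= CutOff) and (n <= CutOff) and (p <= CutOff) then
    MultiplyBlock(A, B, C, m, n, p, sa, sb, sc)
  else if (m >= n) and (m >= p) then
  begin                          // m largest: split A and C horizontally
    RecMul(A, B, C, m div 2, n, p, sa, sb, sc);
    RecMul(A + (m div 2) * sa, B, C + (m div 2) * sc,
           m - m div 2, n, p, sa, sb, sc);
  end
  else if (p >= m) and (p >= n) then
  begin                          // p largest: split B and C vertically
    RecMul(A, B, C, m, n, p div 2, sa, sb, sc);
    RecMul(A, B + p div 2, C + p div 2, m, n, p - p div 2, sa, sb, sc);
  end
  else
  begin                          // n largest: two passes accumulating into C
    RecMul(A, B, C, m, n div 2, p, sa, sb, sc);
    RecMul(A + n div 2, B + (n div 2) * sb, C, m, n - n div 2, p, sa, sb, sc);
  end;
end;

It would be called with C zeroed first, e.g. RecMul(@A[0], @B[0], @C[0], m, n, p, n, p, p) for tightly packed matrices.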

My goal in this thread is to learn about the best proposals and alternatives for reserving and accessing a matrix structure.
Testing with the traditional matrix multiplication algorithm, without improvements such as the ikj loop order instead of ijk, block multiplication, or transpositions, is my way of evaluating how the different techniques behave.

The truth is that I did not expect the simplest approach, SetLength() and the traditional M[i,j] indexing, to give me a better result than using pointers.
It surprises me even more when I explore, for example, the mrMath code and see that it uses pointers (and yes, a good amount of ASM), or Alglib, which in its free version uses dynamic arrays only.

I expect this to be a trend, and that the ranking I gave above will hold for other situations, operations, and algorithms: if the dynamic-array version is number 1 without improvements, it will stay in front with improvements. At least that is how I understand it.
Either I am doing something wrong with pointers, or there is "compiler magic" behind the use of dynamic arrays that Lazarus/FreePascal exploits very well.

Regards,

mischi

  • Full Member
  • ***
  • Posts: 170
Re: efficiency problem
« Reply #86 on: April 26, 2017, 08:32:39 pm »
If you can come up with a simple example which uses Lapack/Blas routines, like a product of a vector and a matrix or similar, I could try to help out.

DelphiLazus

  • New Member
  • *
  • Posts: 34
Re: efficiency problem
« Reply #87 on: April 27, 2017, 03:43:15 am »
If you can come up with a simple example which uses Lapack/Blas routines, like a product of a vector and a matrix or similar, I could try to help out.
I think I did not explain myself very well.
I have never used or invoked any Fortran function from Delphi or Lazarus; I do not know how to do it.
What I did a couple of times was read loose Fortran code and port it to Object Pascal.

So I had commented that, if it really is possible to use LAPACK from Delphi and/or Lazarus, perhaps you could guide me through the process in some free moment. The test algorithms I can write myself, to check how it all works for me.

After a couple of hours of searching I found a site, in Russian, that seems reliable and explains how to install LAPACK and use it from Delphi. I would think it is possible to do the same with Lazarus in a similar way.
http://lib.dystlab.com/index.php/content/it/50-how-to-use-blas-lapack-in-delphi

I will be able to try this in 2 days; for work reasons I must be away and will not be online so often.

Regards,

mischi

  • Full Member
  • ***
  • Posts: 170
Re: efficiency problem
« Reply #88 on: April 27, 2017, 09:33:50 am »
I do not know Russian, but the examples look OK; at least the overall direction is right. I do not know Windows very well, and the installation of Blas/Lapack on Windows requires MinGW and is quite elaborate. Unfortunately, I will not really be able to help you much on this. Linux and macOS are easier on this point, because they come with Blas/Lapack.

The example loads the library dynamically at run time. Another option is to link it directly. It does not make a big difference.

The example also contains the translation of a function header to Pascal:

TBlasDGEMM = function(TRANSA, TRANSB: PChar; M, N, K: PInteger;
                      ALPHA, A: PDouble; LDA: PInteger; B: PDouble;
                      LDB: PInteger; BETA, C: PDouble;
                      LDC: PInteger): Integer; cdecl;

You need such declarations for all the functions you are using, and I can help you with the translation.
As you can see, all parameters are pointers: in Fortran, parameters are always passed by reference, whereas C passes by value. Pascal can do both; for passing by reference, you pass a pointer or use the keyword var.

I have never used BLAS/LAPACK extensively, but I have used about 20 other libraries, including ffmpeg, which is more complex.

For a start, I suggest constructing a minimal console example to test whether installation and building are successful. I can test it on macOS and pass it to you as a reference.
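Something like this hedged sketch, based on the declaration above, might serve as that first console test. The DLL name and the exported symbol ('dgemm_' versus 'dgemm') depend on how the library was built and are assumptions to verify:

program TestDgemm;
{$mode objfpc}
uses
  DynLibs;

type
  TBlasDGEMM = function(TRANSA, TRANSB: PChar; M, N, K: PInteger;
                        ALPHA, A: PDouble; LDA: PInteger; B: PDouble;
                        LDB: PInteger; BETA, C: PDouble;
                        LDC: PInteger): Integer; cdecl;

var
  Lib: TLibHandle;
  dgemm: TBlasDGEMM;
  M, N, K, LDA, LDB, LDC: Integer;
  Alpha, Beta: Double;
  A, B, C: array of Double;                    // column-major, as Fortran expects
begin
  Lib := LoadLibrary('liblapack.dll');         // library name is an assumption
  if Lib = NilHandle then Halt(1);
  Pointer(dgemm) := GetProcAddress(Lib, 'dgemm_');  // symbol may also be 'dgemm'
  if not Assigned(dgemm) then Halt(1);

  M := 2; N := 2; K := 2; LDA := 2; LDB := 2; LDC := 2;
  SetLength(A, M * K); SetLength(B, K * N); SetLength(C, M * N);
  A[0] := 1; A[1] := 3; A[2] := 2; A[3] := 4;  // A = [[1,2],[3,4]], column-major
  B[0] := 1; B[1] := 0; B[2] := 0; B[3] := 1;  // B = identity
  Alpha := 1.0; Beta := 0.0;

  // C := Alpha*A*B + Beta*C; every argument passed by reference, Fortran-style
  dgemm('N', 'N', @M, @N, @K, @Alpha, @A[0], @LDA,
        @B[0], @LDB, @Beta, @C[0], @LDC);

  WriteLn(C[0]:0:1, ' ', C[2]:0:1);            // expect: 1.0 2.0
  WriteLn(C[1]:0:1, ' ', C[3]:0:1);            // expect: 3.0 4.0
  UnloadLibrary(Lib);
end.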

MiSchi.

 
