Author Topic: FPC for high-performance computing  (Read 6887 times)

gues1

  • Jr. Member
  • **
  • Posts: 80
Re: FPC for high-performance computing
« Reply #60 on: June 03, 2025, 10:39:32 pm »
Quote

.....
Additionally, I added a suggestion from @gues1 (thanks).
Code: Pascal
  SetProcessAffinityMask(GetCurrentProcess, $FFFF);

That API limits which logical CPUs (the "threads" of a Hyper-Threading system) the application may use: each bit maps to one logical CPU, and the application will run only on those whose bit is set to one.
In your case you have 12 threads (Power), so the function should be called with:
Code: Pascal
  SetProcessAffinityMask(GetCurrentProcess, $FFF);

Of course, you can try whether this works better.
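
For illustration, a minimal sketch of building such a mask from a thread count (Windows only; MaskForThreads is my hypothetical helper, not part of the API):
Code: Pascal
  uses
    Windows;

  // Build a mask with the lowest ThreadCount bits set (valid for 1..63),
  // e.g. 12 threads -> $FFF.
  function MaskForThreads(ThreadCount: Integer): DWORD_PTR;
  begin
    Result := (DWORD_PTR(1) shl ThreadCount) - 1;
  end;

  begin
    // Restrict this process to the first 12 logical CPUs.
    SetProcessAffinityMask(GetCurrentProcess, MaskForThreads(12));
  end.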

Lenny33

  • New Member
  • *
  • Posts: 49
Re: FPC for high-performance computing
« Reply #61 on: June 04, 2025, 12:05:06 am »
Quote

I have been doing machine vision for nearly 19 years now. All of it computationally heavy, but a few assembler routines here and there and some manual tuning are enough.

Maybe I would need less manual tuning if I did large parts in C++, but for me the tuning is less trouble than a multi-language project. Yes, I already have C++ wrapper DLLs, but those are static, while otherwise both codebases would be live.
It's your choice. All I can say is I respect any choice.
But in our case, first of all, we all graduated from universities, and universities mostly teach C/C++/Java, not Pascal. Pascal is familiar only to some of us, from school years and from the 90s-2000s Delphi database front ends (mostly people 45+).
And secondly, we have people with a mathematical education, and they also prefer C++ for computational algorithms.
And thirdly, as I have already written, in our tests the same signal-filtering algorithms and other calculations over large data sets run approximately twice as fast when built with modern C++ compilers as with Delphi/FPC. This is critical for us. In our case it is senseless to hand-optimize each algorithm for each system, and even more senseless to write assembly language for a particular processor, especially since we have convinced ourselves in practice that the C++ code is faster anyway.
But our old interface was in Delphi, so it is much easier and more convenient to translate it to Lazarus than to C++/Qt, since we don't need any modern decorations in the interface yet.
« Last Edit: June 04, 2025, 12:39:50 am by Lenny33 »

gues1

  • Jr. Member
  • **
  • Posts: 80
Re: FPC for high-performance computing
« Reply #62 on: June 04, 2025, 09:38:51 am »
Hi,
I believe that speed is not everything in the choices related to the development of an algorithm or application.
There are several arguments that lead to the choice of a language or a mix of languages, and even if the choice is not the best for speed, it may be better for other characteristics.
If speed were the only criterion, we would all program in assembler (or rather directly in machine language).
Using pre-built, optimized C/C++ libraries (and I refer to computer vision, a field where I have also been working for over twenty years) is certainly a MUST given the field of application; but the application itself, which must orchestrate everything and therefore have comparable speed, I always develop in Pascal.
And the reasons are simple: it is the language that is most easily maintained and extended, whose compiler creates a monolithic executable without dragging in tons of frameworks, and where normally, if it compiles (and you have programmed according to the basic rules), you can be practically certain that the application is OK.
Also, when developing algorithms it is by no means certain that the Pascal compiler generates executables MUCH slower than a C++ one. Some time ago I tested a Chinese symmetric encryption algorithm (SM4), comparing the C++ application with its PURE PASCAL port: the difference in performance was less than 10%.
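
For reference, a minimal sketch of the kind of timing harness such comparisons rely on (my illustration, not the original test):
Code: Pascal
  uses
    SysUtils;

  var
    t0: QWord;
    i: Integer;
  begin
    t0 := GetTickCount64;
    for i := 1 to 1000 do
    begin
      // call the SM4 (or any other) routine under test here
    end;
    WriteLn('Elapsed: ', GetTickCount64 - t0, ' ms');
  end.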
Maybe for a server application with thousands of accesses per second this difference is fundamental, but for a client application that needs such computations only "every now and then", the Pascal implementation can be more than fine. And it is in Pascal!!!!

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12314
  • FPC developer.
Re: FPC for high-performance computing
« Reply #63 on: June 04, 2025, 10:57:00 am »
Note that the assembler I use is not really there to micro-optimise a piece of Pascal code; it is nearly always to invoke SSE2/AVX for whole-image operations.

We nearly always use 8-bit, and I found that my own (fairly pedestrian) routines usually outperform public packages such as OpenCV, simply because they are tailor-made for the purpose and resolution.

In the cases where I do use more complex routines, I often look at the simdlib repo, which is a mix of templates and SIMD intrinsics. I would like to have that in Pascal, but that has nothing to do with the language.
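
To make concrete what such a whole-image primitive looks like, a minimal scalar sketch (my illustration, not marcov's SSE2/AVX code): add a constant to every pixel of an 8-bit image, saturating at 255.
Code: Pascal
  // Scalar reference; an SSE2 version would handle 16 pixels per PADDUSB.
  procedure AddSaturate(var Pixels: array of Byte; Delta: Byte);
  var
    i, v: Integer;
  begin
    for i := 0 to High(Pixels) do
    begin
      v := Pixels[i] + Delta;
      if v > 255 then
        v := 255;
      Pixels[i] := Byte(v);
    end;
  end;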

gues1

  • Jr. Member
  • **
  • Posts: 80
Re: FPC for high-performance computing
« Reply #64 on: June 04, 2025, 11:21:01 am »
Quote

Note that the assembler I use is not really there to micro-optimise a piece of Pascal code; it is nearly always to invoke SSE2/AVX for whole-image operations.

We nearly always use 8-bit, and I found that my own (fairly pedestrian) routines usually outperform public packages such as OpenCV, simply because they are tailor-made for the purpose and resolution.

In the cases where I do use more complex routines, I often look at the simdlib repo, which is a mix of templates and SIMD intrinsics. I would like to have that in Pascal, but that has nothing to do with the language.
I use Halcon's C libraries and I found them exhaustive and fast: up to this point I have not found faster libraries with optimized functions, even though I also use systems from its competitors (including OpenCV).
I have also optimized several functions in assembler (within Pascal functions), just for a personal-use context.
The use of features such as TThread and affinity masks (together with thread priorities) allows even more advanced optimization, especially with a view to using resources to the maximum (see AVX and cores/threads).
And with Pascal this is really simple and elementary, as well as easily extensible and maintainable.
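
A minimal sketch of what is meant (my illustration; TPinnedWorker is a hypothetical class, and the affinity call is the Windows API):
Code: Pascal
  uses
    Windows, Classes;

  type
    // Hypothetical worker thread pinned to one logical CPU.
    TPinnedWorker = class(TThread)
    private
      FCpu: Integer;
    protected
      procedure Execute; override;
    public
      constructor Create(ACpu: Integer);
    end;

  constructor TPinnedWorker.Create(ACpu: Integer);
  begin
    FCpu := ACpu;
    inherited Create(True);  // create suspended
    Priority := tpHigher;    // raise priority for time-critical work
    Start;
  end;

  procedure TPinnedWorker.Execute;
  begin
    // Pin this thread to a single logical CPU before the heavy work.
    SetThreadAffinityMask(GetCurrentThread, DWORD_PTR(1) shl FCpu);
    // ... per-camera processing loop goes here ...
  end;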

jwdietrich

  • Hero Member
  • *****
  • Posts: 1260
    • formatio reticularis
Re: FPC for high-performance computing
« Reply #65 on: June 15, 2025, 09:18:34 pm »
A recent video on YouTube demonstrates with a specific example that Pascal generates both fast and efficient code:

https://youtu.be/1f-rUh-hUMs?si=15EUx-u2ROYpXx_S
function GetRandomNumber: integer; // xkcd.com
begin
  GetRandomNumber := 4; // chosen by fair dice roll. Guaranteed to be random.
end;

http://www.formatio-reticularis.de

Lazarus 4.0.0 | FPC 3.2.2 | PPC, Intel, ARM | macOS, Windows, Linux

Thaddy

  • Hero Member
  • *****
  • Posts: 17414
  • Ceterum censeo Trumpum esse delendum (Tnx Charlie)
Re: FPC for high-performance computing
« Reply #66 on: June 16, 2025, 10:45:35 am »
Btw: it is plain wrong to compare a high level compiled language with assembler optimizations (including intrinsics, which are the same thing, inlined).

There are "rumours" that intrinsics support is being worked on for FPC, but I am not aware of the status
(and likely CPU specific, which makes it very difficult).
« Last Edit: June 16, 2025, 10:51:48 am by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12314
  • FPC developer.
Re: FPC for high-performance computing
« Reply #67 on: June 16, 2025, 11:45:25 am »
Quote

I use Halcon's C libraries

We started out with Euresys (which, back then, came from the same dealer that tried to sell us Halcon). We had a budget side business for which we developed our own optimized blob analysis, which we later extended for endless products (we did a lot of linescan work). Back then our own routines were faster than Euresys's, but also more specialized. Euresys did make some use of MMX and SSE1 at the time, though, which we didn't.

We did some measurement prototyping with external libraries in the decade that followed (first Euresys, later OpenCV; we used KLT tracking for a while), and reused or reimplemented some routines from the SIMDLIB project on SourceForge (I assume it is on GitHub/GitLab by now). Mostly it allowed us to test whether primitives would work before rolling our own.

The last 7-8 years we didn't do that much speed-dependent work, except for some proof-of-concept stuff for tenders. I did write some whole-image processing routines (like simple erosion/dilation kernel operations and some colour-related stuff), because I wanted to simulate a project at a greater speed than the real application runs at. In general we try to do only one pass over an image using some primitive, and build the rest of the measurement on top of the found primitives, without going back to the image.
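
For readers unfamiliar with such kernel operations, a naive scalar sketch of 3x3 erosion (my illustration; the real routines would be one-pass SIMD versions):
Code: Pascal
  // Each output pixel becomes the minimum of its 3x3 neighbourhood
  // (grayscale erosion); the one-pixel border is left untouched for brevity.
  procedure Erode3x3(const Src: array of Byte; var Dst: array of Byte;
    W, H: Integer);
  var
    x, y, dx, dy: Integer;
    m: Byte;
  begin
    for y := 1 to H - 2 do
      for x := 1 to W - 2 do
      begin
        m := 255;
        for dy := -1 to 1 do
          for dx := -1 to 1 do
            if Src[(y + dy) * W + (x + dx)] < m then
              m := Src[(y + dy) * W + (x + dx)];
        Dst[y * W + x] := m;
      end;
  end;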

Currently I'm again working on a quite high-speed tender/project, so I'm dusting off old frameworks and tests. But the measurements are relatively simple (blobs and edges), so I don't expect much use for OpenCV c.s. Threading is relatively simple (a thread per camera), but with 25 cameras on one system, a dozen of them at 300 fps (actually at 1000 fps, but in 0.1 s bursts 3 times per second), that is still enough to keep most cores busy.

That assembler doesn't work in FPC, because it doesn't (yet) implement the Delphi x86_64 stack-frame directives (.SAVENV/.PUSHNV) that make rolling quick SSE2 routines easy.

The pièce de résistance was Nils Haeck's FFT converted to AVX2, but unfortunately, while every element was 5-10 times faster, the whole of it was not (probably it is only faster when running from the uop cache).

Another pet project was rotating 8x8 images (since the mounting orientation of the camera might not be the same as that of the human looking at the picture):

https://www.stack.nl/~marcov/rot8x8.txt
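
As a reference for what that text is about, a naive rotation of a single 8x8 tile (my sketch; the linked file deals with the fast SIMD approach):
Code: Pascal
  type
    TTile8x8 = array[0..7, 0..7] of Byte;

  // Rotate one 8x8 tile 90 degrees clockwise: row y becomes column 7-y.
  procedure Rotate90CW(const Src: TTile8x8; var Dst: TTile8x8);
  var
    y, x: Integer;
  begin
    for y := 0 to 7 do
      for x := 0 to 7 do
        Dst[x, 7 - y] := Src[y, x];
  end;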

Quote

and I found them exhaustive and fast: up to this point I have not found faster libraries with optimized functions, even though I also use systems from its competitors (including OpenCV).
I have also optimized several functions in assembler (within Pascal functions), just for a personal-use context.
The use of features such as TThread and affinity masks (together with thread priorities) allows even more advanced optimization, especially with a view to using resources to the maximum (see AVX and cores/threads).
And with Pascal this is really simple and elementary, as well as easily extensible and maintainable.


What kind of primitives/routines do you typically use from those libs?
« Last Edit: June 16, 2025, 12:02:18 pm by marcov »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12314
  • FPC developer.
Re: FPC for high-performance computing
« Reply #68 on: June 16, 2025, 11:49:00 am »
Quote

Btw: it is plain wrong to compare a high level compiled language with assembler optimizations (including intrinsics, which are the same thing, inlined).

Unless one can, and the other can't. Especially if you must make libraries with broad appeal (many formats etc.), stringing your assembler routines together using templates and intrinsics can be very productive.

Quote
There are "rumours" that intrinsics support is being worked on for FPC, but I am not aware of the status.
( and likely CPU specific, which makes it very difficult)

No rumours, IIRC Gareth looked into it, but afaik stopped.

gues1

  • Jr. Member
  • **
  • Posts: 80
Re: FPC for high-performance computing
« Reply #69 on: June 16, 2025, 02:50:59 pm »
Quote

We started out with .................
What kind of primitives/routines do you typically use from those libs?
The most used is definitely the transform on the colour planes, followed by convolution with the Gaussian derivative.
Then the whole basic series of primitives, such as segmentation with "rank" or "median" filters.
I also use thresholding often.
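
As a flavour of the first item, a minimal sketch of a colour-plane transform (my illustration, not Halcon's API): extracting one plane from interleaved 24-bit RGB data.
Code: Pascal
  // PlaneIndex: 0 = R, 1 = G, 2 = B; Plane must hold Width*Height bytes.
  procedure ExtractPlane(const RGB: array of Byte; var Plane: array of Byte;
    PlaneIndex: Integer);
  var
    i: Integer;
  begin
    for i := 0 to High(Plane) do
      Plane[i] := RGB[i * 3 + PlaneIndex];
  end;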
But it depends on the type of activity: one of the last jobs kept me busy with both classifiers (kNN and MLP) and deep learning (training with Halcon and runtime with YOLO implemented in Delphi/C++, not in Python).

I started with OpenCV in the Intel days (with Borland C++), then continued with it for a few more years. When OpenCV took a bad turn (weekly updates declared "stable" were released that were either not stable or broke existing code), I switched to Halcon, with a certain reluctance, because I knew I would be "tying my hands and feet" to them.
Instead, I have to say it couldn't have been a better experience: stable versions, compatible code, and detailed explanations of ALL the incompatibilities with previous versions.
By the way, the use of an OCX (in the early days of Halcon) allowed me to develop with a certain "detachment" from the code version and also from the environment.
Now that they only export C libraries (in addition to .NET), I wrote a wrapper, and let's say I have some "versioning" with $IFDEF, but really very little.
Normally I use i9 processors with two or three threads per camera (so far I have used a maximum of 16 cameras at the same time).
It was a bit hard to optimize my code (for performance) to keep up with their super-optimized C libraries built with the Intel compiler: each function dynamically chooses the technology to use based on the conditions at that particular moment; now it may use AVX, then SSE, and then maybe even MMX.
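
A minimal sketch of the same run-time dispatch idea on the application side (assumption: the cpu unit shipped with recent FPC releases on x86, which exposes feature flags such as AVXSupport and AVX2Support):
Code: Pascal
  uses
    cpu; // FPC x86 feature-detection unit (assumption: FPC 3.x on x86/x86-64)

  begin
    // Pick the best code path available on this machine at run time.
    if AVX2Support then
      WriteLn('using the AVX2 path')
    else if AVXSupport then
      WriteLn('using the AVX path')
    else
      WriteLn('using the generic path');
  end.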

... I finish by mentioning that I use Delphi for technological and practical reasons.

On this subject, and regarding the use of open-source development environments and equivalent libraries, I have to point out that for those who run a business like mine, the responsibility of using known and "certified" commercial products is an absolute condition.
No customer (at least mine, which are only industries or other large businesses) wants something developed with non-commercial products (and I had already tried with OpenCV, lowering not only the cost of the Halcon runtime licenses but also a further percentage).

Thaddy

  • Hero Member
  • *****
  • Posts: 17414
  • Ceterum censeo Trumpum esse delendum (Tnx Charlie)
Re: FPC for high-performance computing
« Reply #70 on: June 16, 2025, 03:14:33 pm »
OpenCV is just a bundle of intrinsics, just for many CPUs and GPUs.
There is already an FPC abstraction for that.
So there is little or no performance to gain.

What I propose is to focus on native support for 128-, 256- and 512-bit types.
Then we can also finally drop the dreaded 80-bit FPU code that seems to confuse so many people.
But that is no mean feat, although part of the 128-bit work is already done.
« Last Edit: June 16, 2025, 03:22:16 pm by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

MathMan

  • Sr. Member
  • ****
  • Posts: 411
Re: FPC for high-performance computing
« Reply #71 on: June 16, 2025, 10:28:26 pm »
Quote

...

1 - What I propose is to focus on native support for 128-, 256- and 512-bit types.
2 - Then we can also finally drop the dreaded 80-bit FPU code that seems to confuse so many people.
3 - But that is no mean feat, although part of the 128-bit work is already done.

Hello Thaddy,

Can you please explain in more detail how the three statements above are connected?

To me it looks like (1) is what marcov already stated above: development of it is currently set aside/suspended. But then I fail to see how (2) would follow from available support for (1). Or do you mean (1) as support for IEEE Std 754 (ideally in the 2019 revision) Fp128/Fp256/Fp512? At least then I can understand your statement (3).

Cheers,
MathMan

Thaddy

  • Hero Member
  • *****
  • Posts: 17414
  • Ceterum censeo Trumpum esse delendum (Tnx Charlie)
Re: FPC for high-performance computing
« Reply #72 on: June 17, 2025, 10:32:08 am »
What I really mean is that 80-bit floats are legacy, and 128-, 256- and 512-bit floats have no high-level support even though modern processors support those precisions.
It is bloody irritating that some people have the misconception that they need 80-bit floats, whereas modern processors have a much, much higher floating-point precision.
The whole "extended" issue would be gone with high-level support.
It does not need to be full native support; it might be just support for intrinsics.

(About the only valid reason I sometimes switch to clang or C++.)
« Last Edit: June 17, 2025, 10:46:12 am by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

MathMan

  • Sr. Member
  • ****
  • Posts: 411
Re: FPC for high-performance computing
« Reply #73 on: June 17, 2025, 12:02:13 pm »
Quote

What I really mean is that 80-bit floats are legacy, and 128-, 256- and 512-bit floats have no high-level support even though modern processors support those precisions.
It is bloody irritating that some people have the misconception that they need 80-bit floats, whereas modern processors have a much, much higher floating-point precision.
The whole "extended" issue would be gone with high-level support.
It does not need to be full native support; it might be just support for intrinsics.

(About the only valid reason I sometimes switch to clang or C++.)

Thanks for detailing things. Maybe I can help with your "bloody irritation", as it seems to stem from a misunderstanding. You seem to be mixing up the SIMD support of modern CPUs, e.g. SSE (128-bit), AVX (256-bit) and AVX-512 (512-bit) on x86-64, with direct hardware support for IEEE 754 Fp128 (or higher).

There are only very few CPU architectures that actually support Fp128 in hardware (see https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Hardware_support), and none at all support Fp256 or Fp512. So even if there were support for SIMD intrinsics in FPC, the demand for "extended floats"/Fp80 would not go away.
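
To make the distinction concrete, a minimal sketch (my illustration, not MathMan's): a 256-bit AVX register holds four independent Fp64 lanes, not one 256-bit float.
Code: Pascal
  var
    Lanes: array[0..3] of Double; // what one 256-bit AVX register holds: 4 x Fp64
  begin
    WriteLn('register width: ', SizeOf(Lanes) * 8, ' bits');  // 256
    WriteLn('lane precision: ', SizeOf(Double) * 8, ' bits'); // still only 64
  end.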

Thaddy

  • Hero Member
  • *****
  • Posts: 17414
  • Ceterum censeo Trumpum esse delendum (Tnx Charlie)
Re: FPC for high-performance computing
« Reply #74 on: June 17, 2025, 09:42:33 pm »
That is a good summary.
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

 
