Author Topic: FPC for high-performance computing  (Read 6359 times)

LV

  • Sr. Member
  • ****
  • Posts: 290
FPC for high-performance computing
« on: May 18, 2025, 12:54:05 am »
In another thread, one of the forum guests claimed that FPC is very slow and unsuitable for high-performance computing. Many arguments were given, ranging from the Big Bang, the campaigns of Alexander the Great, and the theory of evolution up to the present day.  :) This is all very interesting and entertaining, but my modest experience of using FPC/Lazarus for high-performance simulations, where resource-intensive algebraic operations take a significant share of the run time, does not confirm this thesis. I have the source code and the results of a comparison with GFortran and C++ in single-threaded tests, which showed a difference in execution speed of 5-10%.
The situation is about the same in multi-threaded tests, ~10%. Here is an example comparing algebraic matrix operations on 5000 x 5000 matrices under equal conditions (same hardware, same program logic) in Fortran, C++, and Free Pascal.

Code: Text
-------------------------------------------------------------------------
            |      GFortran      |          C++       |       FPC       |
            |  (GCC 14.2.0 -O3)  |   (GCC 14.2.0 -O3) |   3.2.2 (-O3)   |
-------------------------------------------------------------------------
Time,  s    |       14.1         |          14.1      |      15.1       |
-------------------------------------------------------------------------

The source code is in the attachment.

Could someone please share their experience using FPC for intensive computing?

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 11349
  • Debugger - SynEdit - and more
    • wiki
Re: FPC for high-performance computing
« Reply #1 on: May 18, 2025, 02:04:32 am »
First of all, I have no examples of my own, none where I have comparisons to other compilers.

I have some code that needs to run fast, and in some cases I had to "help" FPC. That is, I had to give up readability and, for example, introduce low-level pointer code. But again, I can't tell how other compilers would have done, and whether they would have applied the needed optimizations without my sacrifice. (I was told GCC should do it, but that is hearsay.)




Looking at your example: I think your code will see very limited gains from "classic" optimizations. That is, the speed is not determined by how optimal the assembly in the loop is.

Your matrix calculation probably (I have not tested, just from review) has the CPU waiting for the RAM.
The read
   FBTransposed[j * MATRIX_SIZE + k]
means you fetch bits of data at a distance of 5k elements.
In other words: one cache miss follows the next. The CPU spends a lot of time waiting for data to be retrieved from RAM.

IIRC (it is ages since I looked at this) there are some smart ways to access matrices in small blocks at a time, reducing the number of cache misses.

But making the code cache friendly is a matter of algorithm. No compiler will change your algorithm. So that needs to be done in the code.

I haven't spent time re-reading it now, but IIRC Wikipedia has some details: https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm (read the cache paragraph).

I wonder what the numbers are, if you optimise for cache usage.

Fibonacci

  • Hero Member
  • *****
  • Posts: 754
  • Internal Error Hunter
Re: FPC for high-performance computing
« Reply #2 on: May 18, 2025, 03:52:12 am »
ChaCha20, 1 GB

Code: Text
FPC = 6671 ms (153.50 MB/s)
C   = 2594 ms (394.76 MB/s)

Code: Pascal
{$mode objfpc}{$H+} // mode directive added: needed for the += operators below
uses sysutils, strutils;

type
  chacha20state = record
    state: array[0..15] of dword;
    keystream: array[0..15] of dword;
    position: dword;
  end;

procedure chacha20_set_counter(var state: chacha20state; counter: dword);
begin
  state.state[12] := counter;
  state.position := 64;
end;

procedure chacha20_init(var state: chacha20state; key, nonce: string; counter: dword=0);
const
  magic = 'expand 32-byte k';
begin
  fillchar(state, sizeof(state), 0);
  move(magic[1], state.state[0], 16);
  if length(key) > 32 then setlength(key, 32);
  if key <> '' then move(key[1], state.state[4], length(key));
  chacha20_set_counter(state, counter);
  if length(nonce) > 12 then setlength(nonce, 12);
  if nonce <> '' then move(nonce[1], state.state[13], length(nonce));
end;

function rotl32(x, n: dword): dword; inline;
begin
  result := (x shl n) or (x shr (32-n));
end;

procedure chacha20_quarterround(p: pdword; a, b, c, d: byte); inline;
begin
  p[a] += p[b]; p[d] := rotl32(p[d] xor p[a], 16);
  p[c] += p[d]; p[b] := rotl32(p[b] xor p[c], 12);
  p[a] += p[b]; p[d] := rotl32(p[d] xor p[a], 8);
  p[c] += p[d]; p[b] := rotl32(p[b] xor p[c], 7);
end;

procedure chacha20_next_block(var state: chacha20state);
var
  i: integer;
begin
  move(state.state, state.keystream, 64);
  for i := 1 to 10 do begin
    chacha20_quarterround(@state.keystream, 0, 4, 8, 12);
    chacha20_quarterround(@state.keystream, 1, 5, 9, 13);
    chacha20_quarterround(@state.keystream, 2, 6, 10, 14);
    chacha20_quarterround(@state.keystream, 3, 7, 11, 15);
    chacha20_quarterround(@state.keystream, 0, 5, 10, 15);
    chacha20_quarterround(@state.keystream, 1, 6, 11, 12);
    chacha20_quarterround(@state.keystream, 2, 7, 8, 13);
    chacha20_quarterround(@state.keystream, 3, 4, 9, 14);
  end;
  for i := 0 to high(state.keystream) do state.keystream[i] += state.state[i];
  state.state[12] += 1;
  state.position := 0;
end;

procedure chacha20_xor(var state: chacha20state; data: pointer; len: dword);
var
  i: dword;
begin
  for i := 0 to len-1 do begin
    if state.position >= 64 then chacha20_next_block(state);
    pbyte(data+i)^ := pbyte(data+i)^ xor pbyte(@state.keystream[0]+state.position)^;
    inc(state.position);
  end;
end;

const
  size = 1024*1024*1024;

var
  cc: chacha20state;
  data: pointer;
  start, end_: qword;

begin
  chacha20_init(cc, DupeString('a', 32), DupeString('b', 12));
  data := getmem(size);
  start := GetTickCount64;
  chacha20_xor(cc, data, size);
  end_ := GetTickCount64;
  writeln(end_-start, ' ms');
  writeln('verify = ', pint32(@cc.keystream)^);
  writeln(Format('%.2f MB/s', [(size/(1024*1024))/((end_-start)/1000)]));
  readln;
end.

Code: C
#include <windows.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>  /* added: malloc, free */
#include <string.h>  /* added: memcpy, memset */
#include <assert.h>

struct chacha20_context {
  uint32_t keystream32[16];
  size_t position;

  uint8_t key[32];
  uint8_t nonce[12];
  uint64_t counter;

  uint32_t state[16];
};

static uint32_t rotl32(uint32_t x, int n) {
  return (x << n) | (x >> (32 - n));
}

static uint32_t pack4(const uint8_t *a) {
  uint32_t res = 0;
  res |= (uint32_t)a[0] << 0 * 8;
  res |= (uint32_t)a[1] << 1 * 8;
  res |= (uint32_t)a[2] << 2 * 8;
  res |= (uint32_t)a[3] << 3 * 8;
  return res;
}

static void chacha20_init_block(struct chacha20_context *ctx, uint8_t key[], uint8_t nonce[]) {
  memcpy(ctx->key, key, sizeof(ctx->key));
  memcpy(ctx->nonce, nonce, sizeof(ctx->nonce));

  const uint8_t *magic_constant = (uint8_t*)"expand 32-byte k";
  ctx->state[0] = pack4(magic_constant + 0 * 4);
  ctx->state[1] = pack4(magic_constant + 1 * 4);
  ctx->state[2] = pack4(magic_constant + 2 * 4);
  ctx->state[3] = pack4(magic_constant + 3 * 4);
  ctx->state[4] = pack4(key + 0 * 4);
  ctx->state[5] = pack4(key + 1 * 4);
  ctx->state[6] = pack4(key + 2 * 4);
  ctx->state[7] = pack4(key + 3 * 4);
  ctx->state[8] = pack4(key + 4 * 4);
  ctx->state[9] = pack4(key + 5 * 4);
  ctx->state[10] = pack4(key + 6 * 4);
  ctx->state[11] = pack4(key + 7 * 4);
  ctx->state[12] = 0;
  ctx->state[13] = pack4(nonce + 0 * 4);
  ctx->state[14] = pack4(nonce + 1 * 4);
  ctx->state[15] = pack4(nonce + 2 * 4);
}

static void chacha20_block_set_counter(struct chacha20_context *ctx, uint64_t counter) {
  ctx->state[12] = (uint32_t)counter;
  ctx->state[13] = pack4(ctx->nonce + 0 * 4) + (uint32_t)(counter >> 32);
}

static void chacha20_block_next(struct chacha20_context *ctx) {
  for (int i = 0; i < 16; i++) ctx->keystream32[i] = ctx->state[i];

#define CHACHA20_QUARTERROUND(x, a, b, c, d) \
  x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 16); \
  x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 12); \
  x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 8); \
  x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 7);

  for (int i = 0; i < 10; i++) {
    CHACHA20_QUARTERROUND(ctx->keystream32, 0, 4, 8, 12)
    CHACHA20_QUARTERROUND(ctx->keystream32, 1, 5, 9, 13)
    CHACHA20_QUARTERROUND(ctx->keystream32, 2, 6, 10, 14)
    CHACHA20_QUARTERROUND(ctx->keystream32, 3, 7, 11, 15)
    CHACHA20_QUARTERROUND(ctx->keystream32, 0, 5, 10, 15)
    CHACHA20_QUARTERROUND(ctx->keystream32, 1, 6, 11, 12)
    CHACHA20_QUARTERROUND(ctx->keystream32, 2, 7, 8, 13)
    CHACHA20_QUARTERROUND(ctx->keystream32, 3, 4, 9, 14)
  }

  for (int i = 0; i < 16; i++) ctx->keystream32[i] += ctx->state[i];

  uint32_t *counter = ctx->state + 12;
  counter[0]++;
  if (0 == counter[0]) {
    counter[1]++;
    assert(0 != counter[1]);
  }
}

void chacha20_init_context(struct chacha20_context *ctx, uint8_t key[], uint8_t nonce[], uint64_t counter) {
  memset(ctx, 0, sizeof(struct chacha20_context));
  chacha20_init_block(ctx, key, nonce);
  chacha20_block_set_counter(ctx, counter);
  ctx->counter = counter;
  ctx->position = 64;
}

void chacha20_xor(struct chacha20_context *ctx, uint8_t *bytes, size_t n_bytes) {
  uint8_t *keystream8 = (uint8_t*)ctx->keystream32;
  for (size_t i = 0; i < n_bytes; i++) {
    if (ctx->position >= 64) {
      chacha20_block_next(ctx);
      ctx->position = 0;
    }
    bytes[i] ^= keystream8[ctx->position];
    ctx->position++;
  }
}

int main() {
  const size_t SIZE = 1024 * 1024 * 1024;

  uint8_t *data = (uint8_t*)malloc(SIZE);
  if (!data) {
    fprintf(stderr, "malloc failed\n");
    return 1;
  }

  uint8_t key[32];
  uint8_t nonce[12];
  memset(key, 'a', 32);
  memset(nonce, 'b', 12);

  struct chacha20_context ctx;
  chacha20_init_context(&ctx, key, nonce, 0);

  DWORD64 start = GetTickCount64();
  chacha20_xor(&ctx, data, SIZE);
  DWORD64 end = GetTickCount64();

  printf("%llu ms\n", end - start);
  printf("verify = %u\n", *((uint32_t*)ctx.keystream32));  /* was %lld: wrong specifier for uint32_t */

  double elapsed_sec = (end-start) / 1000.0;
  double megabytes = (double)SIZE / (1024.0 * 1024.0);
  double speed = megabytes / elapsed_sec;
  printf("%.2f MB/s\n", speed);

  free(data);
  return 0;
}

dbannon

  • Hero Member
  • *****
  • Posts: 3378
    • tomboy-ng, a rewrite of the classic Tomboy
Re: FPC for high-performance computing
« Reply #3 on: May 18, 2025, 04:31:12 am »
In a previous life I worked with people writing and using HPC code on some (very) big systems. The general approach (with Fortran or C++) was to do most of the work in libraries such as BLAS or LAPACK (etc.). These libraries are highly optimized and designed to be general purpose. So all we need, IMHO, is a Pascal binding for those libraries; bindings already exist for, e.g., C, Fortran, and Java, so why not Pascal?

There is already an MPI interface, https://wiki.freepascal.org/MPICH, though I have no idea how good it is. I would strongly recommend concentrating on MPI rather than threaded code; HPC admins will thank you!

Davo
Lazarus 3, Linux (and reluctantly Win10/11, OSX Monterey)
My Project - https://github.com/tomboy-notes/tomboy-ng and my github - https://github.com/davidbannon

domasz

  • Hero Member
  • *****
  • Posts: 582
Re: FPC for high-performance computing
« Reply #4 on: May 18, 2025, 07:37:42 am »
Usually code written in Delphi/FPC is slower than C/C++, but compared with the most popular languages today (JavaScript, PHP, Python, and Java), modern Pascal compilers are high-performance.
(Java is sometimes fast in benchmarks, but they exclude the Java VM's launch time, which isn't fair.)
So does it really matter that a few languages have faster compilers, when pretty much all the others are slower?

Thaddy

  • Hero Member
  • *****
  • Posts: 17176
  • Ceterum censeo Trump esse delendam
Re: FPC for high-performance computing
« Reply #5 on: May 18, 2025, 08:03:29 am »
Usually code written in Delphi/FPC is slower than C/C++
You don't get it, do you? Like most of the others who "compare" speed.
It is not the language, it is the compiler optimization, and a lot of C++ compilers are slower than FPC.
If you still do not understand that, then such statements are just naive.
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

paule32

  • Hero Member
  • *****
  • Posts: 516
  • One in all. But, not all in one.
Re: FPC for high-performance computing
« Reply #6 on: May 18, 2025, 08:36:45 am »
Such comparisons are indeed silly.
Each of us uses different hardware,
and every hardware has its pros and cons.

So I think such numbers belong only on paper.
MS-IIS - Internet Information Server, Apache, PHP/HTML/CSS, MinGW-32/64 MSys2 GNU C/C++ 13 (-stdc++20), FPC 3.2.2
A Friend in need, is a Friend indeed.

domasz

  • Hero Member
  • *****
  • Posts: 582
Re: FPC for high-performance computing
« Reply #7 on: May 18, 2025, 08:41:54 am »
Of course it's compiler/interpreter optimizations, but C is a thin wrapper around assembly while Pascal is more advanced, and that makes it harder to build an optimizing compiler. Also, huge corporations use and develop C, so more people work on its optimizations.


jwdietrich

  • Hero Member
  • *****
  • Posts: 1257
    • formatio reticularis
Re: FPC for high-performance computing
« Reply #8 on: May 18, 2025, 08:54:40 am »
The speed of generated code depends on many factors.

For a realistic use-case scenario involving object-oriented programming, we could demonstrate that Free Pascal generates faster code than Swift and C++ (and, of course, it was way faster than R and Python).

Paper on implementation study
function GetRandomNumber: integer; // xkcd.com
begin
  GetRandomNumber := 4; // chosen by fair dice roll. Guaranteed to be random.
end;

http://www.formatio-reticularis.de

Lazarus 4.0.0 | FPC 3.2.2 | PPC, Intel, ARM | macOS, Windows, Linux

schuler

  • Sr. Member
  • ****
  • Posts: 265
Re: FPC for high-performance computing
« Reply #9 on: May 18, 2025, 09:29:14 am »
@jwdietrich, thank you for the scientific evidence!



munair

  • Hero Member
  • *****
  • Posts: 884
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: FPC for high-performance computing
« Reply #10 on: May 18, 2025, 11:13:39 am »
As I'm working on the SharpBASIC compiler (which, despite the name, is close to Pascal), optimization is a real challenge. C is indeed more of a wrapper around assembly, but not a "simple" one. A major impact on performance comes from type safety, range checking, and error handling; C is minimal and, as a result, difficult to debug. So trade-offs have to be made in language design and compiler behavior: easier to debug and safer for programmers, or the fastest, most optimized code possible.
It's only logical.

d2010

  • Full Member
  • ***
  • Posts: 171
Re: FPC for high-performance computing
« Reply #11 on: May 18, 2025, 03:50:40 pm »
In one thread, one of the forum guests claimed that FPC is very slow and unsuitable for high-performance computing. Many arguments were
:-X
He uses dynamic memory, but in FPC, Delphi, or even Turbo Pascal, the speed of memory read/write/scan/checking is lower than in C or C++.
In real fact, though, C++ and GNU C (VC98, VC2010) are far worse when it comes to crashes and memory leaks.
The FPC runtime (e.g. system.ppu, classes.ppu) protects your sources against memory leaks much better than C++ protects a source.cpp, source.c, or source.lib.
Anyway, a question for anyone with experience: does Visual C++ 2017 protect sources.cpp better than Delphi XE protects a *.dpr?
« Last Edit: May 18, 2025, 04:26:41 pm by Martin_fr »

LV

  • Sr. Member
  • ****
  • Posts: 290
Re: FPC for high-performance computing
« Reply #12 on: May 18, 2025, 04:14:13 pm »
Thanks to everyone for the helpful tips, comments, and suggestions. I appreciate them.
As for ChaCha20, 1 GB, it's not my "kind of sport". However, I ran these codes on my computer (difference 313%):

Code: Text
FPC 3.2.2  9578 ms
verify = 17398848
106.91 MB/s

C (GCC) 3063 ms
verify = 17398848
334.31 MB/s

If we help the compiler a little and unroll the loops in both programs, the results are as follows (difference 34%):

Code: Text
FPC 3.2.2 Unrolled  2750 ms
verify = 17398848
372.36 MB/s

C (GCC) Unrolled 2047 ms
verify = 17398848
500.24 MB/s

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 11349
  • Debugger - SynEdit - and more
    • wiki
Re: FPC for high-performance computing
« Reply #13 on: May 18, 2025, 04:48:04 pm »
The ChaCha20 code seems to run significantly faster with FPC trunk:

Code: Text
3.2.2 -O3
9375 ms verify = 17398848 109.23 MB/s
9359 ms verify = 17398848 109.41 MB/s
3.2.2 -O4
9329 ms verify = 17398848 109.77 MB/s
9344 ms verify = 17398848 109.59 MB/s

3.2.3 -O3
9281 ms verify = 17398848 110.33 MB/s
9281 ms verify = 17398848 110.33 MB/s

3.3.1 (trunk) -O3
5640 ms verify = 17398848 181.56 MB/s
5672 ms verify = 17398848 180.54 MB/s
3.3.1 (trunk) -O4
5656 ms verify = 17398848 181.05 MB/s

LV

  • Sr. Member
  • ****
  • Posts: 290
Re: FPC for high-performance computing
« Reply #14 on: May 18, 2025, 05:00:43 pm »
The chacha20 seems to be significantly better in fpc trunk

Thanks, cool :o
