Recent

Author Topic: Programming language comparison by implementing the same transit data app  (Read 10381 times)

Leledumbo

  • Hero Member
  • *****
  • Posts: 8746
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: Programming language comparison by implementing the same transit data app
« Reply #30 on: November 19, 2022, 01:56:24 pm »
Thank you. It seems possible to squeeze a little more out of TCSVDoc, at least my Windows version of this
<code intentionally skipped for brevity>
shows a 7-8 percent performance improvement.
Yep, another 200ms improvement. Getting closer to C#. I still think we can do something on the else part of that case, as it will be the most frequent match, so basically we check for pCell's nil-ity on every iteration, which is not so good.
« Last Edit: November 19, 2022, 01:59:11 pm by Leledumbo »

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Programming language comparison by implementing the same transit data app
« Reply #31 on: November 19, 2022, 05:23:19 pm »
<...>
I still think we can do something on the else part of that case, as it will be the most frequent match, so basically we check for pCell's nil-ity on every iteration, which is not so good.

Something like this?
Code: Pascal  [Select][+][-]
  1. procedure TCsvDoc.LoadFromFile(const aFileName: string);
  2. var
  3.   ms: specialize TGAutoRef<TMemoryStream>;
  4.   p, pCell, pEnd: pChar;
  5.   Size: SizeInt;
  6.   Row: TStrList;
  7.   pItem: TStrList.PItem;
  8. begin
  9.   ms.Instance.LoadFromFile(aFileName);
  10.   if FCells = nil then
  11.     FCells := TStrList2D.Create;
  12.   FCells.Clear;
  13.   p := ms.Instance.Memory;
  14.   pEnd := p + ms.Instance.Size;
  15.   pCell := p;
  16.   Size := -1;
  17.   Row := nil;
  18.  
  19.   while p < pEnd do begin
  20.     case p^ of
  21.       ',':
  22.         begin
  23.           if Row = nil then begin
  24.             Row := TStrList.Create(Max(Size, DEFAULT_CONTAINER_CAPACITY));
  25.             FCells.Add(Row);
  26.           end;
  27.           pItem := Row.UncMutable[Row.Add('')];
  28.           if p - pCell > 0 then begin
  29.             SetLength(pItem^, p - pCell);
  30.             Move(pCell^, PChar(pItem^)^, p - pCell);
  31.           end;
  32.           pCell := p + 1;
  33.         end;
  34.       #10, #13:
  35.         if Row <> nil then begin
  36.           pItem := Row.UncMutable[Row.Add('')];
  37.           if p - pCell > 0 then begin
  38.             SetLength(pItem^, p - pCell);
  39.             Move(pCell^, PChar(pItem^)^, p - pCell);
  40.           end;
  41.           if Size = -1 then
  42.             Size := Row.Count;
  43.           Row := nil;
  44.           pCell := p + 1;
  45.           Inc(pCell, Ord(pCell^ in [#10, #13]));
  46.         end;
  47.     end;
  48.     Inc(p);
  49.   end;
  50.   if Row = nil then exit;
  51.   pItem := Row.UncMutable[Row.Add('')];
  52.   SetLength(pItem^, p - pCell);
  53.   Move(pCell^, PChar(pItem^)^, p - pCell);
  54. end;
  55.  

Leledumbo

  • Hero Member
  • *****
  • Posts: 8746
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: Programming language comparison by implementing the same transit data app
« Reply #32 on: November 20, 2022, 10:04:46 pm »
Something like this?
Yep, but I'm surprised the improvement is very little, despite consistent. 3 times call result, old code:
Code: [Select]
$ ./app
parsed 1790905 stop times in 3.389 seconds
parsed 71091 trips in 133.000ms
$ ./app
parsed 1790905 stop times in 3.396 seconds
parsed 71091 trips in 133.000ms
$ ./app
parsed 1790905 stop times in 3.386 seconds
parsed 71091 trips in 131.000ms
New code:
Code: [Select]
$ ./app
parsed 1790905 stop times in 3.373 seconds
parsed 71091 trips in 128.000ms
$ ./app
parsed 1790905 stop times in 3.391 seconds
parsed 71091 trips in 128.000ms
$ ./app
parsed 1790905 stop times in 3.378 seconds
parsed 71091 trips in 127.000ms
Nevertheless, I think this is good enough. I'm working on the HTTP server as I've found Fundamentals library (both version 4 and 5) also has it, but mORMot 2 might be the first I would try, as mORMot (1) has been very concerned about benchmarks and performance.

Leledumbo

  • Hero Member
  • *****
  • Posts: 8746
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: Programming language comparison by implementing the same transit data app
« Reply #33 on: November 23, 2022, 12:44:03 pm »
I've successfully turned the server into mORMot 2 one, however, the slowness remains. So I tried localizing to find bottlenecks.
I count elapsed time between 2 operations: a call to BuildTripResponse and JSONification of the response, using the last 5 parameters from the loadTest.js, with shuffleArray disabled.

Here's the result from Go version:
Code: [Select]
buildTripResponse: 1.324407ms
jsonify: 16.314437ms
buildTripResponse: 912.183µs
jsonify: 10.437676ms
buildTripResponse: 779.098µs
jsonify: 8.114551ms
buildTripResponse: 103.812µs
jsonify: 1.047072ms
buildTripResponse: 68.561µs
jsonify: 953.276µs
and Pascal version:
Code: [Select]
BuildTripResponse: 1ms
JSONify: 36ms
BuildTripResponse: 4ms
JSONify: 90ms
BuildTripResponse: 3ms
JSONify: 73ms
BuildTripResponse: 7ms
JSONify: 176ms
BuildTripResponse: 23ms
JSONify: 581ms
or maybe in table format:
Code: [Select]
+------------+----------------------+-----------------------+-----------------+----------------+
| request no | Go buildTripResponse | FPC BuildTripResponse | Go json.Marshal | FPC FormatJSON |
+------------+----------------------+-----------------------+-----------------+----------------+
|          1 |               1.32ms |                   1ms |         16.31ms |           36ms |
|          2 |               0.91ms |                   4ms |         10.44ms |           90ms |
|          3 |               0.78ms |                   3ms |          8.11ms |           73ms |
|          4 |               0.10ms |                   7ms |          1.05ms |          176ms |
|          5 |               0.07ms |                  23ms |          0.95ms |          581ms |
The JSON-ification is clearly a bottleneck, partly due to my way of coding it as it expects array response at the top instead of object, and there's no ArrayToJSON, only ObjectToJSON available, hence my manual per object method. but the BuildTripResponse is not good either, losing up to 328x (the last result).

abouchez

  • Full Member
  • ***
  • Posts: 110
    • Synopse
Re: Programming language comparison by implementing the same transit data app
« Reply #34 on: November 23, 2022, 03:53:10 pm »
There are ArrayToJson-like methods in mORMot.

Use DynArrayLoadJson() and DynArraySaveJson() in mormot.core.json.

Ensure you defined the embedded records as packed, and define the fields as text calling Rtti.RegisterFromText() once at startup.

You could share your code on a gist (or somewhere else) so that we could look at it and optimized it to match mORMot's best performance needs.
« Last Edit: November 23, 2022, 03:55:41 pm by abouchez »

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Programming language comparison by implementing the same transit data app
« Reply #35 on: November 23, 2022, 03:58:23 pm »
....
Code: [Select]
+------------+----------------------+-----------------------+-----------------+----------------+
| request no | Go buildTripResponse | FPC BuildTripResponse | Go json.Marshal | FPC FormatJSON |
+------------+----------------------+-----------------------+-----------------+----------------+
|          1 |               1.32ms |                   1ms |         16.31ms |           36ms |
|          2 |               0.91ms |                   4ms |         10.44ms |           90ms |
|          3 |               0.78ms |                   3ms |          8.11ms |           73ms |
|          4 |               0.10ms |                   7ms |          1.05ms |          176ms |
|          5 |               0.07ms |                  23ms |          0.95ms |          581ms |
...

IIRC, the fastest way to stringify TJSONData is TJSONData.DumpJSON().

Leledumbo

  • Hero Member
  • *****
  • Posts: 8746
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: Programming language comparison by implementing the same transit data app
« Reply #36 on: November 23, 2022, 10:45:34 pm »
You could share your code on a gist (or somewhere else) so that we could look at it and optimized it to match mORMot's best performance needs.
I've pushed it to the repo, with all the elapsed time counters. Feel free to dig in and find things to optimize. Btw, my test still shows that mORMot HTTP server bandwidth is not that much faster from fphttpserver, even similar. So maybe you can also take a look at that.

Btw, I copied only needed files from mormot2 repository and not using the Lazarus packages. Hope you already have a directory with all precompiled units and static assets.

abouchez

  • Full Member
  • ***
  • Posts: 110
    • Synopse
Re: Programming language comparison by implementing the same transit data app
« Reply #37 on: November 24, 2022, 09:57:44 am »
I just found the repo at https://github.com/leledumbo/transit-lang-cmp
*Edit*: there are .o and .ppu files included in the repo. :(

In fact, there are several HTTP servers.
THttpAsyncServer is meant for thousands of concurrent kept-alive clients.
THttpServer is socket-based, and use one thread per kept-alive client.
THttpApiServer is Windows only, and use the http.sys API.

For 50 concurrent clients, on Windows, THttpApiServer is likely to be the fastest, and perhaps THttpServer also would have good numbers.

From what I can see, the main bottleneck is not the server, but the trip/route process.  A lot of allocations, and I am not sure that using generics is the fastest path.
I will try to reimplement it in a mORMot way.
« Last Edit: November 24, 2022, 09:59:21 am by abouchez »

Leledumbo

  • Hero Member
  • *****
  • Posts: 8746
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: Programming language comparison by implementing the same transit data app
« Reply #38 on: November 24, 2022, 04:49:41 pm »
*Edit*: there are .o and .ppu files included in the repo. :(
Dang, must be accidentally added when committing.
In fact, there are several HTTP servers.
THttpAsyncServer is meant for thousands of concurrent kept-alive clients.
THttpServer is socket-based, and use one thread per kept-alive client.
THttpApiServer is Windows only, and use the http.sys API.

For 50 concurrent clients, on Windows, THttpApiServer is likely to be the fastest, and perhaps THttpServer also would have good numbers.
As the load test does concurrent attacks, I choose the first one.
From what I can see, the main bottleneck is not the server, but the trip/route process.  A lot of allocations, and I am not sure that using generics is the fastest path.
I will try to reimplement it in a mORMot way.
Please do, but please also compare with the original Go code from which I converted the code from.

abouchez

  • Full Member
  • ***
  • Posts: 110
    • Synopse
Re: Programming language comparison by implementing the same transit data app
« Reply #39 on: November 25, 2022, 09:14:38 am »
Please check https://github.com/synopse/mORMot2/tree/master/ex/lang-cmp

This is the "mORMot way" of implementing the language comparison server.
- It uses mORMot RTTI to parse the CSV into records, and generate JSON from the results.
- We did not use a map/dictionary, but TDynArray feature of binary searching of sorted data.
- By default, the CSV parse will "intern" all values to de-duplicate strings: it is slightly slower (around 30% I guess), but it reduces the memory a lot, and also speed up comparisons of identical strings (they both share the same pointer, so no need to compare the characters).

From what I can see, performance should be much better that the current pascal version.
One amazing difference with other implementations is the memory consumption. A running server consume less than 70MB or RAM on my PC with all the data loaded - thanks to interning: it is 380 MB otherwise, which is much lower than alternatives anyway.
Please compare with GO and RUST on your machine.

For a given schedule:
Code: [Select]
ab@ab:~/dev/github/mORMot2$ wrk -c 50 -d 15s -t 4 http://localhost:4000/schedules/354
Running 15s test @ http://localhost:4000/schedules/354
  4 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    16.54ms    4.18ms  53.47ms   91.70%
    Req/Sec   728.48     62.13   840.00     72.00%
  43545 requests in 15.02s, 8.96GB read
Requests/sec:   2898.99
Transfer/sec:    611.09MB

For a void result (i.e. a non existing schedule which return only [] JSON result):
Code: [Select]
ab@ab :~/dev/github/mORMot2$ wrk -c 50 -d 15s -t 4 http://localhost:4000/schedules/7777777
Running 15s test @ http://localhost:4000/schedules/7777777
  4 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   323.12us  396.82us  20.32ms   97.96%
    Req/Sec    34.93k     1.70k   41.89k    72.83%
  2086314 requests in 15.01s, 240.75MB read
Requests/sec: 138980.30
Transfer/sec:     16.04MB

Leledumbo

  • Hero Member
  • *****
  • Posts: 8746
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: Programming language comparison by implementing the same transit data app
« Reply #40 on: November 25, 2022, 11:53:15 am »
Please check https://github.com/synopse/mORMot2/tree/master/ex/lang-cmp

This is the "mORMot way" of implementing the language comparison server.
Impressive loading time on my machine:
Code: [Select]
parsed 1790905 stop times in 972.85ms
parsed 71091 trips in 45.05ms
- It uses mORMot RTTI to parse the CSV into records, and generate JSON from the results.
I see you just added the CSV loading functionality and boy, mORMot's RTTI is surely powerful.
- We did not use a map/dictionary, but TDynArray feature of binary searching of sorted data.
- By default, the CSV parse will "intern" all values to de-duplicate strings: it is slightly slower (around 30% I guess), but it reduces the memory a lot, and also speed up comparisons of identical strings (they both share the same pointer, so no need to compare the characters).
Lots of pointers, but I guess that what makes it fast, compared to the convenience of generics.
From what I can see, performance should be much better that the current pascal version.
One amazing difference with other implementations is the memory consumption. A running server consume less than 70MB or RAM on my PC with all the data loaded - thanks to interning: it is 380 MB otherwise, which is much lower than alternatives anyway.
Please compare with GO and RUST on your machine.
Done:
Code: [Select]
running (0m34.7s), 00/50 VUs, 296 complete and 0 interrupted  | running (0m31.8s), 00/50 VUs, 308 complete and 0 interrupted
default ✓ [======================================] 50 VUs  30   default ✓ [======================================] 50 VUs  30

     data_received..................: 27 GB  763 MB/s         |      data_received..................: 30 GB  948 MB/s
     data_sent......................: 2.8 MB 79 kB/s          |      data_sent......................: 2.9 MB 90 kB/s
     http_req_blocked...............: avg=7.46µs  min=1.09µs  |      http_req_blocked...............: avg=7.78µs  min=1.04µs
     http_req_connecting............: avg=1.17µs  min=0s      |      http_req_connecting............: avg=2.71µs  min=0s     
     http_req_duration..............: avg=56.71ms min=73.52µs |      http_req_duration..............: avg=50.84ms min=143.5µs
       { expected_response:true }...: avg=56.71ms min=73.52µs |        { expected_response:true }...: avg=50.84ms min=143.5µs
     http_req_failed................: 0.00%  ✓ 0          ✗ 2 |      http_req_failed................: 0.00%  ✓ 0          ✗ 3
     http_req_receiving.............: avg=24.56ms min=16.13µs |      http_req_receiving.............: avg=7.9ms   min=15.58µs
     http_req_sending...............: avg=23.58µs min=5.41µs  |      http_req_sending...............: avg=54.32µs min=5.56µs
     http_req_tls_handshaking.......: avg=0s      min=0s      |      http_req_tls_handshaking.......: avg=0s      min=0s     
     http_req_waiting...............: avg=32.12ms min=45.78µs |      http_req_waiting...............: avg=42.88ms min=98.37µs
     http_reqs......................: 29304  844.805816/s     |      http_reqs......................: 30492  958.247358/s
     iteration_duration.............: avg=5.62s   min=4.38s   |      iteration_duration.............: avg=5.04s   min=2s     
     iterations.....................: 296    8.533392/s       |      iterations.....................: 308    9.679266/s
     vus............................: 13     min=13       max |      vus............................: 30     min=30       max
     vus_max........................: 50     min=50       max        vus_max........................: 50     min=50       max
Go version still runs a little bit faster, but not by much. I still attribute this to their higher HTTP server download bandwidth for some reason.

I guess it's better for your version to be submitted instead of my lackluster one.

abouchez

  • Full Member
  • ***
  • Posts: 110
    • Synopse
Re: Programming language comparison by implementing the same transit data app
« Reply #41 on: November 25, 2022, 12:22:13 pm »
You are right: some pointers, but they are typed pointers, so it is a somewhat safe approach.
Using array indexes with no runtime range checking is as unsafe as inc(typedpointer).

I have just commited a huge LangCmp sample speed-up
https://github.com/synopse/mORMot2/commit/cfdcb7f2
- we use PUtf8Char for JSON responses
- it is fair, since Rust is using &'data str
- string refcounting has a price, even if it does not allocate memory: but it has a thread-safe incremented lock, which is somewhat slow - even slower on ARM
- also made the sample Windows compatible
- should be faster than Go now - please try !

Numbers are pretty good now:
Code: [Select]
ab@ab:~/dev/github/mORMot2$ wrk -c 50 -d 15s -t 4 http://localhost:4000/schedules/354
Running 15s test @ http://localhost:4000/schedules/354
  4 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.97ms    2.73ms  38.58ms   92.03%
    Req/Sec     1.10k    70.60     1.30k    70.83%
  65615 requests in 15.01s, 13.51GB read
Requests/sec:   4371.23
Transfer/sec:      0.90GB

To be included on the main repository, performance may be less obvious on the Mac M1 / AARCH64 architecture.
The mORMot HTTP server is less optimized on Mac (use good old poll API there), and there is no optimized asm involved.
But I have seen string interning to be of good benefit on AARCH64, because it avoids most memory allocations.
My problem is that I don't know how to propose a simple way of building the project with no FPC/Lazarus knowledge. ;)

I wonder about the memory consumption of the server on your machine, e.g. in comparison with Go and its GC.
« Last Edit: November 25, 2022, 12:32:13 pm by abouchez »

Fred vS

  • Hero Member
  • *****
  • Posts: 3158
    • StrumPract is the musicians best friend
Re: Programming language comparison by implementing the same transit data app
« Reply #42 on: November 25, 2022, 02:00:20 pm »
@Leledumbo + @abouchez + @avk : WoW!  ;D

Many thanks to show that fpc is still standing and continuing to fight in the ring.

[EDIT] Did you also test with fpc-llvm with -O4 ?

Fre;D
« Last Edit: November 25, 2022, 02:12:30 pm by Fred vS »
I use Lazarus 2.2.0 32/64 and FPC 3.2.2 32/64 on Debian 11 64 bit, Windows 10, Windows 7 32/64, Windows XP 32,  FreeBSD 64.
Widgetset: fpGUI, MSEgui, Win32, GTK2, Qt.

https://github.com/fredvs
https://gitlab.com/fredvs
https://codeberg.org/fredvs

Leledumbo

  • Hero Member
  • *****
  • Posts: 8746
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: Programming language comparison by implementing the same transit data app
« Reply #43 on: November 25, 2022, 11:25:47 pm »
You are right: some pointers, but they are typed pointers, so it is a somewhat safe approach.
Using array indexes with no runtime range checking is as unsafe as inc(typedpointer).

I have just commited a huge LangCmp sample speed-up
https://github.com/synopse/mORMot2/commit/cfdcb7f2
- we use PUtf8Char for JSON responses
- it is fair, since Rust is using &'data str
- string refcounting has a price, even if it does not allocate memory: but it has a thread-safe incremented lock, which is somewhat slow - even slower on ARM
- also made the sample Windows compatible
- should be faster than Go now - please try !
Nice!
Code: [Select]
parsed 1790905 stop times in 968.43ms                         | parsed 1790905 stop times in 3.245251432s
parsed 71091 trips in 39.54ms                                 | parsed 71091 trips in 85.747852ms

running (0m33.4s), 00/50 VUs, 348 complete and 0 interrupted  | running (0m32.3s), 00/50 VUs, 320 complete and 0 interrupted
default ✓ [======================================] 50 VUs  30   default ✓ [======================================] 50 VUs  30

     data_received..................: 31 GB  933 MB/s         |      data_received..................: 31 GB  971 MB/s
     data_sent......................: 3.2 MB 97 kB/s          |      data_sent......................: 3.0 MB 92 kB/s
     http_req_blocked...............: avg=9µs     min=1.09µs  |      http_req_blocked...............: avg=6.77µs  min=1.09µs
     http_req_connecting............: avg=2.95µs  min=0s      |      http_req_connecting............: avg=1.73µs  min=0s     
     http_req_duration..............: avg=47.59ms min=97.28µs |      http_req_duration..............: avg=49.02ms min=123.81µ
       { expected_response:true }...: avg=47.59ms min=97.28µs |        { expected_response:true }...: avg=49.02ms min=123.81µ
     http_req_failed................: 0.00%  ✓ 0           ✗  |      http_req_failed................: 0.00%  ✓ 0          ✗ 3
     http_req_receiving.............: avg=9.66ms  min=15.35µs |      http_req_receiving.............: avg=5.92ms  min=14.76µs
     http_req_sending...............: avg=87.24µs min=5.2µs   |      http_req_sending...............: avg=70.71µs min=5.2µs 
     http_req_tls_handshaking.......: avg=0s      min=0s      |      http_req_tls_handshaking.......: avg=0s      min=0s     
     http_req_waiting...............: avg=37.83ms min=54.74µs |      http_req_waiting...............: avg=43.02ms min=91.84µs
     http_reqs......................: 34452  1032.205528/s    |      http_reqs......................: 31680  981.949476/s
     iteration_duration.............: avg=4.72s   min=3.54s   |      iteration_duration.............: avg=4.86s   min=2.19s 
     iterations.....................: 348    10.426318/s      |      iterations.....................: 320    9.918682/s
     vus............................: 30     min=30        ma |      vus............................: 15     min=15       max
     vus_max........................: 50     min=50        ma |      vus_max........................: 50     min=50       max
To be included on the main repository, performance may be less obvious on the Mac M1 / AARCH64 architecture.
The mORMot HTTP server is less optimized on Mac (use good old poll API there), and there is no optimized asm involved.
But I have seen string interning to be of good benefit on AARCH64, because it avoids most memory allocations.
Yes, I don't have the machine either, so I can't test myself.
My problem is that I don't know how to propose a simple way of building the project with no FPC/Lazarus knowledge. ;)
My local build uses -Fu<mormot2>/src/* and -Fi<mormot2>/src as well as -FU<somewhere else>, this way the compiled units will end up in a single directory instead of alongside their source code.
I wonder about the memory consumption of the server on your machine, e.g. in comparison with Go and its GC.
Upon finished loading the CSV, it only eats 80MB, heck so little. Sounds a bit magical. But during load test, it fluctuates between 250-350MB, upon which it returns to 80MB at the end. I kinda feel like a GC behavior here, or just a proper memory deallocation scheme.
The Go version eats 925MB upon finished loading the CSV. During load test, it tops at 1.5GB, returning to 925MB afterwards.

Leledumbo

  • Hero Member
  • *****
  • Posts: 8746
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: Programming language comparison by implementing the same transit data app
« Reply #44 on: November 26, 2022, 12:15:12 am »
@abouchez:
I've added all the required files from mORMot 2 repo, alongside LICENSE, for easy building reason, also a Makefile for the same reason.
a README.md is provided to mention mORMot 2 and your name, please check and feel free to ask for change if you mind something:
https://github.com/leledumbo/transit-lang-cmp/tree/main/trascal

 

TinyPortal © 2005-2018