So this could be a case study for dynamic arrays. Perhaps the size exceeds the maximum?
The maximum size of a dynamic array is High(SizeInt), so it differs between 32- and 64-bit: SizeInt is LongInt on 32-bit targets and Int64 on 64-bit targets.
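For illustration, a tiny program (my own snippet, not from the thread) that prints the platform limit:

program SizeIntDemo;
{$mode objfpc}
begin
  // High(SizeInt) = 2147483647 on 32-bit, 9223372036854775807 on 64-bit.
  WriteLn('Max dynamic array length: ', High(SizeInt));
end.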
I seem to have hit either a bug or limitation of dynamic array. Attached is my current attempt, it crashes when reading stop_times.txt (that file has 1790906 rows!) with the following stack trace:<...>
I guess the reason is mainly in this line:
Hell, you're correct, alongside another SetLength in GetTrips. I guess Go doesn't allocate immediately while FPC does, and here it allocates too much. Now it succeeds; indeed the amount of RAM eaten is almost 2GB (1.8GB to be exact). I'll continue with the benchmark, could be fun to see where FPC sits.
running (0m43.4s), 00/50 VUs, 50 complete and 0 interrupted | running (0m34.0s), 00/50 VUs, 286 complete and 0 interrupted
default ✓ [======================================] 50 VUs 30 default ✓ [======================================] 50 VUs 30
data_received..................: 658 MB 15 MB/s | data_received..................: 28 GB 823 MB/s
data_sent......................: 106 kB 2.4 kB/s | data_sent......................: 2.7 MB 78 kB/s
http_req_blocked...............: avg=6.81ms min=0s | http_req_blocked...............: avg=7.66µs min=1.08µs
http_req_connecting............: avg=6.38ms min=0s | http_req_connecting............: avg=2.12µs min=0s
http_req_duration..............: avg=428.67ms min=0s | http_req_duration..............: avg=57.8ms min=98.58µs
{ expected_response:true }...: avg=2.32s min=1.04ms | { expected_response:true }...: avg=57.8ms min=98.58µs
http_req_failed................: 85.63% ✓ 4239 ✗ 7 | http_req_failed................: 0.00% ✓ 0 ✗ 2
http_req_receiving.............: avg=1.96ms min=0s | http_req_receiving.............: avg=7.16ms min=15.06µs
http_req_sending...............: avg=484.72µs min=0s | http_req_sending...............: avg=42.78µs min=5.11µs
http_req_tls_handshaking.......: avg=0s min=0s | http_req_tls_handshaking.......: avg=0s min=0s
http_req_waiting...............: avg=426.23ms min=0s | http_req_waiting...............: avg=50.59ms min=74.66µs
http_reqs......................: 4950 114.065486/s | http_reqs......................: 28314 831.923149/s
iteration_duration.............: avg=43.35s min=43.25s | iteration_duration.............: avg=5.73s min=3.91s
iterations.....................: 50 1.152177/s | iterations.....................: 286 8.403264/s
vus............................: 50 min=50 max | vus............................: 4 min=4 max
vus_max........................: 50 min=50 max vus_max........................: 50 min=50 max
And I got so many connection refused errors (see that http_req_failed? It means only about 15% of all requests were handled successfully). Lastly, fphttpserver is really not meant for high performance; I remember Marco saying so a couple of years ago. So maybe I need an alternative implementation tuned for performance.
The latest code can be found on my fork (https://github.com/leledumbo/transit-lang-cmp/tree/main/trascal) of the original repo. I don't think I will make a PR until the performance is satisfactory.
<...>
I guess Go doesn't allocate immediately while FPC does, and here it allocates too much.
<...>
Maybe base it on mORMot2 (https://github.com/synopse/mORMot2) instead?
I'll check if this example (https://github.com/synopse/mORMot2/blob/master/ex/http-server-raw/httpServerRaw.dpr) is enough to use as a base.
It's just that in the Go version, stopTimes is not an array but a slice, whose capacity can be increased on demand.
All of my arrays are dynamic as well, which should be equivalent to a Go slice.
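That said, a dynamic array only behaves like a slice if it is grown with spare capacity; SetLength(A, Length(A) + 1) per element reallocates every time. A minimal sketch of the amortized-doubling idea (the Append helper and the external Count variable are mine, not from the repo):

{$mode objfpc}
type
  TIntArray = array of Integer;

// Grow the backing store geometrically, like Go's append():
// SetLength is only hit when the array is full, so appends are
// amortized O(1) instead of one reallocation per element.
procedure Append(var A: TIntArray; var Count: SizeInt; const Value: Integer);
begin
  if Count = Length(A) then
    if Length(A) = 0 then
      SetLength(A, 16)
    else
      SetLength(A, Length(A) * 2);
  A[Count] := Value;
  Inc(Count);
end;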
BTW, on my machine TCSVDocument parses stop_times.txt in about 14 seconds. I suspect that even loading the file into a TStringList and then parsing it line by line using string.Split() would go about 6-7 times faster.
Either that, or I'll just stream it along the way; no idea which one is faster. But I still think TCSVDocument is just not made with performance in mind (plus that Cells property is kinda arrrgh because it puts column first instead of row first).
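A minimal sketch of the TStringList idea, under the assumption that this dataset has no quoted fields (a plain comma split is not a general CSV parser):

program SplitDemo;
{$mode objfpc}{$H+}
uses Classes, SysUtils;
var
  Lines: TStringList;
  Fields: TStringArray;
  i: Integer;
begin
  Lines := TStringList.Create;
  try
    Lines.LoadFromFile('../MBTA_GTFS/stop_times.txt');
    for i := 1 to Lines.Count - 1 do  // row 0 is the header
    begin
      // Naive split: fine for this data, wrong for quoted CSV fields.
      Fields := Lines[i].Split([',']);
      // Fields[0] = trip_id, Fields[1] = arrival_time,
      // Fields[2] = departure_time, Fields[3] = stop_id
    end;
  finally
    Lines.Free;
  end;
end.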
running (0m52.6s), 00/15 VUs, 15 complete and 0 interrupted | running (0m30.8s), 00/15 VUs, 301 complete and 0 interrupted
default ✓ [======================================] 15 VUs 30 default ✓ [======================================] 15 VUs 30
data_received..................: 1.5 GB 28 MB/s | data_received..................: 30 GB 956 MB/s
data_sent......................: 277 kB 5.3 kB/s | data_sent......................: 2.8 MB 91 kB/s
http_req_blocked...............: avg=139.3µs min=66.52µ | http_req_blocked...............: avg=4.79µs min=1.06µs
http_req_connecting............: avg=2.64µs min=0s | http_req_connecting............: avg=725ns min=0s
http_req_duration..............: avg=526.63ms min=2.51ms | http_req_duration..............: avg=15.24ms min=85.57µs
{ expected_response:true }...: avg=526.63ms min=2.51ms | { expected_response:true }...: avg=15.24ms min=85.57µs
http_req_failed................: 0.00% ✓ 0 ✗ 14 | http_req_failed................: 0.00% ✓ 0 ✗ 2
http_req_receiving.............: avg=1.84ms min=34.02µ | http_req_receiving.............: avg=1.41ms min=12.72µs
http_req_sending...............: avg=19.56ms min=19.47µ | http_req_sending...............: avg=49.25µs min=4.83µs
http_req_tls_handshaking.......: avg=0s min=0s | http_req_tls_handshaking.......: avg=0s min=0s
http_req_waiting...............: avg=505.22ms min=181.48 | http_req_waiting...............: avg=13.78ms min=54.7µs
http_reqs......................: 1485 28.236484/s | http_reqs......................: 29799 966.969529/s
iteration_duration.............: avg=52.14s min=51.45s | iteration_duration.............: avg=1.51s min=898.64m
iterations.....................: 15 0.285217/s | iterations.....................: 301 9.767369/s
vus............................: 11 min=11 max= | vus............................: 15 min=15 max
vus_max........................: 15 min=15 max= | vus_max........................: 15 min=15 max
- Only one last bit: the HTTP server. I just realized that the number of iterations = number of VUs in both the previously failed test and the last successful one; it seems the connection isn't properly closed after sending the response, leaving each VU stuck, unable to issue another request.
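If the connection really is left dangling, one blunt workaround would be to opt out of keep-alive entirely; a sketch of an fphttpserver request handler doing that (untested here, and it trades throughput for predictable connection teardown):

{$mode objfpc}{$H+}
uses Classes, httpdefs, fphttpserver;

type
  TMyHandler = class
    procedure HandleRequest(Sender: TObject;
      var ARequest: TFPHTTPConnectionRequest;
      var AResponse: TFPHTTPConnectionResponse);
  end;

procedure TMyHandler.HandleRequest(Sender: TObject;
  var ARequest: TFPHTTPConnectionRequest;
  var AResponse: TFPHTTPConnectionResponse);
begin
  // Ask the server to close the socket after each response,
  // so a stuck keep-alive connection cannot block a VU.
  AResponse.SetCustomHeader('Connection', 'close');
  AResponse.Content := '...';
end;

// Wired up with: Server.OnRequest := Handler.HandleRequest;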
EDIT: forgot to commit csvutils.pas. It now uses a low-level implementation that speeds up reading stop_times.txt by another 33% (7s -> 5s). I have no idea what technique Go employs to reach that 2s. Impressive stuff.
EDIT: forgot to commit csvutils.pas. It now uses a low-level implementation that speeds up reading stop_times.txt by another 33% (7s -> 5s). I have no idea what technique Go employs to reach that 2s. Impressive stuff.
Curious, which platform did you test on? I just ran your latest version against Go in a Linux x86-64 virtual machine and got a slightly different result:
You could try with FPC 3.3.1, as there were some fixes (https://gitlab.com/freepascal.org/fpc/source/-/commit/a1a30876d596e9bca2a5409b53b0fc637eda5dfd) regarding the closing of sockets recently (don't know if that is what affects you, however).
I do use 3.3.1, updated like 3 days ago or so. Let me update again.
What is the time if you simply read the file with TFileStream (discarding the data, only the raw read time)?
You mean only the
Curious, which platform did you test on? I just ran your latest version against Go in a Linux x86-64 virtual machine and got a slightly different result:
Same Linux x86_64, but on a real machine: i7-7700HQ, DDR4-2400 dual channel, SanDisk Extreme Portable SSD V2 500GB over USB 3.0.
go version: go1.19.3 linux/amd64
go:  parsed 1739278 stop times in 2.3342127s
fpc: parsed 1739278 stop times in 2.470 seconds
fpc version: 3.3.1-12058-g9b6926c5f5 linux x86_64
Same Linux x86_64, but on a real machine: i7-7700HQ, DDR4-2400 dual channel, SanDisk Extreme Portable SSD V2 500GB over USB 3.0.
EDIT: Optimized by replacing the direct dynamic array with TVector, which has a quite efficient growth factor; I gained an additional 250ms.
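For reference, fcl-stl's vector lives in the gvector unit; a minimal usage sketch:

program VectorDemo;
{$mode objfpc}
uses gvector;
type
  TIntVector = specialize TVector<Integer>;
var
  V: TIntVector;
  i: Integer;
begin
  V := TIntVector.Create;
  try
    for i := 1 to 1000000 do
      V.PushBack(i);  // capacity grows geometrically: few reallocations
    WriteLn('size = ', V.Size);
  finally
    V.Free;
  end;
end.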
Hmm, then it's even more curious; my virtual Linux is running on an ancient i3-4150 with an HDD.
That's sick, how come it beats my 4-generations-younger CPU with an SSD?
It seems even more:
parsed 1739278 stop times in 2.124 seconds
Same reason as above, I think.
I may have missed something, but what are the optimization settings for Go and FPC? And which linker is used?
Go: none, they don't have individual optimization switches, only all (default) or none. FPC uses -CX -XXs -O3 (tried -O4, doesn't matter much). The linker is also the default for both.
Furthermore, language comparisons are futile, since what really matters is the compiler and linker implementation.
Sure, I do realize this; that's why I said FPC, not Pascal. I still want to know where FPC lies in the performance realm relative to more mainstream language implementations. I consider only Google Go's gc, ignoring gccgo and gollvm even though all three are officially supported.
In principle there is no such thing as one language being faster than another. It is all about how the compiler generates efficient code and how the linker can further optimize.
The above is a constant sorrow and shows that not many people get how it really works...
Furthermore, language comparisons are futile, since what really matters is the compiler and linker implementation.
Don't forget about libraries too.
Libraries by themselves are by their very nature not part of it: they have nothing to do with code generation.
BTW, lines 245-246 of app.pas look suspicious. Seems like it should be:
Nice catch, it might be a problem for correctness, though not performance.
LStopTimesIx.Add(i - 1);
AStopTimes[i - 1] := TStopTime.Create(LTrip, LCSV.Cells[3, i], LCSV.Cells[1, i], LCSV.Cells[2, i]);
There is another question: is it really necessary to fully parse a multi-megabyte CSV document if only its first 4 columns are needed?
My expectation is to use generic (in the sense of not specifically tailored for this need) CSV-loading code, because the Go version also uses the generic encoding/csv package. Hence, the whole parsing still needs to be done. Using TMemoryStream might be a good idea, since FPC doesn't yet optimize array indexing with loop variables (I requested it years ago, but FPK only came up with a showcase and never really committed it to the repo, AFAIR), which I believe Go's gc has. So using explicit pointers is the way to go.
For example, this version works out in 1.317 s, and if primitives from LGenerics are used, then in 0.810 s:
...
type
...
  TIntList = specialize TList<Integer>;
  TStringIntListMap = specialize TObjectDictionary<string, TIntList>;
  TStopTimeDynArr = specialize TObjectList<TStopTime>;
...
procedure GetStopTimes(out AStopTimes: TStopTimeDynArr; out AStopTimesIxByTrip: TStringIntListMap);
type
  TKeySet = array[0..3] of string;

  procedure ParseLine(p, pEnd: PChar; out Keys: TKeySet);
  var
    Idx: Integer;
    pStart: PChar;
  begin
    pStart := p;
    Idx := 0;
    while p <= pEnd do
    begin
      if p^ = ',' then
      begin
        SetLength(Keys[Idx], p - pStart);
        Move(pStart^, Keys[Idx][1], p - pStart);
        if Idx = 3 then exit;
        Inc(Idx);
        pStart := p + 1;
      end;
      Inc(p);
    end;
  end;

var
  ms: TMemoryStream;
  LStart, LEnd: TDateTime;
  p, pStart, pStop: PChar;
  LStopTimesIx: TIntList;
  k: TKeySet;
  HeaderOk: Boolean = False;
begin
  ms := TMemoryStream.Create;
  try
    LStart := Now;
    ms.LoadFromFile('../MBTA_GTFS/stop_times.txt');
    p := ms.Memory;
    pStop := p + ms.Size;
    pStart := nil;
    AStopTimesIxByTrip := TStringIntListMap.Create([doOwnsValues]);
    AStopTimes := TStopTimeDynArr.Create;
    while p < pStop do
    begin
      if p^ in [#10, #13] then
        if pStart <> nil then
        begin
          ParseLine(pStart, p - 1, k);
          if HeaderOk then
          begin
            if not AStopTimesIxByTrip.TryGetValue(k[0], LStopTimesIx) then
            begin
              LStopTimesIx := TIntList.Create;
              AStopTimesIxByTrip.Add(k[0], LStopTimesIx);
            end;
            LStopTimesIx.Add(AStopTimes.Count);
            AStopTimes.Add(TStopTime.Create(k[0], k[3], k[1], k[2]));
          end
          else
          begin
            if (k[0] <> 'trip_id') or (k[3] <> 'stop_id') or
               (k[1] <> 'arrival_time') or (k[2] <> 'departure_time') then
            begin
              WriteLn('stop_times.txt not in expected format.');
              Halt(1);
            end;
            HeaderOk := True;
          end;
          pStart := nil;
        end
        else
      else
        if pStart = nil then
          pStart := p;
      Inc(p);
    end;
  finally
    ms.Free;
  end;
  LEnd := Now;
  WriteLn('parsed ', AStopTimes.Count, ' stop times in ', SecondSpan(LStart, LEnd):1:3, ' seconds');
end;
...
Libraries by themselves are by their very nature not part of it: they have nothing to do with code generation.
Modifying csvutils.TCSVDocument.LoadFromFile(const AFileName: String) to precompute the number of rows and do a single initial SetLength(FCells, lRows) seems to improve GetStopTimes by roughly 20%.
Still slower than the latest commit, which is based on TVector:
parsed 1790905 stop times in 4600.000ms
parsed 71091 trips in 180.000ms
vs:
parsed 1790905 stop times in 5266.000ms
parsed 71091 trips in 211.000ms
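The precount idea from the quote, roughly (my own sketch, not the actual csvutils patch): scan the buffer once for line ends, then allocate the row array with a single SetLength before parsing.

// Count data rows so FCells can be sized once up front.
function CountRows(Buf: PChar; Size: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 0 to Size - 1 do
    if Buf[i] = #10 then
      Inc(Result);
  // A final line without a trailing newline still counts as a row.
  if (Size > 0) and (Buf[Size - 1] <> #10) then
    Inc(Result);
end;

// In LoadFromFile, something like:
//   SetLength(FCells, CountRows(ms.Memory, ms.Size));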
Starfighters usually crashed.
https://en.wikipedia.org/wiki/Lockheed_F-104_Starfighter
Probably, lgList is not a very good name; this unit contains some special kinds of lists: sorted, hashed. You can look into the lgVector unit.
Oh, my dear eyes. How the heck did they skip "vector" while it's so clear... maybe I shouldn't code after midnight, haha.
Although TVector from fcl-stl is quite fast, it seems that this version of CSVDocument
<code intentionally skipped for brevity>
is 8-9 percent faster than your TVector-based one.
And accordingly, this version of GetStopTimes()
<code intentionally skipped for brevity>
works out in about 1.640 s.
Wow! Almost another full second drop on my system!
parsed 1790905 stop times in 3.846 seconds
parsed 71091 trips in 155.000ms
This is how the fun in code optimizing should be! Great job, avk!
Thank you. It seems possible to squeeze a little more out of TCSVDoc; at least my Windows version of this
<code intentionally skipped for brevity>
shows a 7-8 percent performance improvement.
Yep, another 200ms improvement. Getting closer to C#. I still think we can do something about the else part of that case, as it will be the most frequent match; basically we check pCell for nil on every iteration, which is not so good.
<...>
I still think we can do something about the else part of that case, as it will be the most frequent match; basically we check pCell for nil on every iteration, which is not so good.
Something like this?
Yep, but I'm surprised the improvement is so small, though consistent. Results of three runs, old code:
$ ./app
parsed 1790905 stop times in 3.389 seconds
parsed 71091 trips in 133.000ms
$ ./app
parsed 1790905 stop times in 3.396 seconds
parsed 71091 trips in 133.000ms
$ ./app
parsed 1790905 stop times in 3.386 seconds
parsed 71091 trips in 131.000ms
New code:
$ ./app
parsed 1790905 stop times in 3.373 seconds
parsed 71091 trips in 128.000ms
$ ./app
parsed 1790905 stop times in 3.391 seconds
parsed 71091 trips in 128.000ms
$ ./app
parsed 1790905 stop times in 3.378 seconds
parsed 71091 trips in 127.000ms
Nevertheless, I think this is good enough. I'm working on the HTTP server now; I found that the Fundamentals library (both version 4 and 5) also has one, but mORMot 2 might be the first I try, as mORMot (1) has always been very concerned with benchmarks and performance.
buildTripResponse: 1.324407ms
jsonify: 16.314437ms
buildTripResponse: 912.183µs
jsonify: 10.437676ms
buildTripResponse: 779.098µs
jsonify: 8.114551ms
buildTripResponse: 103.812µs
jsonify: 1.047072ms
buildTripResponse: 68.561µs
jsonify: 953.276µs
and the Pascal version:
BuildTripResponse: 1ms
JSONify: 36ms
BuildTripResponse: 4ms
JSONify: 90ms
BuildTripResponse: 3ms
JSONify: 73ms
BuildTripResponse: 7ms
JSONify: 176ms
BuildTripResponse: 23ms
JSONify: 581ms
or maybe in table format:
+------------+----------------------+-----------------------+-----------------+----------------+
| request no | Go buildTripResponse | FPC BuildTripResponse | Go json.Marshal | FPC FormatJSON |
+------------+----------------------+-----------------------+-----------------+----------------+
| 1 | 1.32ms | 1ms | 16.31ms | 36ms |
| 2 | 0.91ms | 4ms | 10.44ms | 90ms |
| 3 | 0.78ms | 3ms | 8.11ms | 73ms |
| 4 | 0.10ms | 7ms | 1.05ms | 176ms |
| 5 | 0.07ms | 23ms | 0.95ms | 581ms |
The JSON-ification is clearly a bottleneck, partly due to my way of coding it: the response expects an array at the top level instead of an object, and there's no ArrayToJSON, only ObjectToJSON, hence my manual per-object method. But BuildTripResponse is not good either, losing by up to 328x (the last result).
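For context, the manual per-object method looks roughly like this with fpjson (my sketch; note that FormatJSON pretty-prints, while AsJSON emits compact output and skips the indentation work):

program JsonDemo;
{$mode objfpc}{$H+}
uses fpjson;
var
  Arr: TJSONArray;
  S: TJSONStringType;
begin
  Arr := TJSONArray.Create;
  try
    // The array takes ownership of the object added to it.
    Arr.Add(TJSONObject.Create(['trip_id', 'T1', 'route_id', 'R1']));
    S := Arr.AsJSON;  // compact; cheaper than the pretty FormatJSON
    WriteLn(S);
  finally
    Arr.Free;
  end;
end.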
You could share your code on a gist (or somewhere else) so that we could look at it and optimize it to match mORMot's best performance needs.
I've pushed it to the repo, with all the elapsed-time counters. Feel free to dig in and find things to optimize. BTW, my test still shows that mORMot's HTTP server bandwidth is not that much higher than fphttpserver's, rather similar in fact. So maybe you can also take a look at that.
*Edit*: there are .o and .ppu files included in the repo. :(
Dang, they must have been accidentally added when committing.
In fact, there are several HTTP servers.
THttpAsyncServer is meant for thousands of concurrent kept-alive clients.
THttpServer is socket-based, and use one thread per kept-alive client.
THttpApiServer is Windows only, and use the http.sys API.
For 50 concurrent clients, on Windows, THttpApiServer is likely to be the fastest, and perhaps THttpServer would also have good numbers.
As the load test does concurrent attacks, I chose the first one (THttpAsyncServer).
From what I can see, the main bottleneck is not the server but the trip/route processing: a lot of allocations, and I am not sure that using generics is the fastest path.
I will try to reimplement it in a mORMot way.
Please do, but please also compare with the original Go code from which I converted.
ab@ab:~/dev/github/mORMot2$ wrk -c 50 -d 15s -t 4 http://localhost:4000/schedules/354
Running 15s test @ http://localhost:4000/schedules/354
4 threads and 50 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 16.54ms 4.18ms 53.47ms 91.70%
Req/Sec 728.48 62.13 840.00 72.00%
43545 requests in 15.02s, 8.96GB read
Requests/sec: 2898.99
Transfer/sec: 611.09MB
ab@ab:~/dev/github/mORMot2$ wrk -c 50 -d 15s -t 4 http://localhost:4000/schedules/7777777
Running 15s test @ http://localhost:4000/schedules/7777777
4 threads and 50 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 323.12us 396.82us 20.32ms 97.96%
Req/Sec 34.93k 1.70k 41.89k 72.83%
2086314 requests in 15.01s, 240.75MB read
Requests/sec: 138980.30
Transfer/sec: 16.04MB
Please check https://github.com/synopse/mORMot2/tree/master/ex/lang-cmp
This is the "mORMot way" of implementing the language comparison server.
parsed 1790905 stop times in 972.85ms
parsed 71091 trips in 45.05ms
- It uses mORMot RTTI to parse the CSV into records and to generate JSON from the results.
I see you just added the CSV loading functionality, and boy, mORMot's RTTI is surely powerful.
- We did not use a map/dictionary, but TDynArray's feature of binary searching sorted data.
Lots of pointers, but I guess that's what makes it fast, compared to the convenience of generics.
- By default, the CSV parser will "intern" all values to de-duplicate strings: it is slightly slower (around 30%, I guess), but it reduces memory a lot and also speeds up comparisons of identical strings (they share the same pointer, so there is no need to compare the characters).
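The interning idea in a generic form (my own sketch, not mORMot's implementation): keep one canonical copy of each distinct value, so equal strings share a heap instance and can be compared by pointer.

program InternDemo;
{$mode objfpc}{$H+}
uses SysUtils, Generics.Collections;
type
  TPool = specialize TDictionary<string, string>;
var
  Pool: TPool;

// Return the canonical instance of S, registering it on first sight.
function Intern(const S: string): string;
begin
  if not Pool.TryGetValue(S, Result) then
  begin
    Pool.Add(S, S);
    Result := S;
  end;
end;

var
  A, B: string;
begin
  Pool := TPool.Create;
  try
    A := Intern('trip_' + IntToStr(42));
    B := Intern('trip_' + IntToStr(42));
    // Both now point at the same payload: equality is a pointer check.
    WriteLn(Pointer(A) = Pointer(B));  // TRUE
  finally
    Pool.Free;
  end;
end.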
From what I can see, performance should be much better than the current Pascal version.
One amazing difference from other implementations is the memory consumption. A running server consumes less than 70MB of RAM on my PC with all the data loaded, thanks to interning: it is 380MB otherwise, which is much lower than the alternatives anyway.
Please compare with Go and Rust on your machine.
Done:
running (0m34.7s), 00/50 VUs, 296 complete and 0 interrupted | running (0m31.8s), 00/50 VUs, 308 complete and 0 interrupted
default ✓ [======================================] 50 VUs 30 default ✓ [======================================] 50 VUs 30
data_received..................: 27 GB 763 MB/s | data_received..................: 30 GB 948 MB/s
data_sent......................: 2.8 MB 79 kB/s | data_sent......................: 2.9 MB 90 kB/s
http_req_blocked...............: avg=7.46µs min=1.09µs | http_req_blocked...............: avg=7.78µs min=1.04µs
http_req_connecting............: avg=1.17µs min=0s | http_req_connecting............: avg=2.71µs min=0s
http_req_duration..............: avg=56.71ms min=73.52µs | http_req_duration..............: avg=50.84ms min=143.5µs
{ expected_response:true }...: avg=56.71ms min=73.52µs | { expected_response:true }...: avg=50.84ms min=143.5µs
http_req_failed................: 0.00% ✓ 0 ✗ 2 | http_req_failed................: 0.00% ✓ 0 ✗ 3
http_req_receiving.............: avg=24.56ms min=16.13µs | http_req_receiving.............: avg=7.9ms min=15.58µs
http_req_sending...............: avg=23.58µs min=5.41µs | http_req_sending...............: avg=54.32µs min=5.56µs
http_req_tls_handshaking.......: avg=0s min=0s | http_req_tls_handshaking.......: avg=0s min=0s
http_req_waiting...............: avg=32.12ms min=45.78µs | http_req_waiting...............: avg=42.88ms min=98.37µs
http_reqs......................: 29304 844.805816/s | http_reqs......................: 30492 958.247358/s
iteration_duration.............: avg=5.62s min=4.38s | iteration_duration.............: avg=5.04s min=2s
iterations.....................: 296 8.533392/s | iterations.....................: 308 9.679266/s
vus............................: 13 min=13 max | vus............................: 30 min=30 max
vus_max........................: 50 min=50 max vus_max........................: 50 min=50 max
The Go version still runs a little bit faster, but not by much. I still attribute this to their higher HTTP server download bandwidth for some reason.
ab@ab:~/dev/github/mORMot2$ wrk -c 50 -d 15s -t 4 http://localhost:4000/schedules/354
Running 15s test @ http://localhost:4000/schedules/354
4 threads and 50 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 10.97ms 2.73ms 38.58ms 92.03%
Req/Sec 1.10k 70.60 1.30k 70.83%
65615 requests in 15.01s, 13.51GB read
Requests/sec: 4371.23
Transfer/sec: 0.90GB
You are right: some pointers, but they are typed pointers, so it is a somewhat safe approach.
Using array indexes with no runtime range checking is as unsafe as inc(typedpointer).
Nice!
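A sketch of that equivalence (my example, not from the thread): with range checking off, the indexed loop and the typed-pointer loop are both unchecked memory walks, so neither is safer than the other.

{$mode objfpc}
type
  TIntArray = array of Integer;

function SumIndexed(const A: TIntArray): Int64;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 0 to High(A) do
    Result := Result + A[i];  // unchecked when built without -Cr
end;

function SumPointer(const A: TIntArray): Int64;
var
  p, pEnd: PInteger;
begin
  Result := 0;
  if Length(A) = 0 then Exit;
  p := @A[0];
  pEnd := p;
  Inc(pEnd, Length(A));
  while p < pEnd do
  begin
    Result := Result + p^;  // Inc(typedpointer): equally unchecked
    Inc(p);
  end;
end;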
I have just committed a huge LangCmp sample speed-up:
https://github.com/synopse/mORMot2/commit/cfdcb7f2
- we use PUtf8Char for JSON responses
- it is fair, since Rust is using &'data str
- string refcounting has a price, even if it does not allocate memory: it requires a thread-safe locked increment, which is somewhat slow, and even slower on ARM
- also made the sample Windows compatible
- should be faster than Go now - please try !
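To illustrate the refcounting cost (my sketch, not the actual commit): every by-value string assignment performs an interlocked increment on the payload's reference count, which a const parameter or a raw PChar/PUtf8Char view avoids.

{$mode objfpc}{$H+}

procedure ByValue(S: string);           // copying into S does a LOCKed
begin                                   // refcount increment per call
end;

procedure ByConstRef(const S: string);  // const: no refcount traffic
begin
end;

procedure RawView(P: PChar; Len: SizeInt);  // pointer + length: no string
begin                                       // management at all
end;

var
  Line: string;
begin
  Line := 'trip_id,arrival_time,departure_time,stop_id';
  ByValue(Line);                        // atomic inc + dec
  ByConstRef(Line);                     // free of that cost
  RawView(PChar(Line), Length(Line));   // and can point into a buffer slice
end.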
parsed 1790905 stop times in 968.43ms | parsed 1790905 stop times in 3.245251432s
parsed 71091 trips in 39.54ms | parsed 71091 trips in 85.747852ms
running (0m33.4s), 00/50 VUs, 348 complete and 0 interrupted | running (0m32.3s), 00/50 VUs, 320 complete and 0 interrupted
default ✓ [======================================] 50 VUs 30 default ✓ [======================================] 50 VUs 30
data_received..................: 31 GB 933 MB/s | data_received..................: 31 GB 971 MB/s
data_sent......................: 3.2 MB 97 kB/s | data_sent......................: 3.0 MB 92 kB/s
http_req_blocked...............: avg=9µs min=1.09µs | http_req_blocked...............: avg=6.77µs min=1.09µs
http_req_connecting............: avg=2.95µs min=0s | http_req_connecting............: avg=1.73µs min=0s
http_req_duration..............: avg=47.59ms min=97.28µs | http_req_duration..............: avg=49.02ms min=123.81µ
{ expected_response:true }...: avg=47.59ms min=97.28µs | { expected_response:true }...: avg=49.02ms min=123.81µ
http_req_failed................: 0.00% ✓ 0 ✗ | http_req_failed................: 0.00% ✓ 0 ✗ 3
http_req_receiving.............: avg=9.66ms min=15.35µs | http_req_receiving.............: avg=5.92ms min=14.76µs
http_req_sending...............: avg=87.24µs min=5.2µs | http_req_sending...............: avg=70.71µs min=5.2µs
http_req_tls_handshaking.......: avg=0s min=0s | http_req_tls_handshaking.......: avg=0s min=0s
http_req_waiting...............: avg=37.83ms min=54.74µs | http_req_waiting...............: avg=43.02ms min=91.84µs
http_reqs......................: 34452 1032.205528/s | http_reqs......................: 31680 981.949476/s
iteration_duration.............: avg=4.72s min=3.54s | iteration_duration.............: avg=4.86s min=2.19s
iterations.....................: 348 10.426318/s | iterations.....................: 320 9.918682/s
vus............................: 30 min=30 ma | vus............................: 15 min=15 max
vus_max........................: 50 min=50 ma | vus_max........................: 50 min=50 max
To be included in the main repository; performance may be less obvious on the Mac M1 / AARCH64 architecture.
Yes, I don't have the machine either, so I can't test it myself.
The mORMot HTTP server is less optimized on Mac (it uses the good old poll API there), and there is no optimized asm involved.
But I have seen string interning to be of good benefit on AARCH64, because it avoids most memory allocations.
My problem is that I don't know how to propose a simple way of building the project with no FPC/Lazarus knowledge. ;)
My local build uses -Fu<mormot2>/src/* and -Fi<mormot2>/src as well as -FU<somewhere else>; this way the compiled units end up in a single directory instead of alongside their source code.
I wonder about the memory consumption of the server on your machine, e.g. in comparison with Go and its GC.
After it finishes loading the CSV, it eats only 80MB, heck, so little. Sounds a bit magical. But during the load test it fluctuates between 250-350MB, and it returns to 80MB at the end. I kinda sense GC-like behavior here, or just a proper memory deallocation scheme.
Free Pascal has always measured well on memory consumption. That is nothing new.
@FredvS
I never tested fpc-llvm to be fair.
I usually use fpcupdeluxe to setup my environment, and llvm is not supported by it, IIRC.
The original FreePascal version by Leledumbo consumed much more memory.
True. A good example of how the convenience of object orientation, reference-counted strings and generics can massively increase memory consumption compared to plain dynamic arrays (with clever growth), PChars and manual memory (de)allocation.
wrk -c 50 -d 15s -t 4 http://localhost:4000/schedules/354
Then try with -c 500 and perhaps -c 5000 (with a proper ulimit set if needed), to see how they scale with a high number of connections.
@Leledumbo
Perhaps you may add my last blog article in the ReadMe:
https://blog.synopse.info/?post/2022/11/26/Modern-Pascal-is-Still-in-the-Race
OK
Numbers on my laptop, using whatever versions of compilers I happened to have installed, not the latest but not very outdated either, with no attempt to 'control the environment':
Rust ran the fastest.
LLVM magic is hard to beat, indeed.
As for Leledumbo's version, fphttpapp is the bottleneck.
Acknowledged, but I can't figure out how to make it faster. Michael patched it heavily with exception handling to solve stability issues way back then; I suppose it will need a proper rewrite if performance has become a target.
<...>
It looks like the problem is not just with fphttpapp?
No, you are right, I don't think the web server is the issue with Leledumbo's version.
<...>
Switching to the mORMot web server did not make the code much better, whereas the same mORMot web server with mORMot JSON serialization is way faster: as fast as Go, i.e. reaching 1GB/s.
<...>
parsed 1739279 stop times in 0.611 seconds | parsed 1739278 stop times in 748.66ms
parsed 69754 trips in 0.042 seconds | parsed 69753 trips in 36.05ms
|
running (0m42.2s), 00/50 VUs, 100 complete and 0 inter| running (0m35.6s), 00/50 VUs, 100 complete and 0 interra
default ✓ [ 100% ] 50 VUs 30s default ✓ [ 100% ] 50 VUs 30ss
|
data_received..................: 8.5 GB 201 MB/s | data_received..................: 8.5 GB 239 MB/s
data_sent......................: 1.3 MB 30 kB/s | data_sent......................: 930 kB 26 kB/s
http_req_blocked...............: avg=2.75ms min=0s | http_req_blocked...............: avg=107.98µs min=0s
http_req_connecting............: avg=1.67ms min=0s | http_req_connecting............: avg=37.17µs min=0s
http_req_duration..............: avg=206.97ms min=0s | http_req_duration..............: avg=171.51ms min=0s
{ expected_response:true }...: avg=206.97ms min=0s | { expected_response:true }...: avg=171.51ms min=0s
http_req_failed................: 0.00% ✓ 0 http_req_failed................: 0.00% ✓ 0
http_req_receiving.............: avg=8.23ms min=0s | http_req_receiving.............: avg=123.33ms min=0s
http_req_sending...............: avg=1.35ms min=0s | http_req_sending...............: avg=192.63µs min=0s
http_req_tls_handshaking.......: avg=0s min=0s | http_req_tls_handshaking.......: avg=0s min=0s
http_req_waiting...............: avg=197.38ms min=0s | http_req_waiting...............: avg=47.99ms min=0s
http_reqs......................: 9900 234.367143/s | http_reqs......................: 9900 278.394605/s
iteration_duration.............: avg=20.72s min=18.| iteration_duration.............: avg=17.04s min=10.44s
iterations.....................: 100 2.367345/s | iterations.....................: 100 2.812067/s
vus............................: 16 min=16 | vus............................: 12 min=12
vus_max........................: 50 min=50 | vus_max........................: 50 min=50
Results; the right column shows the results of the application named alt in Leledumbo's repository:
I think in your solution fphttpapp is now the bottleneck (look at http_req_blocked, http_req_connecting, http_req_sending and http_req_waiting; interestingly, http_req_receiving is the other way around, but that's the only metric that's significantly better). I know mine isn't the bottleneck yet; the too-literal translation from Go, with the random inefficient replacements I made, was the actual bottleneck.
<...>
BTW, would you want to incorporate this solution into my repo?
I also noticed that you are using PRow data, not high-level classes/records to store the data.
Another trick which is not fair with respect to the language comparison goal.
What I don't understand is why the "sent" data is lower on the left. It should be the same for both sides.
Did you verify the JSON content?
Leledumbo
I just built FPC LLVM from source with LLVM 14. It doesn't compile the lgenerics library:
% fpc -Clv14.0 -Fulgenerics app.pas
Free Pascal Compiler version 3.3.1 [2022/12/10] for x86_64
Copyright (c) 1993-2022 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling app.pas
Compiling ./lgenerics/lgutils.pas
Compiling ./lgenerics/lgstrconst.pas
Writing Resource String Table file: lgstrconst.rsj
Assembling lgstrconst
lgutils.pas(3321,26) Warning: Function result variable of a managed type does not seem to be initialized
Assembling lgutils
Compiling ./lgenerics/lghashmap.pas
Compiling ./lgenerics/lghelpers.pas
Compiling ./lgenerics/lghash.pas
lghash.pas(442,35) Fatal: Internal error 200611011
Fatal: Compilation aborted
Error: /home/pierce/pkg/fpcllvm/lib/fpc/3.3.1/ppcx64 returned an error exitcode
This is the second program I'm building with FPC LLVM. The first one, hello world :P, builds and runs.
Doesn't it bother you that manual serialization in this case looks like obvious cheating?
Not really. After all, the goal of this language comparison is to compare possible solutions in each language, so if the language can do it, it's not cheating. At least that's my interpretation.