Recent

Author Topic: LGenerics: yet another generics collection  (Read 37351 times)

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Yet another generics collection
« Reply #75 on: January 05, 2021, 05:25:13 pm »
It's a good question anyway. I do my best to avoid adding nodes directly, as this is error-prone. However, double parsing isn't a very attractive solution either.
I just hastily added a new ArrayAppend method, maybe that's what you need?

lainz

  • Hero Member
  • *****
  • Posts: 4449
    • https://lainz.github.io/
Re: Yet another generics collection
« Reply #76 on: January 05, 2021, 09:12:42 pm »
Yes that works.

Anyways I get a slower time this time with the change. But I'm not measuring it well, so I can't say, I'm measuring a whole process that depends on internet, so speed can vary and is not only by the json library, but the server speed, internet speed and other things.
« Last Edit: January 05, 2021, 09:22:43 pm by lainz »

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Yet another generics collection
« Reply #77 on: January 18, 2021, 03:10:59 pm »
Added pull-style JSON stream reader.

It provides, at the user's choice
  • read all tokens from the stream one by one
  • skip some structures
  • copy selected structures
  • iterate over all items of the current structure
  • search for an item in the current structure by a given key
  • search for an item by a specified path

As for performance, query
Code: Pascal  [Select][+][-]
  1.   Reader.FindPath('/features/100001')
  2.  
on this JSON(~190MB) takes about 0.7 s. on my machine.

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Yet another generics collection
« Reply #78 on: February 22, 2021, 06:59:47 am »
Added implementation of two string matching algorithms (unit lgStrHelpers), both are inheritors of the Boyer-Moore algorithm with minor improvements or simplifications.
The TBmSearch entity implements a Boyer-Moore incarnation called Fast-Search, and TBmhrSearch implements the Boyer-Moore-Horspool-Raita algorithm(sometimes just Raita). TBmSearchCI is a case-insensitive version of TBmSearch.

A simple comparative benchmark on various inputs; BuildinBM denotes FindMatchesBoyerMooreCaseSensitive from StrUtils.

Random string(200000000 bytes) over a five-character alphabet, search for 10 patterns(running time in ms, lower is better).
Code: Text  [Select][+][-]
  1.  pattern len       TBmSearch    TBmhrSearch  BuildinBM
  2.      5               3157          3500        4017
  3.     10               2235          2487        2814
  4.     20               2078          2687        2627
  5.     50               1455          2078        1812
  6.    100               1437          2502        1767
  7.    200               1375          2422        1749
  8.  

Natural language text(English, "The Forsyte Saga", 2079763 bytes), search for 1000 patterns(running time in ms, lower is better).
Code: Text  [Select][+][-]
  1.  pattern len       TBmSearch    TBmhrSearch  BuildinBM
  2.      5               1969          2266        2799
  3.     10               1125          1297        1593
  4.     20                704           811         984
  5.     50                470           516         609
  6.    100                376           405         468
  7.    200                312           327         374
  8.  

Search for a single pattern in a list(200000 items) of moderate-length strings(5000-15000 bytes)(running time in ms, lower is better).
Code: Text  [Select][+][-]
  1.  pattern len       TBmSearch    TBmhrSearch  BuildinBM
  2.      5               2653          3311        3343
  3.     10               1016          1171        1530
  4.     20                842           875        1169
  5.     50                358           375         558
  6.    100                374           419         672
  7.    200                296           314         640
  8.  

Case insensitive search, Pascal sources(83996301 bytes), search for 25 patterns, BuildinBM_CI here denotes FindMatchesBoyerMooreCaseInSensitive(running time in ms, lower is better).
Code: Text  [Select][+][-]
  1.  pattern len       TBmSearchCI    BuildinBM_CI
  2.      5               2565            3250
  3.     10               1488            1744
  4.     20                889            1042
  5.     50                618             709
  6.    100                444             517
  7.    200                390             442
  8.  
« Last Edit: February 22, 2021, 08:17:46 am by avk »

440bx

  • Hero Member
  • *****
  • Posts: 3921
Re: Yet another generics collection
« Reply #79 on: February 22, 2021, 07:36:35 am »
The TBmSearch entity implements a Boyer-Moore incarnation called Fast-Search, and TBmhrSearch implements the Boyer-Moore-Horspool-Raita algorithm(sometimes just Raita). TBmSearchCI is a case-insensitive version of TBmSearch.
Nicely done!.
(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Yet another generics collection
« Reply #80 on: February 22, 2021, 07:52:21 am »
Thank you.

Okoba

  • Hero Member
  • *****
  • Posts: 528
Re: Yet another generics collection
« Reply #81 on: February 22, 2021, 02:05:57 pm »
Well done avk.

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Yet another generics collection
« Reply #82 on: August 02, 2021, 11:50:55 am »
Has rerun the JSON parsing benchmark(in attachment) due to the recent update of JsonTools, this time without Delphi.
Also below are the results of running parser tests for compliance with the current JSON standard. Only tests from the "SHOULD_BE_ACCEPTED" and "SHOULD_BE_REJECTED" sections were run.
Code: Text  [Select][+][-]
  1. lgJson:
  2.   total:  283
  3.   failed: 0
  4.  
  5. JsonTools:
  6.   total:  283
  7.   failed: 22
  8.  
  9. FpJson:
  10.   crash
  11.  
« Last Edit: August 02, 2021, 12:09:15 pm by avk »

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Yet another generics collection
« Reply #83 on: October 15, 2021, 01:22:56 pm »
lgJson now uses the Eisel-Lemire approximation algorithm for string-to-double conversion, and the Ryū algorithm for the reverse conversion.
The former is used exclusively as an internal function, but the Ryū implementation(Double2Str) has been made public and, accordingly, its performance can be compared with other double-to-string converters.
The selected subjects are divided into two groups: the first does not make an explicit memory allocation, and the second does.
The first group includes DoubleToShort from Mormot2, the built-in FloatToText, Double2Str(procedure) from lgJson and the built-in Str(shortstring).
The second includes DoubleToStr from Mormot2, the built-in FloatToStr, RyuDoubleToString from PasDblStrUtils, Double2Str(function) from lgJson and the built-in Str(string).
The benchmark generates an array of 10000000 random doubles and measures the time(ms) it takes to process the entire array(5 attempts, the best time is taken as the result).

Results:
Code: Text  [Select][+][-]
  1. 1              x86_64  x86
  2. ---------------------------
  3.   DoubleToShort 1216  2106
  4.   Str           1918  4399
  5.   FloatToText   4321  7191
  6.   Double2Str     858  2761
  7. ---------------------------
  8.  
  9. 2                  x86_64  x86
  10. -------------------------------
  11.   DoubleToStr       2480  3432
  12.   Str               2652  5163
  13.   FloatToStr        3588  6349
  14.   RyuDoubleToString 2589  7269
  15.   Double2Str        1482  3322
  16. -------------------------------
  17.  

It seems that parsing in lgJson has become noticeably faster, so it makes sense to run the parser benchmark again.

The list of samples has changed a bit and a parser from InternetTools (XQJsonParser) has been added. InternetTools was compiled with USE_PASDBLSTRUTILS define. The rest of the test subjects remained the same. The results on a win-64 machine can be seen in the attached bar chart.

Edit.
Sorry, incorrect data for the second group was provided. Fixed now.
« Last Edit: October 19, 2021, 08:10:17 pm by avk »

Okoba

  • Hero Member
  • *****
  • Posts: 528
Re: Yet another generics collection
« Reply #84 on: October 17, 2021, 11:40:57 am »
Great as always, avk. It is very nice that you did implement them in Pascal, it is worth mentioning them in the LGenerics readme, and maybe tell the source ones to add yours in their readme too.
Did you compare your implementation with the source ones? It seems like an interesting thing to find out the output code for two compilers.
May I ask why you did not public the Eisel-Lemire procedure?
It seems Ryu has a String to Double method too, do you have plans to test it too?

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Yet another generics collection
« Reply #85 on: October 17, 2021, 03:25:43 pm »
Thank you.
Actually, ports of these algorithms to Pascal have already existed, at least here(don't know how long ago), so my port is definitely not the first. Perhaps it really makes sense to mention them in the readme, despite the fact that they are tied to JSON.

...
Did you compare your implementation with the source ones? It seems like an interesting thing to find out the output code for two compilers.
...
Yes, you're right, that would be interesting, maybe a little later...

...
May I ask why you did not public the Eisel-Lemire procedure?
...
To be honest, I didn't expect it to be interesting to anyone else. It is used to convert a string that has already been checked by the main parser, therefore it uses a weakened algorithm for parsing. Nothing, of course, prevents from making a normal parser for it, I will try and report the results.

...
It seems Ryu has a String to Double method too, do you have plans to test it too?
No, I don't plan to, but it seems that PasDblStrUtils has an implementation of it.
« Last Edit: October 17, 2021, 03:33:37 pm by avk »

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Yet another generics collection
« Reply #86 on: October 22, 2021, 02:01:10 pm »
lgJson now has a public TryStr2Double() function that uses the Eisel-Lemire algorithm.
Below are the results of a quick and dirty benchmark involving Val(), ConvertStringToDouble() from PasDblStrUtils and fast_double_parser::parse_number().
Two input files were used, the first contains random uniformly distributed(well, almost, there are no nans and infinities) numbers, and the second contains numbers extracted from citylots.json. Each file was looped four times in five attempts, the best time (ms) was taken as the result.
fast_double_parser was compiled with gcc 8.1.0 with options -O3 -s -static -m64.
Code: Text  [Select][+][-]
  1. uniform.txt
  2. read 5000000 lines
  3. total size 109829099 bytes
  4.  
  5.                       x86_64      x86
  6. ------------------------------------------------
  7. Val                    3310      6692
  8. ConvertStringToDouble  1350      2886
  9. TryStr2Double           936      2152
  10. parse_number            760       nt*
  11.  
  12.  
  13.  
  14. citylots.txt
  15. read 5250126 lines
  16. total size 99166601 bytes
  17.  
  18.                       x86_64      x86
  19. ------------------------------------------------
  20. Val                    2558      5818
  21. ConvertStringToDouble  1060      2512
  22. TryStr2Double           639      1825
  23. parse_number            360       nt*
  24.                                      
  25.  
* not tested.

Also compiled the original Ryū version and added its results to the table:
Code: Text  [Select][+][-]
  1. 1              x86_64  x86
  2. ---------------------------
  3.   DoubleToShort 1216  2106
  4.   Str           1918  4399
  5.   FloatToText   4321  7191
  6.   Double2Str     858  2761
  7.   d2s_buffered   514   nt*
  8. ---------------------------
  9.  
  10. 2                  x86_64  x86
  11. -------------------------------
  12.   DoubleToStr       2480  3432
  13.   Str               2652  5163
  14.   FloatToStr        3588  6349
  15.   RyuDoubleToString 2589  7269
  16.   Double2Str        1482  3322
  17.   d2s               1253   nt*
  18. -------------------------------
  19.  
« Last Edit: October 22, 2021, 02:03:30 pm by avk »

AlexTP

  • Hero Member
  • *****
  • Posts: 2365
    • UVviewsoft
Re: Yet another generics collection
« Reply #87 on: October 22, 2021, 03:03:18 pm »
As I see, you created the new StrToFloat function, which is very fast. Maybe incorporate it into FPC then?

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Yet another generics collection
« Reply #88 on: October 22, 2021, 03:50:33 pm »
I know that some devs read this forum, maybe it will interest them.

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Yet another generics collection
« Reply #89 on: March 09, 2022, 11:00:40 am »
Hi!
Some news from LGenerics:
recently, a number of functions have been added that implement some algorithms on strings and sequences, regarding metrics and fuzzy matching, the list is annotated in the README of the project. The example/fuzz_bench folder contains several benchmarks for these functions.

It was curious to compare the performance of the SimRatioLevEx() function(which is inspired by FuzzyWuzzy) with some incarnations of the FuzzyWuzzy(listed here) on my benchmark datasets. Disclamer: SimRatioLevEx() does not reproduce FuzzyWuzzy, but it does some things in a similar way, in particular, SimRatioLevEx() in smTokenSetEx mode and token_set_ratio() do roughly the same job.

It seems the C++ version only supports single-byte strings, so only compared to the single-byte version of SimRatioLevEx():
Code: Text  [Select][+][-]
  1.    Dataset            SimRatioLevEx() token_set_ratio()
  2.  
  3. Short/big_dist             1154              6440
  4. Short/small_dist            967              3020
  5. Medium/big_dist             811              3450
  6. Medium/small_dist           702              1470
  7. Long/big_dist              1966             15000
  8. Long/small_dist            1061              2250
  9.  
The numbers indicate the run time in milliseconds; the C++ version was compiled with gcc-8.1.0 with options -O3 -m64; the Pascal version was compiled with FPC-3.3.1-9941-g8e6e9bbf33, -O3. The benchmark was run on a win-64 machine.

The Go version, on the contrary, always works with Unicode strings:
Code: Text  [Select][+][-]
  1.    Dataset          SimRatioLevExUtf8() TokenSetRatio()
  2.  
  3. Short/big_dist             2143             18705
  4. Short/small_dist           1593              2224
  5. Medium/big_dist            1266             15062
  6. Medium/small_dist           853              1742
  7. Long/big_dist              3853             79851
  8. Long/small_dist            1269              3126
  9.  

go version: go1.10.4 linux/amd64; FPC-3.3.1-10683-g2a19e152b7 -O3. The benchmark was run on a virtual Linux machine.

Happy coding!

 

TinyPortal © 2005-2018