Recent

Author Topic: Yet another generics collection  (Read 23889 times)

avk

  • Sr. Member
  • ****
  • Posts: 490
    • my self-education project
Re: Yet another generics collection
« Reply #75 on: January 05, 2021, 05:25:13 pm »
It's a good question anyway. I do my best to avoid adding nodes directly, as this is error-prone. However, double parsing isn't a very attractive solution either.
I just hastily added a new ArrayAppend method, maybe that's what you need?

lainz

  • Hero Member
  • *****
  • Posts: 3848
Re: Yet another generics collection
« Reply #76 on: January 05, 2021, 09:12:42 pm »
Yes that works.

Anyways I get a slower time this time with the change. But I'm not measuring it well, so I can't say, I'm measuring a whole process that depends on internet, so speed can vary and is not only by the json library, but the server speed, internet speed and other things.
« Last Edit: January 05, 2021, 09:22:43 pm by lainz »

avk

  • Sr. Member
  • ****
  • Posts: 490
    • my self-education project
Re: Yet another generics collection
« Reply #77 on: January 18, 2021, 03:10:59 pm »
Added pull-style JSON stream reader.

It provides, at the user's choice
  • read all tokens from the stream one by one
  • skip some structures
  • copy selected structures
  • iterate over all items of the current structure
  • search for an item in the current structure by a given key
  • search for an item by a specified path

As for performance, query
Code: Pascal  [Select][+][-]
  1.   Reader.FindPath('/features/100001')
  2.  
on this JSON(~190MB) takes about 0.7 s. on my machine.

avk

  • Sr. Member
  • ****
  • Posts: 490
    • my self-education project
Re: Yet another generics collection
« Reply #78 on: February 22, 2021, 06:59:47 am »
Added implementation of two string matching algorithms (unit lgStrHelpers), both are inheritors of the Boyer-Moore algorithm with minor improvements or simplifications.
The TBmSearch entity implements a Boyer-Moore incarnation called Fast-Search, and TBmhrSearch implements the Boyer-Moore-Horspool-Raita algorithm(sometimes just Raita). TBmSearchCI is a case-insensitive version of TBmSearch.

A simple comparative benchmark on various inputs; BuildinBM denotes FindMatchesBoyerMooreCaseSensitive from StrUtils.

Random string(200000000 bytes) over a five-character alphabet, search for 10 patterns(running time in ms, lower is better).
Code: Text  [Select][+][-]
  1.  pattern len       TBmSearch    TBmhrSearch  BuildinBM
  2.      5               3157          3500        4017
  3.     10               2235          2487        2814
  4.     20               2078          2687        2627
  5.     50               1455          2078        1812
  6.    100               1437          2502        1767
  7.    200               1375          2422        1749
  8.  

Natural language text(English, "The Forsyte Saga", 2079763 bytes), search for 1000 patterns(running time in ms, lower is better).
Code: Text  [Select][+][-]
  1.  pattern len       TBmSearch    TBmhrSearch  BuildinBM
  2.      5               1969          2266        2799
  3.     10               1125          1297        1593
  4.     20                704           811         984
  5.     50                470           516         609
  6.    100                376           405         468
  7.    200                312           327         374
  8.  

Search for a single pattern in a list(200000 items) of moderate-length strings(5000-15000 bytes)(running time in ms, lower is better).
Code: Text  [Select][+][-]
  1.  pattern len       TBmSearch    TBmhrSearch  BuildinBM
  2.      5               2653          3311        3343
  3.     10               1016          1171        1530
  4.     20                842           875        1169
  5.     50                358           375         558
  6.    100                374           419         672
  7.    200                296           314         640
  8.  

Case insensitive search, Pascal sources(83996301 bytes), search for 25 patterns, BuildinBM_CI here denotes FindMatchesBoyerMooreCaseInSensitive(running time in ms, lower is better).
Code: Text  [Select][+][-]
  1.  pattern len       TBmSearchCI    BuildinBM_CI
  2.      5               2565            3250
  3.     10               1488            1744
  4.     20                889            1042
  5.     50                618             709
  6.    100                444             517
  7.    200                390             442
  8.  
« Last Edit: February 22, 2021, 08:17:46 am by avk »

440bx

  • Hero Member
  • *****
  • Posts: 2501
Re: Yet another generics collection
« Reply #79 on: February 22, 2021, 07:36:35 am »
The TBmSearch entity implements a Boyer-Moore incarnation called Fast-Search, and TBmhrSearch implements the Boyer-Moore-Horspool-Raita algorithm(sometimes just Raita). TBmSearchCI is a case-insensitive version of TBmSearch.
Nicely done!.
FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

avk

  • Sr. Member
  • ****
  • Posts: 490
    • my self-education project
Re: Yet another generics collection
« Reply #80 on: February 22, 2021, 07:52:21 am »
Thank you.

Okoba

  • Full Member
  • ***
  • Posts: 249
Re: Yet another generics collection
« Reply #81 on: February 22, 2021, 02:05:57 pm »
Well done avk.

avk

  • Sr. Member
  • ****
  • Posts: 490
    • my self-education project
Re: Yet another generics collection
« Reply #82 on: August 02, 2021, 11:50:55 am »
Has rerun the JSON parsing benchmark(in attachment) due to the recent update of JsonTools, this time without Delphi.
Also below are the results of running parser tests for compliance with the current JSON standard. Only tests from the "SHOULD_BE_ACCEPTED" and "SHOULD_BE_REJECTED" sections were run.
Code: Text  [Select][+][-]
  1. lgJson:
  2.   total:  283
  3.   failed: 0
  4.  
  5. JsonTools:
  6.   total:  283
  7.   failed: 22
  8.  
  9. FpJson:
  10.   crash
  11.  
« Last Edit: August 02, 2021, 12:09:15 pm by avk »

avk

  • Sr. Member
  • ****
  • Posts: 490
    • my self-education project
Re: Yet another generics collection
« Reply #83 on: October 15, 2021, 01:22:56 pm »
lgJson now uses the Eisel-Lemire approximation algorithm for string-to-double conversion, and the Ryū algorithm for the reverse conversion.
The former is used exclusively as an internal function, but the Ryū implementation(Double2Str) has been made public and, accordingly, its performance can be compared with other double-to-string converters.
The selected subjects are divided into two groups: the first does not make an explicit memory allocation, and the second does.
The first group includes DoubleToShort from Mormot2, the built-in FloatToText, Double2Str(procedure) from lgJson and the built-in Str(shortstring).
The second includes DoubleToStr from Mormot2, the built-in FloatToStr, RyuDoubleToString from PasDblStrUtils, Double2Str(function) from lgJson and the built-in Str(string).
The benchmark generates an array of 10000000 random doubles and measures the time(ms) it takes to process the entire array(5 attempts, the best time is taken as the result).

Results:
Code: Text  [Select][+][-]
  1. 1              x86_64  x86
  2. ---------------------------
  3.   DoubleToShort 1216  2106
  4.   Str           1918  4399
  5.   FloatToText   4321  7191
  6.   Double2Str     858  2761
  7. ---------------------------
  8.  
  9. 2                  x86_64  x86
  10. -------------------------------
  11.   DoubleToStr       1903  2745
  12.   Str               1918  4415
  13.   FloatToStr        4196  6505
  14.   RyuDoubleToString 2823  7441
  15.   Double2Str        1550  3463
  16. -------------------------------
  17.  

It seems that parsing in lgJson has become noticeably faster, so it makes sense to run the parser benchmark again.

The list of samples has changed a bit and a parser from InternetTools (XQJsonParser) has been added. InternetTools was compiled with USE_PASDBLSTRUTILS define. The rest of the test subjects remained the same. The results on a win-64 machine can be seen in the attached bar chart.

Okoba

  • Full Member
  • ***
  • Posts: 249
Re: Yet another generics collection
« Reply #84 on: October 17, 2021, 11:40:57 am »
Great as always, avk. It is very nice that you did implement them in Pascal, it is worth mentioning them in the LGenerics readme, and maybe tell the source ones to add yours in their readme too.
Did you compare your implementation with the source ones? It seems like an interesting thing to find out the output code for two compilers.
May I ask why you did not public the Eisel-Lemire procedure?
It seems Ryu has a String to Double method too, do you have plans to test it too?

avk

  • Sr. Member
  • ****
  • Posts: 490
    • my self-education project
Re: Yet another generics collection
« Reply #85 on: October 17, 2021, 03:25:43 pm »
Thank you.
Actually, ports of these algorithms to Pascal have already existed, at least here(don't know how long ago), so my port is definitely not the first. Perhaps it really makes sense to mention them in the readme, despite the fact that they are tied to JSON.

...
Did you compare your implementation with the source ones? It seems like an interesting thing to find out the output code for two compilers.
...
Yes, you're right, that would be interesting, maybe a little later...

...
May I ask why you did not public the Eisel-Lemire procedure?
...
To be honest, I didn't expect it to be interesting to anyone else. It is used to convert a string that has already been checked by the main parser, therefore it uses a weakened algorithm for parsing. Nothing, of course, prevents from making a normal parser for it, I will try and report the results.

...
It seems Ryu has a String to Double method too, do you have plans to test it too?
No, I don't plan to, but it seems that PasDblStrUtils has an implementation of it.
« Last Edit: October 17, 2021, 03:33:37 pm by avk »

 

TinyPortal © 2005-2018