Recent

Author Topic: A new design for a JSON Parser  (Read 42695 times)

alpine

  • Hero Member
  • *****
  • Posts: 1032
Re: A new design for a JSON Parser
« Reply #60 on: July 28, 2021, 11:29:03 am »
*snip*
The FPC internal layouts are used to bypass the RTL when it makes a difference.
See mormot.core.rtti.pas about how we use the official typinfo unit as source, but encapsulate it into a Delphi/FPC compatible wrapper, and also introduce some RTTI cache as TRttiCustom/TRttiJson classes, with ready-to-use methods and settings.
Patching variants, strings, dynarrays, bypassing RTL, caching RTTI (OK, you named it), alternate ways of creating instances - that is what I meant. In other words - hacks: https://en.wikipedia.org/wiki/Hack_(computer_science)

mORMot users don't need to deal into those details. They just use the high level methods like JSON, ORM or SOA, letting the low level framework do its work.
*snip*
Until they hit the curb! I've been there.
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

sysrpl

  • Sr. Member
  • ****
  • Posts: 315
    • Get Lazarus
Re: A new design for a JSON Parser
« Reply #61 on: July 28, 2021, 11:29:34 am »
I am the author of JsonTools. I know guys it's been a long while since it was originally posted, but I will publish an update with fixes and some new features, including xpath like querying.

If you have any bug examples with JsonTools, or feature requests, please post them here. The update will likely be pushed this weekend, and I will provide a write up of the fixes and enhancements.

Anthony

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: A new design for a JSON Parser
« Reply #62 on: July 28, 2021, 01:05:21 pm »
If you have any bug examples with JsonTools, or feature requests, please post them here.

There is a small bug pointed out by Lainz here.
« Last Edit: July 28, 2021, 01:09:18 pm by engkin »

avk

  • Hero Member
  • *****
  • Posts: 752
Re: A new design for a JSON Parser
« Reply #63 on: July 28, 2021, 01:15:44 pm »
Hmm, maybe I don't understand something, but is this
Code: Javascript  [Select][+][-]
  1.   [ -01001, ,- , , ,42.e]
  2.  
the valid  JSON?
Anyway, Mormot.Core.Json.IsValidJson() claims yes.

PascalDragon

  • Hero Member
  • *****
  • Posts: 5444
  • Compiler Developer
Re: A new design for a JSON Parser
« Reply #64 on: July 28, 2021, 01:19:05 pm »
Variant version is faster, not much for being variant, because the underlining JSON parsing of mORMot. Being variant makes it simpler to use to some tastes.
By "simpler" I guess you mean writing J.X instead of C.Integers['X'], both of them require a lookup, but as the former depends on some compiler magic to skip quotes, the latter has at least a run-time type check. Both ways will require a Find('X') to ensure the attribute is present and there won't be a "bang".

There is no compiler magic that "skips quotes", but there is compiler magic that will replace J.X by (pseudocode) TDocVariantDataInstance.DispInvoke(@J, 'X'). So it is a bit more indirect, but in the end both will do the same to determine whether X exists.

Also this is not a hack, but a well defined feature of the Object Pascal language.


alpine

  • Hero Member
  • *****
  • Posts: 1032
Re: A new design for a JSON Parser
« Reply #65 on: July 28, 2021, 01:39:49 pm »
Variant version is faster, not much for being variant, because the underlining JSON parsing of mORMot. Being variant makes it simpler to use to some tastes.
By "simpler" I guess you mean writing J.X instead of C.Integers['X'], both of them require a lookup, but as the former depends on some compiler magic to skip quotes, the latter has at least a run-time type check. Both ways will require a Find('X') to ensure the attribute is present and there won't be a "bang".

There is no compiler magic that "skips quotes", but there is compiler magic that will replace J.X by (pseudocode) TDocVariantDataInstance.DispInvoke(@J, 'X'). So it is a bit more indirect, but in the end both will do the same to determine whether X exists.

Sorry if I didn't use correct wording, by 'skipping quotes' I meant calling fpc_dispinvoke_variant internally, instead of compile-time field address calculation, i.e. method calling, just like TJSONData.Integers['X']. Same as your explanation.

Also this is not a hack, but a well defined feature of the Object Pascal language.
I'm not calling that a hack.

Quote from: y.ivanov
Patching variants, strings, dynarrays, bypassing RTL, caching RTTI (OK, you named it), alternate ways of creating instances - that is what I meant. In other words - hacks
What I'm saying is that the mORMot source is full of hacks. All justified with one word: speed.
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

alpine

  • Hero Member
  • *****
  • Posts: 1032
Re: A new design for a JSON Parser
« Reply #66 on: July 28, 2021, 03:08:25 pm »
Hmm, maybe I don't understand something, but is this
Code: Javascript  [Select][+][-]
  1.   [ -01001, ,- , , ,42.e]
  2.  
the valid  JSON?
Anyway, Mormot.Core.Json.IsValidJson() claims yes.

My findings:
Code: Pascal  [Select][+][-]
  1.   WriteLn(VariantSaveJSON(_Json('[ -01001, 42.e]')));    // [-1001,"42.e"]
  2.   WriteLn(VariantSaveJSON(_Json('[ -01001, 42.0e]')));   // [-1001,42]
  3.   WriteLn(VariantSaveJSON(_Json('[ -01001, 42.0.1e]'))); // [-1001,"42.0.1e"]
  4.   WriteLn(VariantSaveJSON(_Json('[ -01001, 42.0e1]')));  // [-1001,42]
  5.   WriteLn(VariantSaveJSON(_Json('[ -01001, 42.0e1,]'))); // null
  6.   WriteLn(VariantSaveJSON(_Json('[ -01001, 42.0e1,false]'))); // [-1001,42,false]
  7.   WriteLn(VariantSaveJSON(_Json('[ -01001, 42.0e1,FALSE]'))); // null
  8.   WriteLn(VariantSaveJSON(_Json('[ -01001, 42.0e1,0FALSE]'))); // [-1001,42,"0FALSE"]

  • It accepts leading zeros in numbers (all lines) (!?)
  • It treats tokens starting with digit as strings when they are ill-formed numbers (lines 1,3,8) (!?!)
  • It accepts numbers with empty exponent (line 2) (!?!)

Edit:
It accepts leading zeros in numbers (all lines), because of wrong condition at:
https://github.com/synopse/mORMot2/blob/f2d748b39fd582a61d18e7972447724dbefb8a3b/src/core/mormot.core.variants.pas#L7345

It treats tokens starting with digit as strings when they are ill-formed numbers (lines 1,3,8), because of the default result at:
https://github.com/synopse/mORMot2/blob/f2d748b39fd582a61d18e7972447724dbefb8a3b/src/core/mormot.core.variants.pas#L7337
... and then ...
https://github.com/synopse/mORMot2/blob/f2d748b39fd582a61d18e7972447724dbefb8a3b/src/core/mormot.core.variants.pas#L7396
... or ...
https://github.com/synopse/mORMot2/blob/f2d748b39fd582a61d18e7972447724dbefb8a3b/src/core/mormot.core.variants.pas#L7413



« Last Edit: July 28, 2021, 05:52:59 pm by y.ivanov »
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

avk

  • Hero Member
  • *****
  • Posts: 752
Re: A new design for a JSON Parser
« Reply #67 on: July 29, 2021, 05:59:54 am »
Sometimes worse things can happen. If you feed Mormot.Core.Variants.JSONToVariant() JSON from an attachment, the parser will just silently die.

Edit:
At least in Win64, it really dies silently.
In Win32, it crashes with a yell "Out of memory".
« Last Edit: July 29, 2021, 06:26:09 am by avk »

abouchez

  • Full Member
  • ***
  • Posts: 110
    • Synopse
Re: A new design for a JSON Parser
« Reply #68 on: July 29, 2021, 08:09:59 am »
@avk
Thanks for the feedback!

IsValidJson() is meant to be fast, not strict. It guesses if the layout seems JSON. It has false positive for sure, but valid JSON will always be seen as such. It is used before feeding some slower functions.
This is a limitation I will try to circumvent. A bit more paranoid may not hurt.

About [[[[[[[[[......[[[[[[[[[ it is a nice catch. It should reject it directly.
But the "out of memory" seems like a stack overflow problem due to the recursive calls handling the arrays. But not easy to fix I guess: switching from recursion to a state machine will be some work for sure.

About numbers or invalid numbers converted into strings, this was seen as a feature: the idea is to not loose information if the data can't be stored.
For instance, with TDocVariant double values are not converted by default. It parses only currency values (up to 4 decimals) which are always correct. You need to add the dvoAllowDoubleValue option when you create the variant.

Note that TextToVariantNumberType() is not involved in this conversion. But GetNumericVariantFromJson() is used for TDocVariant. I will try to make it more compliant: it rejects 0123 but not -0123 indeed.

Last hint is that the mORMot JSON parser can be very relaxed. On purpose.
For instance, it supports the MongoDB extended syntax:{a:1,b:2} or {first:date("2021/01/23")} is seens as valid JSON. We use this syntax between mORMot client and server to save bandwidth.
It was never meant to be a strict JSON parser. Just an efficient JSON parser which will try to reflect its input.

avk

  • Hero Member
  • *****
  • Posts: 752
Re: A new design for a JSON Parser
« Reply #69 on: July 29, 2021, 08:58:24 am »
...
About [[[[[[[[[......[[[[[[[[[ it is a nice catch.
...

I didn't invent it myself.
In my post #44 I asked you a question about an interesting test suite, you didn't answer, so I decided to find out myself.

alpine

  • Hero Member
  • *****
  • Posts: 1032
Re: A new design for a JSON Parser
« Reply #70 on: July 29, 2021, 04:09:51 pm »
If you want a fast JSON parser for FPC, you may try what mORMot 2 offers.

Some numbers, parsing a JSON array of 8000 objects, for a bit more than 1MB:

Code: [Select]
  - JSON benchmark: 100,267 assertions passed  1.31s
     IsValidUtf8() in 16.63ms, 1.1 GB/s
     IsValidJson(RawUtf8) in 24.78ms, 790.8 MB/s
     IsValidJson(PUtf8Char) in 23.22ms, 843.9 MB/s
     JsonArrayCount(P) in 23.26ms, 842.7 MB/s
     JsonArrayCount(P,PMax) in 22.74ms, 862 MB/s
     JsonObjectPropCount() in 9.28ms, 1.1 GB/s
     TDocVariant in 140.43ms, 139.6 MB/s
     TDocVariant dvoInternNames in 156.73ms, 125 MB/s
     TOrmTableJson GetJsonValues in 24.98ms, 345.1 MB/s
     TOrmTableJson expanded in 37.36ms, 524.7 MB/s
     TOrmTableJson not expanded in 20.96ms, 411.2 MB/s
     fpjson in 810.40ms, 10.6 MB/s
I have certain doubts about the numbers written and their meanings. As far as I understand this is the output from the tests\mormot2tests program. For the specific tests cited, they don't do the same thing the last one does, i.e. full JSON parsing: fpjson := GetJSON(people, {utf8=}true).
In particular:
  • IsValidUtf8() - returns TRUE if the supplied buffer has valid UTF-8 encoding - irrelevant
  • IsValidJson(RawUtf8), IsValidJson(PUtf8Char) - returns true when argument looks like a JSON - irrelevant
  • JsonArrayCount() - does a flat scan, counting the elements in a JSON, which is assumed to be an array
  • JsonObjectPropCount() - counts number of fields in 1-st object into the JSON array
  • TOrmTableJson - does a partial split (1-st level) of a JSON, which is assumed to be an array
Throughput is calculated as length of the parsed data (* iterations) for the duration of execution, normalized in seconds. But how can they be compared when they do different things?

The only relevant test I see is TDocVariant. In the above figures TDocVariant is 139.6 MB/s vs 10.6 MB/s for fpjson. Let say that it is approx. 13:1 result.
At the other hand, I ran the recent mormot2tests as-is from the github. What I've got is:
Code: Text  [Select][+][-]
  1. - JSON benchmark: 100,293 assertions passed  3.13s
  2.      StrLen() in 1.01ms, 18.8 GB/s
  3.      IsValidUtf8(RawUtf8) in 14.84ms, 1.2 GB/s
  4.      IsValidUtf8(PUtf8Char) in 17.48ms, 1 GB/s
  5.      IsValidJson(RawUtf8) in 41.40ms, 473.4 MB/s
  6.      IsValidJson(PUtf8Char) in 39.19ms, 500.2 MB/s
  7.      JsonArrayCount(P) in 39.87ms, 491.6 MB/s
  8.      JsonArrayCount(P,PMax) in 37.54ms, 522.2 MB/s
  9.      JsonObjectPropCount() in 12.99ms, 873.4 MB/s
  10.      TDocVariant in 1.22s, 15.9 MB/s
  11.      TDocVariant dvoInternNames in 710.61ms, 27.5 MB/s
  12.      TOrmTableJson GetJsonValues in 36.38ms, 237 MB/s
  13.      TOrmTableJson expanded in 66.58ms, 294.4 MB/s
  14.      TOrmTableJson not expanded in 41.52ms, 207.7 MB/s
  15.      DynArrayLoadJson in 359.06ms, 54.6 MB/s
  16.      fpjson in 480.29ms, 4 MB/s

In short, mORMot 2 JSON parser is from 13 times to 50 times faster than fpjson - and I guess JSON tools.
Figures are: TDocVariant is 15.9 MB/s vs 4 MB/s for fpjson. That is 4:1. Not 50:1! Not 13:1!
I think 4:1 is also impressive, but in no way same as 13-50 to 1.


"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

abouchez

  • Full Member
  • ***
  • Posts: 110
    • Synopse
Re: A new design for a JSON Parser
« Reply #71 on: July 29, 2021, 05:11:15 pm »
The FPC tests were done on Linux x86_64 with FPC 3.2.0 and -O3.
So it is properly inlined, and uses our x86_64 memory manager written in asm.
Which are the recommended settings for a production server. You get similar results on Win64.
I guess you are using Win32 with the default FPC memory manager.

Some numbers are indeed irrelevant when comparing to other parsers - but meaningful anyway, since they are used within the framework and I wanted to have wide  benchmark values. For instance, IsValidUtf8() on x86_64 is around 13GB/s on my AVX2 CPU.

The relevant parsers, making a full parsing and extracting, are indeed:
- TDocVariant = object/array nodes parser storing values as variants, stored in a TDocVariantData record which could be mapped as a custom variant type;
- TOrmTableJson = array parser, optimized for an ORM list, with in-place unescape and #0 ending, creating a list of PUtf8Char to each value;
- DynArrayLoadJson = array parser, filling a dynamic array of records with all values.

As I wrote after this initial post, the initial numbers above were wrong about fpjson, due to an invalid length in the calculation. I have then published new values.

I have rewritten the JSON validator to be stronger and also slightly faster.
With the latest JSON validator (which I will commit in the next hours), here are some updated numbers:
Code: [Select]
- JSON benchmark: 100,307 assertions passed  843.40ms
     StrLen() in 826us, 23.1 GB/s
     IsValidUtf8(RawUtf8) in 1.46ms, 13 GB/s
     IsValidUtf8(PUtf8Char) in 2.29ms, 8.3 GB/s
     IsValidJson(RawUtf8) in 20.74ms, 0.9 GB/s
     IsValidJson(PUtf8Char) in 20.95ms, 0.9 GB/s
     JsonArrayCount(P) in 20.12ms, 0.9 GB/s
     JsonArrayCount(P,PMax) in 19.97ms, 0.9 GB/s
     JsonObjectPropCount() in 10.98ms, 1 GB/s
     TDocVariant in 123.71ms, 158.4 MB/s
     TDocVariant dvoInternNames in 146.39ms, 133.9 MB/s
     TOrmTableJson GetJsonValues in 24.31ms, 354.5 MB/s
     TOrmTableJson expanded in 39.12ms, 501 MB/s
     TOrmTableJson not expanded in 20.89ms, 412.6 MB/s
     DynArrayLoadJson in 61.68ms, 317.8 MB/s
     fpjson in 79.39ms, 24.6 MB/s
     jsontools in 50.50ms, 38.8 MB/s
     SuperObject in 184.59ms, 10.6 MB/s
The numbers slightly change during each call on my Core i5 laptop, but the order of magnitude remains.
DynArrayLoadJson is almost 13 times faster than fpjson, and TOrmTableJson is 20 times faster. TDocVariant is "only" 6 times faster, and if you intern the names (i.e. you don't allocate a string for each property name, but reuse an existing string with refcnt + 1 which reduces a lot the memory usage) it is slightly below 6 times faster.
Since jsontools is faster then fpjson, the numbers are lower for it, but DynArrayLoadJson is still 8 times faster than jsontools, with a lot less memory consumption.

abouchez

  • Full Member
  • ***
  • Posts: 110
    • Synopse
Re: A new design for a JSON Parser
« Reply #72 on: July 29, 2021, 05:20:36 pm »
Here are some numbers on Win32:
Code: [Select]
  - Encode decode JSON: 430,145 assertions passed  97.31ms
  - JSON benchmark: 100,307 assertions passed  1.03s
     StrLen() in 813us, 23.5 GB/s
     IsValidUtf8(RawUtf8) in 11.08ms, 1.7 GB/s
     IsValidUtf8(PUtf8Char) in 11.91ms, 1.6 GB/s
     IsValidJson(RawUtf8) in 23.44ms, 836.2 MB/s
     IsValidJson(PUtf8Char) in 21.99ms, 891.6 MB/s
     JsonArrayCount(P) in 21.29ms, 920.6 MB/s
     JsonArrayCount(P,PMax) in 21.38ms, 917 MB/s
     JsonObjectPropCount() in 10.54ms, 1 GB/s
     TDocVariant in 200.88ms, 97.6 MB/s
     TDocVariant dvoInternNames in 196.29ms, 99.8 MB/s
     TOrmTableJson GetJsonValues in 24.37ms, 353.9 MB/s
     TOrmTableJson expanded in 44.57ms, 439.8 MB/s
     TOrmTableJson not expanded in 30.68ms, 281 MB/s
     DynArrayLoadJson in 88.07ms, 222.6 MB/s
     fpjson in 74.31ms, 26.3 MB/s
     jsontools in 61.22ms, 32 MB/s
     SuperObject in 178.94ms, 10.9 MB/s
On Win32, DynArrayLoadJson is still 8 times faster than fpjson and 7 times faster than jsontools.

When you see that dvoInternNames is faster than the plain TDocVariant, we can guess that on Win32 the FPC memory manager becomes a bottleneck: our AesNiHash + cached string assignment is faster than direct string allocation.

Edit: Please check https://github.com/synopse/mORMot2/commit/10cef0c3edeee786707ac23aaa91c6e23842bce8 about the new state engine.
IsValidJson() GotoEndJsonItemStrict() JsonArrayCount() JsonObjectPropCount() are still relaxed on numbers, escaped strings and commas, but [[[[[...[[[[[[ will be detected, with an explicit Strict: boolean parameter if you really don't want to parse the MongoDB extended JSON syntax. On my laptop, the state machine reaches 900MB/s. Of course, it is not a full parser, but it is used e.g. by TDocVariant or DynArrayLoadJson to guess the number of items of the supplied JSON, to pre-allocate the buffers. I will now try to integrate it in other mORMot methods, instead of dedicated written code.
« Last Edit: July 29, 2021, 05:38:11 pm by abouchez »

BeniBela

  • Hero Member
  • *****
  • Posts: 905
    • homepage
Re: A new design for a JSON Parser
« Reply #73 on: July 29, 2021, 05:58:47 pm »
fpjson also has three different variants. jsonparser, jsonreader, and jsonscanner

You should benchmark them all.

alpine

  • Hero Member
  • *****
  • Posts: 1032
Re: A new design for a JSON Parser
« Reply #74 on: July 29, 2021, 06:20:15 pm »
@abouchez,
From your latest figures I can see that a more realistic TDocVariant vs fpjson ratio emerges: 3.711.You continue to compare items which do other things and were specially crafted for specific use cases. TDocVariant is the only thing that is comparable. Do you realize that you're throwing some unrelated numbers and after that you state a general conclusion below?
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

 

TinyPortal © 2005-2018