Topic: [SOLVED] Is there a way for TJSONParser to deal with a String/Stream having a BOM?

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1120
  • Professional amateur ;-P
Hi there,

Is there a way I can ensure that TJSONParser doesn't blow up when the String/Stream contains a BOM, or do I have to detect it, strip it and then pass the result to the parser?

Many thanks in advance for any help!

Cheers,
Gus
« Last Edit: March 13, 2021, 07:45:52 pm by gcarreno »
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

john horst

  • Jr. Member
  • **
  • Posts: 68
    • JHorst
Detecting and stripping it is the workaround if you have no choice. I would notify whoever you are consuming from that a BOM is not valid in JSON.
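A minimal sketch of the detect-and-strip workaround, assuming the whole input has already been read into a TBytes buffer. HasUTF8BOM and ParseSkippingBOM are hypothetical helper names, not part of fpjson. The payload is copied into a fresh stream positioned at 0 because this sketch makes no assumption about how the parser treats a stream whose position is not 0.

```pascal
program StripBomDemo;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, fpjson, jsonparser; // jsonparser registers the parser used by GetJSON

// Hypothetical helper: True when Data starts with the UTF-8 BOM (EF BB BF).
function HasUTF8BOM(const Data: TBytes): Boolean;
begin
  Result := (Length(Data) >= 3) and
            (Data[0] = $EF) and (Data[1] = $BB) and (Data[2] = $BF);
end;

// Hypothetical helper: copies the payload (minus any UTF-8 BOM) into a
// fresh stream and hands that stream to the parser.
// The caller owns the returned TJSONData and must free it.
function ParseSkippingBOM(const Data: TBytes): TJSONData;
var
  Offset: Integer;
  Stream: TMemoryStream;
begin
  Offset := 0;
  if HasUTF8BOM(Data) then
    Offset := 3;
  Stream := TMemoryStream.Create;
  try
    if Length(Data) > Offset then
      Stream.WriteBuffer(Data[Offset], Length(Data) - Offset);
    Stream.Position := 0;
    Result := GetJSON(Stream);
  finally
    Stream.Free;
  end;
end;

var
  Raw: TBytes;
  Tree: TJSONData;
begin
  // EF BB BF followed by the ASCII bytes of {"a":1}
  Raw := TBytes.Create($EF, $BB, $BF, $7B, $22, $61, $22, $3A, $31, $7D);
  Tree := ParseSkippingBOM(Raw);
  try
    WriteLn(Tree.AsJSON);
  finally
    Tree.Free;
  end;
end.
```

Any exception raised by the parser still propagates to the caller, so this only removes the BOM as a source of failure.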

Gustavo 'Gus' Carreno

Hi John,

Quote from john horst: "Detecting and stripping it is the workaround if you have no choice. I would notify whoever you are consuming from that a BOM is not valid in JSON."

Well, in a sense, no, I don't really have a choice, since this is a problem I need to solve in my laz-JSON-Viewer, which is meant to be a default JSON viewer.
This means I have to try and parse whatever the user shoots at me.
I'm already popping up a message when the parser emits an Exception.

And after removing the BOM, will TJSONParser be resourceful enough to deal with UTF8 and/or UTF16?

Cheers,
Gus

PascalDragon

  • Hero Member
  • *****
  • Posts: 5481
  • Compiler Developer
Quote from Gustavo 'Gus' Carreno: "Is there a way I can ensure that TJSONParser doesn't blow up when the String/Stream contains a BOM, or do I have to detect it, strip it and then pass the result to the parser?"

The latter is the way to go.

Quote from Gustavo 'Gus' Carreno: "And after removing the BOM, will TJSONParser be resourceful enough to deal with UTF8 and/or UTF16?"

UTF-8 is what the parser assumes by default. For UTF-16 you'll have to convert it beforehand, or it might also work to use a TStringStream with the encoding set correctly (not tested).
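The "convert it beforehand" route could look roughly like the sketch below. It assumes the UTF-16 input announces itself with a BOM (FF FE or FE FF) and has been read into a TBytes buffer; UTF16ToUTF8 is a hypothetical helper name. The code units are assembled by hand and then passed to the RTL's UTF8Encode, so the sketch does not depend on the untested TStringStream idea.

```pascal
program Utf16ToUtf8Demo;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: converts BOM-prefixed UTF-16 input to UTF-8.
// Input without a UTF-16 BOM is returned unchanged, byte for byte.
// Caveat: FF FE 00 00 is the UTF-32LE BOM; this sketch does not check for it.
function UTF16ToUTF8(const Data: TBytes): RawByteString;
var
  LittleEndian, BigEndian: Boolean;
  Wide: UnicodeString;
  I: Integer;
begin
  Result := '';
  LittleEndian := (Length(Data) >= 2) and (Data[0] = $FF) and (Data[1] = $FE);
  BigEndian := (Length(Data) >= 2) and (Data[0] = $FE) and (Data[1] = $FF);
  if not (LittleEndian or BigEndian) then
  begin
    // No UTF-16 BOM: hand the bytes back untouched.
    SetLength(Result, Length(Data));
    if Length(Data) > 0 then
      Move(Data[0], Result[1], Length(Data));
    Exit;
  end;
  // One UTF-16 code unit per two bytes after the BOM;
  // a trailing odd byte is ignored.
  Wide := '';
  SetLength(Wide, (Length(Data) - 2) div 2);
  for I := 1 to Length(Wide) do
    if LittleEndian then
      Wide[I] := WideChar(Word(Data[2 * I]) or (Word(Data[2 * I + 1]) shl 8))
    else
      Wide[I] := WideChar((Word(Data[2 * I]) shl 8) or Word(Data[2 * I + 1]));
  Result := UTF8Encode(Wide);
end;

var
  Raw: TBytes;
begin
  // FF FE, then the two characters of {} as UTF-16LE
  Raw := TBytes.Create($FF, $FE, $7B, $00, $7D, $00);
  WriteLn(UTF16ToUTF8(Raw));
end.
```

The result carries no BOM, so it can be handed to the parser as ordinary UTF-8 text.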

MarkMLl

  • Hero Member
  • *****
  • Posts: 6686
Quote from Gustavo 'Gus' Carreno: "Is there a way I can ensure that TJSONParser doesn't blow up when the String/Stream contains a BOM, or do I have to detect it, strip it and then pass the result to the parser?"

Quote from PascalDragon: "The latter is the way to go."

In any event, logging as much information as possible about the input encoding, as early as possible, provides valuable diagnostics.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

avk

  • Hero Member
  • *****
  • Posts: 752
Just FYI, the current JSON standard states that JSON text MUST be encoded using UTF-8.

AlexTP

  • Hero Member
  • *****
  • Posts: 2402
    • UVviewsoft
I made a patch to allow UTF8 BOM
https://bugs.freepascal.org/view.php?id=38607

PascalDragon

Quote from avk: "Just FYI, the current JSON standard states that JSON text MUST be encoded using UTF-8."

That doesn't stop users (or devs) from using anything else. E.g., if you pipe output to a file in Windows PowerShell, it will be UTF-16 by default, and if you don't notice, you end up passing a UTF-16 file around...

avk

It is clear that no standards can prohibit people from doing what they want.
However, using any format without strict adherence to the standard makes no sense.

Gustavo 'Gus' Carreno

Hey peeps,

I agree that if I'm producing a default viewer for JSON, I need to comply with the definition, which for now means allowing only UTF8 with no BOM.

This also means that I need to tell the user WHY I'm refusing to load his file, which in turn means that I need to determine the encoding of said file.

Now the question shifts into: Do you know of a good character set detection library?

From my initial, and very superficial, search on this forum I found a thread message with an attached file, chsdt.zip, which is a translation into FPC/Lazarus of chsdet from SourceForge.

Is this any good or is there a more recent body of work on charset detection under FPC/Lazarus?

Cheers,
Gus

avk

Quote from Gustavo 'Gus' Carreno: "This also means that I need to advise the user WHY I'm refusing to load his file..."

I would think of it the same way FPC does: X is expected at some position in the string, but Y is found. And TJSONParser supports this functionality.
But, as usual, the rabbit hole is much deeper than it looks. Just try passing this test suite to TJSONParser.
A leading 'n' in a filename means it should be rejected, and a leading 'y' means it should be accepted.

Gustavo 'Gus' Carreno

Hey avk

I'm really sorry @avk, but I don't really understand what you're proposing I do. :-\

If I use chsdet I can tell the user:  "JSON is UTF8 and your file is UTF16 with BOM, sorry won't read!"

How does your test suite help me tell the user that I'm not going to visualize his file?

Cheers,
Gus
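For the BOM part of that message, a full charset detection library is not strictly required, since the BOM byte patterns are fixed. The sketch below only classifies a leading BOM and maps it to a user-facing message; input without a BOM would still need something like chsdet. TBOMKind, BytesStartWith and DetectBOM are hypothetical names, and the message texts are placeholders.

```pascal
program BomMessageDemo;

{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TBOMKind = (bkNone, bkUTF8, bkUTF16LE, bkUTF16BE, bkUTF32LE, bkUTF32BE);

const
  // Placeholder messages, one per BOM kind.
  BOMMessage: array[TBOMKind] of String = (
    'No BOM found; the parser will assume UTF-8.',
    'File starts with a UTF-8 BOM, which JSON does not allow.',
    'JSON is UTF-8 and your file is UTF-16 (little endian) with BOM.',
    'JSON is UTF-8 and your file is UTF-16 (big endian) with BOM.',
    'JSON is UTF-8 and your file is UTF-32 (little endian) with BOM.',
    'JSON is UTF-8 and your file is UTF-32 (big endian) with BOM.');

// Hypothetical helper: True when Data begins with the bytes in Mark.
function BytesStartWith(const Data: TBytes; const Mark: array of Byte): Boolean;
var
  I: Integer;
begin
  Result := Length(Data) >= Length(Mark);
  if Result then
    for I := 0 to High(Mark) do
      if Data[I] <> Mark[I] then
        Exit(False);
end;

// Hypothetical helper: classifies a leading BOM, if any.
function DetectBOM(const Data: TBytes): TBOMKind;
begin
  // UTF-32LE (FF FE 00 00) must be tested before UTF-16LE (FF FE),
  // because the shorter pattern is a prefix of the longer one.
  if BytesStartWith(Data, [$FF, $FE, $00, $00]) then
    Result := bkUTF32LE
  else if BytesStartWith(Data, [$00, $00, $FE, $FF]) then
    Result := bkUTF32BE
  else if BytesStartWith(Data, [$EF, $BB, $BF]) then
    Result := bkUTF8
  else if BytesStartWith(Data, [$FF, $FE]) then
    Result := bkUTF16LE
  else if BytesStartWith(Data, [$FE, $FF]) then
    Result := bkUTF16BE
  else
    Result := bkNone;
end;

var
  Raw: TBytes;
begin
  // FF FE, then { as UTF-16LE
  Raw := TBytes.Create($FF, $FE, $7B, $00);
  WriteLn(BOMMessage[DetectBOM(Raw)]);
end.
```

Treating FF FE 00 00 as UTF-32LE is safe for this use case because a JSON text cannot begin with a NUL character, so valid UTF-16LE JSON never starts with that byte sequence.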

lucamar

  • Hero Member
  • *****
  • Posts: 4219
Quote from avk: "However, using any format without strict adherence to the standard makes no sense."

That, as browser builders found to their chagrin, is rather difficult to accomplish. Generally speaking, when a "format" has had time to develop in several directions before a strict standard came about, or when it is used for things it was never meant to deal with, one should follow the old IETF principle: be strict with what you emit but liberal with what you accept.

In this case, as with any text-based format, one must be ready to accept as many definitions of "text" as one possibly can, which is not as easy as it sounds. A few forms of text (UCS2, UTF16, ...) can be inferred from a "mark" in the text (the BOM), but with others one has to resort to "wild guesstimation", to byte-frequency analysis, or to asking the user what the heck he meant. Telling the user "I don't like that file! Correct it and try again!" should be an absolute last resort, reserved only for extremely badly formed text.

IMHO, of course :-\
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

avk

Quote from Gustavo 'Gus' Carreno: "How does your test suite help me tell the user that I'm not going to visualize his file?"

I was just trying to say, in a rather clumsy way, that even if you determine the encoding of the text, it does not automatically mean that TJSONParser will be able to parse it correctly.

Gustavo 'Gus' Carreno

Hey avk,

Quote from avk: "I was just trying to say, in a rather clumsy way, that even if you determine the encoding of the text, it does not automatically mean that TJSONParser will be able to parse it correctly."

Ahhh, I get it now, and I agree with you. But I'll still have the parse surrounded by a try..except and will report any exception text from said parser to the user.

At the moment, if TJSONParser encounters a BOM it will error out with a cryptic "cannot find valid JSON at position 1" or something along those lines.
If I can eliminate issues with encoding and BOM before parsing, well, that's one less cryptic message, right?

And for that I need a reliable and modern tool, hence me asking if anyone knows of one.

Cheers,
Gus
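The try..except wrapper described above might be sketched as follows, assuming the text has already passed the BOM and encoding checks. TryParseJSON is a hypothetical helper name. It catches the generic Exception class on purpose, because the exact exception classes raised by the scanner and parser are not assumed here.

```pascal
program ReportParseErrors;

{$mode objfpc}{$H+}

uses
  SysUtils, fpjson, jsonparser; // jsonparser registers the parser used by GetJSON

// Hypothetical helper: returns True and the parsed tree on success.
// On failure it returns False and puts the parser's own message in ErrorMsg.
// On success the caller owns Tree and must free it.
function TryParseJSON(const Source: TJSONStringType; out Tree: TJSONData;
  out ErrorMsg: String): Boolean;
begin
  Tree := nil;
  ErrorMsg := '';
  Result := False;
  try
    Tree := GetJSON(Source);
    Result := Tree <> nil;
    if not Result then
      ErrorMsg := 'No JSON data found.';
  except
    on E: Exception do
      // Keep the class name: it helps tell scanner errors from parser errors.
      ErrorMsg := E.ClassName + ': ' + E.Message;
  end;
end;

var
  Tree: TJSONData;
  Msg: String;
begin
  if TryParseJSON('{"ok": true}', Tree, Msg) then
  begin
    WriteLn('Parsed: ', Tree.AsJSON);
    Tree.Free;
  end
  else
    WriteLn('Could not parse: ', Msg);

  // Truncated input: the parser is expected to raise, which ends up in Msg.
  if TryParseJSON('{"broken": ', Tree, Msg) then
    Tree.Free
  else
    WriteLn('Could not parse: ', Msg);
end.
```

Running the BOM and encoding checks first means that whatever message reaches this handler is about the JSON itself rather than about stray bytes at position 1.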