[SOLVED]Is there a way for TJSONParser to deal with a String/Stream having a BOM

Gustavo 'Gus' Carreno

Hero Member
Posts: 1120
Professional amateur ;-P

[SOLVED]Is there a way for TJSONParser to deal with a String/Stream having a BOM

« on: March 08, 2021, 02:18:59 pm »

Hi there,

Is there a way I can ensure that TJSONParser doesn't blow up when the String/Stream contains a BOM, or do I have to detect it, strip it and only then pass the data to the parser?

Many thanks in advance for any help!

Cheers,
Gus

« Last Edit: March 13, 2021, 07:45:52 pm by gcarreno »

Logged

Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

john horst

Jr. Member
Posts: 68

Re: Is there a way for TJSONParser to deal with a String/Stream having a BOM

« Reply #1 on: March 08, 2021, 05:00:51 pm »

Detecting and stripping it is the workaround if you have no choice. I would notify whoever you are consuming from that a BOM is not valid in JSON.

Logged

Gustavo 'Gus' Carreno

Hero Member
Posts: 1120
Professional amateur ;-P

Re: Is there a way for TJSONParser to deal with a String/Stream having a BOM

« Reply #2 on: March 08, 2021, 05:34:41 pm »

Hi John,

Quote from: john horst on March 08, 2021, 05:00:51 pm

Detecting and stripping it is the workaround if you have no choice. I would notify whoever you are consuming from that a BOM is not valid in JSON.

Well, in a sense, no I don't really have a choice since this is a problem I need to solve in my laz-JSON-Viewer, which is meant to be a default JSON viewer.
This means I have to try and parse whatever the user shoots at me.
I'm already popping up a message when the parser emits an Exception.

And after removing the BOM, will TJSONParser be resourceful enough to deal with UTF8 and/or UTF16?

Cheers,
Gus

Logged

PascalDragon

Hero Member
Posts: 5481
Compiler Developer

Re: Is there a way for TJSONParser to deal with a String/Stream having a BOM

« Reply #3 on: March 09, 2021, 09:23:45 am »

Quote from: gcarreno on March 08, 2021, 02:18:59 pm

Is there a way I can ensure that TJSONParser doesn't blow up when the String/Stream contains a BOM, or do I have to detect it, strip it and only then pass the data to the parser?

The latter is the way to go.

Quote from: gcarreno on March 08, 2021, 05:34:41 pm

And after removing the BOM, will TJSONParser be resourceful enough to deal with UTF8 and/or UTF16?

UTF-8 is the default that the parser assumes. For UTF-16 you'll have to convert it beforehand or maybe it will also work to use a TStringStream with the encoding set correctly (not tested).
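
A minimal sketch of the detect, strip and convert step described above. It is written in Python purely for illustration, since it only compares and re-encodes raw bytes; the names `BOMS`, `split_bom` and `to_utf8_without_bom` are hypothetical, and treating BOM-less input as UTF-8 is an assumption rather than something the parser guarantees.

```python
import codecs

# BOM signatures. The UTF-32 LE BOM (FF FE 00 00) starts with the
# UTF-16 LE BOM (FF FE), so UTF-32 must be tested before UTF-16.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def split_bom(raw: bytes):
    """Return (encoding, payload) with any leading BOM removed.

    encoding is None when no BOM is present; the payload is then unchanged.
    """
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return encoding, raw[len(bom):]
    return None, raw


def to_utf8_without_bom(raw: bytes) -> bytes:
    """Strip a BOM if present and re-encode the payload as UTF-8."""
    encoding, payload = split_bom(raw)
    if encoding is None:
        encoding = "utf-8"  # assumption: BOM-less input is already UTF-8
    return payload.decode(encoding).encode("utf-8")
```

The result of `to_utf8_without_bom` is what would be handed to the parser. The same byte comparisons should port directly to Free Pascal, with the stripped and converted bytes then passed to TJSONParser.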

Logged

MarkMLl

Hero Member
Posts: 6686

Re: Is there a way for TJSONParser to deal with a String/Stream having a BOM

« Reply #4 on: March 09, 2021, 09:45:16 am »

Quote from: PascalDragon on March 09, 2021, 09:23:45 am

Quote from: gcarreno on March 08, 2021, 02:18:59 pm
Is there a way I can ensure that TJSONParser doesn't blow up when the String/Stream contains a BOM, or do I have to detect it, strip it and only then pass the data to the parser?

The latter is the way to go.

In any event, logging as much information as possible about the input encoding, as early as possible, provides valuable diagnostics.

MarkMLl

Logged

MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

avk

Hero Member
Posts: 752

Re: Is there a way for TJSONParser to deal with a String/Stream having a BOM

« Reply #5 on: March 09, 2021, 10:02:15 am »

Just FYI, the current JSON standard (RFC 8259) states that JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8.

Logged

AlexTP

Hero Member
Posts: 2402

Re: Is there a way for TJSONParser to deal with a String/Stream having a BOM

« Reply #6 on: March 09, 2021, 10:31:58 am »

I made a patch to allow a UTF-8 BOM:
https://bugs.freepascal.org/view.php?id=38607

Logged

CudaText editor - ATSynEdit - More from me

PascalDragon

Hero Member
Posts: 5481
Compiler Developer

Re: Is there a way for TJSONParser to deal with a String/Stream having a BOM

« Reply #7 on: March 09, 2021, 01:38:15 pm »

Quote from: avk on March 09, 2021, 10:02:15 am

Just FYI, the current JSON standard (RFC 8259) states that JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8.

That doesn't stop users (or devs) from using anything. E.g. if one redirects output to a file in Windows PowerShell, it is UTF-16 by default, and if one doesn't notice, one ends up throwing a UTF-16 file around...

Logged

avk

Hero Member
Posts: 752

Re: Is there a way for TJSONParser to deal with a String/Stream having a BOM

« Reply #8 on: March 09, 2021, 01:58:13 pm »

It is clear that no standards can prohibit people from doing what they want.
However, using any format without strict adherence to the standard makes no sense.

Logged

Gustavo 'Gus' Carreno

Hero Member
Posts: 1120
Professional amateur ;-P

Re: Is there a way for TJSONParser to deal with a String/Stream having a BOM

« Reply #9 on: March 10, 2021, 04:46:53 pm »

Hey peeps,

I agree that if I'm producing a default viewer for JSON, I need to comply with the definition, so that means only allowing UTF-8 with no BOM at the moment.

This also means that I need to advise the user WHY I'm refusing to load his file, which in turn means that I need to determine the encoding of said file.

Now the question shifts into: Do you know of a good character set detection library?

From my initial, and very superficial, search of the matter on this forum I found this thread message:

GuessEncoding and CovertEncoding, with SDF Dataset, on Linux

which contained the attached file chsdt.zip, a translation into FPC/Lazarus of chsdet on SourceForge.

Is this any good or is there a more recent body of work on charset detection under FPC/Lazarus?

Cheers,
Gus

Logged

avk

Hero Member
Posts: 752

Re: Is there a way for TJSONParser to deal with a String/Stream having a BOM

« Reply #10 on: March 10, 2021, 08:23:47 pm »

Quote from: gcarreno on March 10, 2021, 04:46:53 pm

This also means that I need to advise the user WHY I'm refusing to load his file...

I would approach it the same way FPC does: X is expected at some position in the string, but Y is found. TJSONParser supports this kind of reporting.
But, as usual, the rabbit hole is much deeper than it looks. Just try passing this test suite to TJSONParser.
A leading 'n' in a filename means it should be rejected, and a leading 'y' means it should be accepted.
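
A rough sketch of how such a conformance run could be automated, following only the naming rule stated above. It is in Python and uses Python's own `json.loads` as a stand-in parser, so it says nothing about how TJSONParser itself fares; `expected_verdict` and `run_suite` are hypothetical names, and files with any other prefix are simply skipped.

```python
import json
from pathlib import Path


def expected_verdict(name: str):
    """Map a file name to the expected outcome; None if the prefix is unknown."""
    if name.startswith("y"):
        return "accept"
    if name.startswith("n"):
        return "reject"
    return None


def run_suite(directory, parse=json.loads):
    """Return (file name, expected, actual) for every file the parser gets wrong."""
    mismatches = []
    for path in sorted(Path(directory).iterdir()):
        expected = expected_verdict(path.name)
        if expected is None or not path.is_file():
            continue
        try:
            parse(path.read_bytes())
            actual = "accept"
        except Exception:
            # Any failure counts as a rejection, including undecodable bytes
            actual = "reject"
        if actual != expected:
            mismatches.append((path.name, expected, actual))
    return mismatches
```

Passing a different `parse` callable lets the same loop exercise another parser. Even the stand-in is not fully strict: by default Python's `json.loads` accepts `NaN`, which a strict JSON parser should reject.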

Logged

Gustavo 'Gus' Carreno

Hero Member
Posts: 1120
Professional amateur ;-P

Re: Is there a way for TJSONParser to deal with a String/Stream having a BOM

« Reply #11 on: March 10, 2021, 10:19:38 pm »

Hey avk

I'm really sorry @avk, but I don't really understand what you're proposing I do. :-\

If I use chsdet I can tell the user: "JSON is UTF8 and your file is UTF16 with BOM, sorry won't read!"

How does your test suite help me in terms of telling the user that I'm not going to visualize his file?

Cheers,
Gus

Logged

lucamar

Hero Member
Posts: 4219

Re: Is there a way for TJSONParser to deal with a String/Stream having a BOM

« Reply #12 on: March 11, 2021, 12:16:16 am »

Quote from: avk on March 09, 2021, 01:58:13 pm

However, using any format without strict adherence to the standard makes no sense.

That, as browser builders found to their chagrin, is rather difficult to accomplish. Generally speaking, when a "format" has had time to develop in several directions before a strict standard came about, or when it is used for things it was never meant to deal with, one should follow the old IETF principle: be strict with what you emit but liberal with what you accept.

In this case, as with any text-based format, one must be ready to accept as many definitions of "text" as one possibly can, which is not as easy as it sounds. A few forms of text (UCS-2, UTF-16, ...) can be inferred from a "mark" in the text (the BOM), but with others one has to resort to "wild guesstimation", to byte-frequency analysis, or to asking the user what the heck he meant. Telling the user: "I don't like that file! Correct it and try again!" should be an absolute last resort, reserved only for extremely badly formed text.
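
For JSON specifically, one cheap guess is available even without a BOM. The older RFC 4627 (section 3) observed that a JSON text starts with two ASCII characters, so the pattern of zero bytes in the first four bytes tells the Unicode encodings apart. The Python sketch below illustrates that idea only; `guess_bomless_encoding` is a hypothetical name, the two-ASCII-characters premise is an assumption about the input, and it does not attempt the legacy 8-bit charset detection that a library like chsdet aims at.

```python
def guess_bomless_encoding(raw: bytes) -> str:
    """Guess the encoding of BOM-less JSON from the zero bytes in its first four bytes."""
    head = raw[:4]
    if len(head) < 4:
        return "utf-8"  # too short to tell; fall back to the default
    zeros = tuple(b == 0 for b in head)
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Anything else (typically no zero bytes at all) is treated as UTF-8
    return patterns.get(zeros, "utf-8")
```

A guess like this is only a hint: the bytes still have to decode cleanly in the guessed encoding before they are worth handing to a parser.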

IMHO, of course $:-\$

Logged

Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!)

Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

avk

Hero Member
Posts: 752

Re: Is there a way for TJSONParser to deal with a String/Stream having a BOM

« Reply #13 on: March 11, 2021, 08:48:05 am »

Quote from: gcarreno on March 10, 2021, 10:19:38 pm

How does your test suite help me in terms of telling the user that I'm not going to visualize his file?

I was just trying to say in such a clumsy way that even if you determine the encoding of the text, it does not automatically mean that the TJSONParser will be able to parse it correctly.

Logged

Gustavo 'Gus' Carreno

Hero Member
Posts: 1120
Professional amateur ;-P

Re: Is there a way for TJSONParser to deal with a String/Stream having a BOM

« Reply #14 on: March 11, 2021, 02:21:24 pm »

Hey avk,

Quote from: avk on March 11, 2021, 08:48:05 am

I was just trying to say in such a clumsy way that even if you determine the encoding of the text, it does not automatically mean that the TJSONParser will be able to parse it correctly.

Ahhh, get it now and I agree with you. But I'll still have the parse surrounded by a try..except and will report any exception text from said parser to the user.

At the moment, if TJSONParser encounters a BOM it will error out with a cryptic "cannot find valid JSON at position 1" or something along those lines.
If I can eliminate issues with encoding and BOM before parsing, well, that's one less cryptic message, right?
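
A sketch of that "check first, then parse" flow, assuming the strict policy discussed earlier in the thread (UTF-8 only, no BOM). It is written in Python with its built-in `json` module standing in for the real parser; `load_for_viewer`, `BOM_LABELS` and the message wording are all hypothetical.

```python
import codecs
import json

# Display labels for the message only. UTF-32 LE comes before UTF-16 LE
# because both BOMs begin with the bytes FF FE.
BOM_LABELS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def load_for_viewer(raw: bytes):
    """Return (data, None) on success or (None, message) with a readable reason."""
    for bom, label in BOM_LABELS:
        if raw.startswith(bom):
            return None, (
                "JSON must be UTF-8 without a BOM; "
                f"this file starts with a {label} BOM."
            )
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return None, f"JSON must be UTF-8; invalid byte sequence at offset {err.start}."
    try:
        return json.loads(text), None
    except json.JSONDecodeError as err:
        # Genuine syntax errors still fall back to the parser's own message
        return None, f"Parser error: {err}"
```

The caller should test the message, not the data, for success, because a valid document consisting of `null` yields `(None, None)`. Only errors that survive the encoding checks reach the parser, so its position-based messages then refer to real syntax problems.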

And for that I need a reliable and modern tool, hence me asking if anyone knows of one.

Cheers,
Gus

Logged
