
Author Topic: Treating UTF-8 BOM in TCSVDataSet  (Read 858 times)

egsuh

  • Hero Member
  • *****
  • Posts: 1712
Treating UTF-8 BOM in TCSVDataSet
« on: November 18, 2025, 08:35:40 am »
I think this is not applicable only to TCSVDataSet. Anyway, TCSVDataSet.LoadFromCSVFile causes problems when the CSV file is encoded as UTF-8 with BOM. I can change the encoding with a text editor, etc., but isn't it possible to rewrite the procedure itself to remove the BOM characters?

dbannon

  • Hero Member
  • *****
  • Posts: 3614
    • tomboy-ng, a rewrite of the classic Tomboy
Re: Treating UTF-8 BOM in TCSVDataSet
« Reply #1 on: November 18, 2025, 08:45:32 am »
Is the problem the UTF-8 characters in the dataset or just the BOM itself?

My app reads and writes some XML files. I had a problem (cannot remember the details) but ended up removing the BOM; it is "optional", and my LCL code is quite happy to read UTF-8 without it.

Davo   
Lazarus 3, Linux (and reluctantly Win10/11, OSX Monterey)
My Project - https://github.com/tomboy-notes/tomboy-ng and my github - https://github.com/davidbannon

egsuh

  • Hero Member
  • *****
  • Posts: 1712
Re: Treating UTF-8 BOM in TCSVDataSet
« Reply #2 on: November 18, 2025, 09:06:41 am »
Quote
Is the problem the UTF-8 characters in the dataset or just the BOM itself?

Just the BOM itself. The first field name is not recognized correctly. When I remove the BOM programmatically, there are no problems.
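For anyone searching later, removing the BOM programmatically can be a matter of skipping the first three bytes before loading. A minimal sketch (the helper name is made up, and it assumes TCSVDataset.LoadFromStream reads from the stream's current position):

```pascal
{ uses Classes, SysUtils, csvdataset }
procedure LoadCSVSkippingBOM(ADataset: TCSVDataset; const AFileName: string);
var
  Fs: TFileStream;
  Bom: array[0..2] of Byte;
begin
  Fs := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyNone);
  try
    // UTF-8 BOM is the byte sequence EF BB BF
    if (Fs.Read(Bom, 3) < 3) or (Bom[0] <> $EF) or (Bom[1] <> $BB) or (Bom[2] <> $BF) then
      Fs.Position := 0;  // no BOM found: rewind and load everything
    ADataset.LoadFromStream(Fs);  // BOM case: position is already past it
  finally
    Fs.Free;
  end;
end;
```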

Thausand

  • Sr. Member
  • ****
  • Posts: 445
Re: Treating UTF-8 BOM in TCSVDataSet
« Reply #3 on: November 18, 2025, 09:28:13 am »
Maybe try this workaround for a test:
- create a TStringList
- StringList.LoadFromFile('CSV')
- create a TMemoryStream
- StringList.WriteBOM := False
- StringList.SaveToStream(MemoryStream)
- StringList.Free
- MemoryStream.Position := 0
- TCSVDataset.LoadFromStream(MemoryStream)
- MemoryStream.Free
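The steps above, as a compilable sketch (the procedure name and the way the dataset is passed in are illustrative):

```pascal
{ uses Classes, csvdataset }
procedure LoadViaStringList(ADataset: TCSVDataset; const AFileName: string);
var
  Sl: TStringList;
  Ms: TMemoryStream;
begin
  Sl := TStringList.Create;
  Ms := TMemoryStream.Create;
  try
    Sl.LoadFromFile(AFileName);  // TStrings detects and strips a BOM on load
    Sl.WriteBOM := False;        // ...and must not write it back out
    Sl.SaveToStream(Ms);
    Ms.Position := 0;            // rewind before handing over to the dataset
    ADataset.LoadFromStream(Ms);
  finally
    Sl.Free;
    Ms.Free;
  end;
end;
```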

Khrys

  • Sr. Member
  • ****
  • Posts: 367
Re: Treating UTF-8 BOM in TCSVDataSet
« Reply #4 on: November 18, 2025, 09:31:43 am »
Where do these CSVs come from? UTF-8 with BOM is Microsoft nonsense and shouldn't even exist in the first place; the root cause is the programs that output such rubbish.

Quote
[...] but isn't it possible to rewrite the procedure itself to remove the BOM characters?

You could extend  TCustomCSVDataset  and override some of the protected virtual methods...

egsuh

  • Hero Member
  • *****
  • Posts: 1712
Re: Treating UTF-8 BOM in TCSVDataSet
« Reply #5 on: November 18, 2025, 09:39:11 am »
Quote
Where do these CSVs come from?

Saving an Excel file to .csv. LibreOffice does not add a BOM.

Thausand

  • Sr. Member
  • ****
  • Posts: 445
Re: Treating UTF-8 BOM in TCSVDataSet
« Reply #6 on: November 18, 2025, 09:43:28 am »
Quote
Saving an Excel file to .csv. LibreOffice does not add a BOM.

I don't have Windows and don't have Excel, so I can't test this myself: https://usercomp.com/news/1464236/save-csv-utf-8-without-bom

BrunoK

  • Hero Member
  • *****
  • Posts: 756
  • Retired programmer
Re: Treating UTF-8 BOM in TCSVDataSet
« Reply #7 on: November 18, 2025, 12:14:06 pm »
I don't know the current status of the default Windows encoding, but the BOM distinguishes UTF-8 from UTF-16, and for UTF-16 it also indicates little-endian vs big-endian. UTF-16 can even be detected programmatically without a BOM: if there are many #0 bytes in the stream it is probably UTF-16, and whether the #0 bytes sit at odd or even positions tells you the endianness. Works in general.

Apparently, on macOS (according to Wikipedia), a BOM must be written when the text is not saved as UTF-8, or so I understood, correctly or wrongly...
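A rough sketch of that heuristic (function name and the "many zeros" threshold are arbitrary choices of mine; it only works for largely ASCII content):

```pascal
{ uses SysUtils (for TBytes) }
function LooksLikeUTF16(const B: TBytes; out BigEndian: Boolean): Boolean;
var
  i, ZerosEven, ZerosOdd: Integer;
begin
  ZerosEven := 0;
  ZerosOdd := 0;
  for i := 0 to High(B) do
    if B[i] = 0 then
      if i mod 2 = 0 then Inc(ZerosEven) else Inc(ZerosOdd);
  // "many" #0 bytes => probably UTF-16 (for mostly-ASCII text)
  Result := (ZerosEven + ZerosOdd) > Length(B) div 4;
  { UTF-16BE encodes ASCII as #0 followed by the character,
    so its zero bytes land on even offsets. }
  BigEndian := ZerosEven > ZerosOdd;
end;
```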

LeP

  • Jr. Member
  • **
  • Posts: 71
Re: Treating UTF-8 BOM in TCSVDataSet
« Reply #8 on: November 18, 2025, 12:46:59 pm »
Quote
............... UTF-8 with BOM is Microsoft nonsense and shouldn't even exist in the first place - the root cause is the programs that output such rubbish.

I don't think a BOM is nonsense ... it's like other kinds of files that may have different types of content.
Think about graphics files (JPEG, PNG, BMP) or audio/video files that have a header (FOURCC) at the beginning of the file.

A BOM is needed to correctly identify the "type of characters" (... ANSI, UTF-8, UTF-16, UTF-32, LE/BE) inside a text file, and I think that should be imperative.

A BOM is needed in Windows if you want to use a text file encoded in UTF-8 with Excel (for example).

But finding out whether a text file has a BOM is simple. Maybe every component that uses text files should check for a BOM.
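Indeed, the check only needs to compare the first few bytes against the known signatures. A sketch (the function name is illustrative; note that UTF-32 LE must be tested before UTF-16 LE, because their signatures share the FF FE prefix):

```pascal
{ uses SysUtils (for TBytes). Returns the BOM length so callers can skip it;
  0 means no recognized BOM. }
function BOMLength(const B: TBytes): Integer;
begin
  Result := 0;
  if (Length(B) >= 3) and (B[0] = $EF) and (B[1] = $BB) and (B[2] = $BF) then
    Exit(3);                                               // UTF-8
  if (Length(B) >= 4) and (B[0] = $FF) and (B[1] = $FE) and (B[2] = 0) and (B[3] = 0) then
    Exit(4);                                               // UTF-32 LE
  if (Length(B) >= 4) and (B[0] = 0) and (B[1] = 0) and (B[2] = $FE) and (B[3] = $FF) then
    Exit(4);                                               // UTF-32 BE
  if (Length(B) >= 2) and (B[0] = $FF) and (B[1] = $FE) then
    Exit(2);                                               // UTF-16 LE
  if (Length(B) >= 2) and (B[0] = $FE) and (B[1] = $FF) then
    Exit(2);                                               // UTF-16 BE
end;
```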

egsuh

  • Hero Member
  • *****
  • Posts: 1712
Re: Treating UTF-8 BOM in TCSVDataSet
« Reply #9 on: November 19, 2025, 04:40:44 am »
Quote
Maybe every component that uses text files should check for a BOM.

This is what I'm concerned about. Removing the BOM characters is simple.

Khrys

  • Sr. Member
  • ****
  • Posts: 367
Re: Treating UTF-8 BOM in TCSVDataSet
« Reply #10 on: November 19, 2025, 07:03:05 am »
Quote
............... UTF-8 with BOM is Microsoft nonsense and shouldn't even exist in the first place - the root cause is the programs that output such rubbish.

I don't think a BOM is nonsense ... it's like other kinds of files that may have different types of content.
Think about graphics files (JPEG, PNG, BMP) or audio/video files that have a header (FOURCC) at the beginning of the file.

A BOM is needed to correctly identify the "type of characters" (... ANSI, UTF-8, UTF-16, UTF-32, LE/BE) inside a text file, and I think that should be imperative.

That's what a magic number is for, not a byte order mark (which by itself is nonsense in UTF-8, since there is no endianness to be concerned about). And the BOM fails to do even that (file-type identification) - quoting the post I originally linked:

Quote from: jpsecher on stackoverflow.com
  • It is not sufficient because an arbitrary byte sequence can happen to start with the exact sequence that constitutes the BOM.
  • It is not necessary because you can just read the bytes as if they were UTF-8; if that succeeds, it is, by definition, valid UTF-8.

While I do think that  TCSVDataSet  should be able to handle this, it's a pity that Microsoft forced yet another plague onto us programmers.

LeP

  • Jr. Member
  • **
  • Posts: 71
Re: Treating UTF-8 BOM in TCSVDataSet
« Reply #11 on: November 19, 2025, 09:01:02 am »
Quote
....
That's what a magic number is for, not a byte order mark (which by itself is nonsense in UTF-8 since there is no endianness to be concerned about). And the BOM fails to do even that (file type identification) - quoting the post I originally linked:

No, this is not only for byte order (LE / BE) but for the type of encoding (UTF-8, UTF-16, UTF-32) too. That is, UTF-8 is one of the encodings, NOT THE ONLY ONE.

Quote from: jpsecher on stackoverflow.com
  • It is not sufficient because an arbitrary byte sequence can happen to start with the exact sequence that constitutes the BOM.
  • It is not necessary because you can just read the bytes as if they were UTF-8; if that succeeds, it is, by definition, valid UTF-8.
While I do think that  TCSVDataSet  should be able to handle this, it's a pity that Microsoft forced yet another plague onto us programmers.

Speaking about chars, of course a text file should contain only text chars .....  ::) .... but with Unicode that is no longer true, and a text file can now contain any "byte". This is a good reason to ALWAYS use a BOM !!!
And don't blame Microsoft, blame whoever thought up "TCSVDataSet" ....

dbannon

  • Hero Member
  • *****
  • Posts: 3614
    • tomboy-ng, a rewrite of the classic Tomboy
Re: Treating UTF-8 BOM in TCSVDataSet
« Reply #12 on: November 19, 2025, 11:35:44 pm »
Wikipedia describes a CSV file -

 (CSV) is a plain text data format for storing ...simplicity of use and human readability.


The BOM is 'invisible' and does not belong in a plain text file. It cannot be seen with, e.g., the cat or type command, so it is hardly human readable.

Davo
Lazarus 3, Linux (and reluctantly Win10/11, OSX Monterey)
My Project - https://github.com/tomboy-notes/tomboy-ng and my github - https://github.com/davidbannon

cdbc

  • Hero Member
  • *****
  • Posts: 2522
    • http://www.cdbc.dk
Re: Treating UTF-8 BOM in TCSVDataSet
« Reply #13 on: November 20, 2025, 07:18:02 am »
Hi
Back in the day, I used to use the following to load textfiles into stringlists:
Code: Pascal  [Select][+][-]
var
  bcUTF8bom: string = #$ef#$bb#$bf;
  ...

procedure bcLoadFromFileUTF8(aStrings: TStrings; const aFilename: string);
var
  Fs: TFileStream;
  Buf: string;
  Cp: word;
begin
  Fs:= TFileStream.Create(aFilename,fmOpenRead or fmShareDenyNone);
  try
    SetLength({%H-}Buf,3);
    Fs.Read(Buf[1],3);
    if Buf = bcUTF8bom then begin
      aStrings.LoadFromStream(Fs); // BOM found: position is already past it
    end else begin
      Fs.Seek(0,soFromBeginning);
      aStrings.LoadFromStream(Fs);
      Cp:= bcGuessEncoding(aStrings.Text); // similar to the one in LazUTF8.pas
      aStrings.Text:= bcEncodeUTF8(aStrings.Text,Cp); // similar to LConvEncoding
    end;
  finally Fs.Free; end;
end; { bcLoadFromFileUTF8 }

procedure bcSaveToFileUTF8(aStrings: TStrings; const aFilename: string);
var
  Fs: TFileStream;
begin
  Fs:= TFileStream.Create(aFilename,fmCreate or fmOpenWrite);
  try
    Fs.Write(bcUTF8bom[1],length(bcUTF8bom));
    aStrings.SaveToStream(Fs); // only utf8 encoding from app
  finally Fs.Free; end;
end; { bcSaveToFileUTF8 }

Maybe you can tweak it to your use-case...
Regards Benny
If it ain't broke, don't fix it ;)
PCLinuxOS(rolling release) 64bit -> KDE6/QT6 -> FPC Release -> Lazarus Release &  FPC Main -> Lazarus Main

 
