Topic: TStrings encoding and {$codepage utf8}

Bart

TStrings encoding and {$codepage utf8}
« on: December 26, 2019, 04:09:59 pm »
Hi,

I asked this on ML, but got no reply, so I'm trying here now.

Consider this code:

Code: Pascal
{$codepage utf8}
{$mode objfpc}
{$H+}

uses
  SysUtils, Classes;

var
  SL: TStringList;
  S: String;

begin
  writeln('DefaultSystemCodePage = ',DefaultSystemCodePage);
  SL := TStringList.Create;
  // WriteBom and the encoding parameter of SaveToFile are only used on newer compilers
  {$if fpc_fullversion > 30200}
  SL.WriteBom := False;
  {$endif}
  SL.SkipLastLineBreak := True;
  S := 'ä';  //S has CodePage CP_UTF8
  SL.Add(S);
  SL.SaveToFile('slU.txt'{$if fpc_fullversion > 30200}, TEncoding.UTF8{$endif});
  SL.SaveToFile('slA.txt'{$if fpc_fullversion > 30200}, TEncoding.ANSI{$endif});
  SL.Free;
end.

Tested with fpc trunk (from a few days ago).
It outputs:
Code:
DefaultSystemCodePage = 1252
(I'm on Windows as you might have guessed)
The file slA.txt contains the bytes C3 A4 (which is ä in UTF8 encoding)
The file slU.txt contains the bytes C3 83 C2 A4

I struggle to understand why.
What is the codepage of the stringlist's internal list of strings (array of TStringItem)?
It seems that the stringlist considers its internal TStringItem.FString, which holds #$C3#$A4, to have a codepage of CP_ACP (always)?


I just tested this variant:
In a new unit which does have {$codepage utf8}, declare a const like
Code: Pascal
const
  _A8 = 'ä';
In the main source file remove the {$codepage utf8}.
In the uses clause of the main source file add the newly created unit.
Replace the line
Code: Pascal
  S := 'ä';
with
Code: Pascal
  S := _A8;

Now build and run.
The file slA.txt will contain the single byte E4 ('ä' in codepage 1252).
The file slU.txt will contain the bytes C3 A4 ('ä' in UTF8).

Is this a bug?

Bart
« Last Edit: December 26, 2019, 04:23:45 pm by Bart »

jamie

« Reply #1 on: December 27, 2019, 02:10:12 pm »
Code page translation looks correct to me.

https://www.ascii-code.com/


Scroll down; that is the standard extended ASCII table.

As for the translation to UTF8, well, that could be due to other issues.
The only true wisdom is knowing you know nothing

Bart

« Reply #2 on: December 27, 2019, 05:58:39 pm »
The issue at hand is that when I tell the stringlist to save its contents as ANSI, it saves them as UTF8.
When I tell it to save as UTF8, it UTF8-encodes content that already was UTF8.
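
The C3 83 C2 A4 sequence is consistent with that description: if the bytes C3 A4 are treated as codepage 1252 text ('Ã' and '¤') and then converted to UTF8, each byte becomes a two-byte sequence. A minimal sketch of that effect, assuming FPC 3.0 or later. HexOf and EncodeAgain are hypothetical helper names, and the sketch only reproduces the byte pattern; it does not claim to show what TStrings does internally.

```pascal
program DoubleEncodeDemo;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // code page conversion support on Unix-like systems
  SysUtils;

// Hypothetical helper: formats the bytes of S as space-separated hex.
function HexOf(const S: RawByteString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    if I > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[I]), 2);
  end;
end;

// Hypothetical helper: treats bytes that already are UTF-8 as CP1252 text
// and converts that "text" to UTF-8, i.e. encodes it a second time.
function EncodeAgain(const Utf8Bytes: RawByteString): RawByteString;
begin
  Result := Utf8Bytes;
  SetCodePage(Result, 1252, False);    // relabel only, the bytes are untouched
  SetCodePage(Result, CP_UTF8, True);  // real conversion CP1252 -> UTF-8
end;

var
  R: RawByteString;
begin
  R := #$C3#$A4;  // UTF-8 bytes of 'ä'
  WriteLn('before: ', HexOf(R));
  WriteLn('after:  ', HexOf(EncodeAgain(R)));
end.
```

On a system where code page conversion is available (Windows, or a Unix-like system with cwstring), the second line is expected to show C3 83 C2 A4, the same bytes as in slU.txt.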

Bart

ASerge

« Reply #3 on: December 28, 2019, 08:36:11 am »
I don't think TStrings has anything to do with it. The following gives the same effect:
Code: Pascal
{$APPTYPE CONSOLE}
{$codepage utf8}
{$mode objfpc}
{$H+}

uses SysUtils;

// Print the byte count followed by each byte as two hex digits.
procedure WriteBytes(const B: TBytes);
var
  N: Byte;
begin
  Write(Length(B), ': ');
  for N in B do
    Write(HexStr(N, 2));
  Writeln;
end;

var
  S: string;
  B: TBytes;
begin
  Writeln('DefaultSystemCodePage = ', DefaultSystemCodePage);
  S := 'ä';  // S has CodePage CP_UTF8
  B := TEncoding.UTF8.GetAnsiBytes(S);
  WriteBytes(B);
  B := TEncoding.ANSI.GetAnsiBytes(S);
  WriteBytes(B);
  Readln;
end.

jamie

« Reply #4 on: December 28, 2019, 07:53:22 pm »
Isn't the StringList doing a bulk read/write operation? I don't see why there should be any conversions taking place.

Bart

« Reply #5 on: December 29, 2019, 11:59:29 pm »
OK, it seems I misunderstood the meaning of TStrings.Encoding/DefaultEncoding.
It is only used in LoadFrom*/SaveTo*.

I would have expected that when you add a string with a different encoding to TStrings, it would convert that string to the current encoding of TStrings.

It seems you can have strings with different encodings inside TStrings [1].
When doing a SaveTo*, the internal strings are not added together (as when concatenating strings); instead the resulting string is cobbled together using System.Move, so all codepage information of the individual strings gets lost.
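
To see which codepage each stored string reports, one can ask StringCodePage for every item. A minimal probe, assuming FPC 3.0 or later. The two strings are built from explicit byte values so the result does not depend on the encoding of the source file, and what the program prints may vary with compiler version and platform.

```pascal
program MixedCodePages;
{$mode objfpc}{$H+}

uses
  SysUtils, Classes;

var
  SL: TStringList;
  U, A: string;
  I: Integer;
begin
  U := #$C3#$A4;                                  // UTF-8 bytes of 'ä'
  SetCodePage(RawByteString(U), CP_UTF8, False);  // label only, no conversion
  A := #$E4;                                      // CP1252 byte of 'ä'
  SetCodePage(RawByteString(A), 1252, False);     // label only, no conversion

  SL := TStringList.Create;
  try
    SL.Add(U);
    SL.Add(A);
    // Report the dynamic code page and byte length of every stored item.
    for I := 0 to SL.Count - 1 do
      WriteLn('Item ', I, ': code page ', StringCodePage(SL[I]),
              ', length ', Length(SL[I]));
  finally
    SL.Free;
  end;
end.
```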

[1] In the first example, also add the string 'ä' a second time, but now read it from the console.
You will then have added a UTF8-encoded 'ä' and a 1-byte encoded 'ä' to the stringlist.
The StringList.Text will now have the following bytes: C3 A4 0D 0A E4.
(C3 A4 = UTF8-encoded 'ä', 0D 0A = LineEnding, E4 = 'ä' in codepage 1252)
This means that when my source code is UTF8-encoded I am unable to save a TStrings as UTF8.

Everything goes well again if I change DefaultSystemCodePage to CP_UTF8.
I fail to see why that is the case though.
In both cases it will in fact use FDefaultEncoding.GetAnsiBytes to do the encoding, and in both cases FDefaultEncoding is the same.

Bart

winni

« Reply #6 on: December 30, 2019, 12:45:03 am »
Hi!

TStrings/TStringList does not know anything about the encodings of its items.

If you put a UTF8 string in one line and an ANSI-encoded one in the next, then it correctly puts all of that together and writes: utf8-ä  DOS-LineEnding ANSI-ä. So what is wrong with that?

You have to take care that the input is encoded right.

If the circumstances do not allow that, then have a look at this utf8-tester:

https://github.com/Alexey-T/ATSynEdit/blob/master/atsynedit/atstringproc_utf8detect.pas

With that code you can check if the string is utf-8, but I think it needs more than a single ä. If it is not utf8, it is maybe ANSI, with an unknown codepage.
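
The idea behind such a check can be sketched in a few lines. This is a deliberately simplified illustration and not the ATSynEdit code: LooksLikeUtf8 is a hypothetical name, and the function only checks lead and continuation bytes (it does not reject overlong forms or surrogates). Pure ASCII input passes as well, so a positive result says little for text without any non-ASCII characters.

```pascal
program Utf8Check;
{$mode objfpc}{$H+}

// Returns True when every byte of S belongs to a structurally valid
// UTF-8 sequence (lead byte followed by the right number of $80..$BF bytes).
function LooksLikeUtf8(const S: RawByteString): Boolean;
var
  I, K, Len, Need: Integer;
  B: Byte;
begin
  Result := False;
  Len := Length(S);
  I := 1;
  while I <= Len do
  begin
    B := Ord(S[I]);
    if B < $80 then
      Need := 0                          // plain ASCII
    else if (B >= $C2) and (B <= $DF) then
      Need := 1                          // lead byte of a 2-byte sequence
    else if (B >= $E0) and (B <= $EF) then
      Need := 2                          // lead byte of a 3-byte sequence
    else if (B >= $F0) and (B <= $F4) then
      Need := 3                          // lead byte of a 4-byte sequence
    else
      Exit;                              // stray continuation or invalid lead byte
    if I + Need > Len then
      Exit;                              // sequence is cut off
    for K := 1 to Need do
      if (Ord(S[I + K]) and $C0) <> $80 then
        Exit;                            // expected a continuation byte
    Inc(I, Need + 1);
  end;
  Result := True;
end;

begin
  WriteLn(LooksLikeUtf8(#$C3#$A4));  // UTF-8 'ä'   -> TRUE
  WriteLn(LooksLikeUtf8(#$E4));      // CP1252 'ä'  -> FALSE
  WriteLn(LooksLikeUtf8('abc'));     // ASCII only  -> TRUE
end.
```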

If you mix codepages, it is up to you to keep track!

Winni
« Last Edit: December 30, 2019, 12:46:45 am by winni »

Bart

« Reply #7 on: December 30, 2019, 01:17:20 am »
Quote from: winni
You have to take care that the input is encoded right.

I realize that, and most of my source code does not have a codepage directive in it, only the files that have string constants that are not ASCII.
So, that makes it a bit complicated for me.

Any idea why it works OK with DefaultSystemCodePage = CP_UTF8?

Bart

winni

« Reply #8 on: December 30, 2019, 01:46:01 am »
Hi!

From your LineEnding I suppose you are working with Windows, but I know there is also a Windows port of iconv around.

The idea is: convert your sources from the Windows default codepage to utf8 with iconv. The complete Pascal source is by definition 7-bit ASCII. The only special chars can be in strings and comments.

iconv is the Swiss Army knife for converting one codepage to another. The list of possible codepages for input/output is nearly endless.

The original Linux iconv is described here:

https://www.tecmint.com/convert-files-to-utf-8-encoding-in-linux/


How to get the Windows port is described here:

https://dbaportal.eu/2012/10/24/iconv-for-windows/


Winni

Bart

« Reply #9 on: December 30, 2019, 02:39:59 pm »
My problem is not that I don't know how to convert between codepages, but that TStrings requires that you specify a codepage when reading/writing, unless you use DefaultEncoding (which is the Windows codepage).

With that in mind I started playing around just to see how this is working.
Which proved to be a little counterintuitive (for me).

I have a few console applications that use a TStringList, and these programs have DefaultSystemCodePage = 1252, but they are supposed to handle textfiles in an encoding agnostic way.
So, I'm trying to figure out how this will behave with the upcoming 3.2 release of the compiler.

Bart

winni

« Reply #10 on: December 30, 2019, 03:27:17 pm »
Hi!

I think you can wait for some more releases of the compiler, but that won't solve your problem!

Do you know these two functions:
Code: Pascal
function AnsiToUtf8(const s : RawByteString): RawByteString;
function Utf8ToAnsi(const s : RawByteString) : RawByteString;
They should solve your problems.
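
A minimal sketch of how these two functions might be called, assuming FPC 3.0 or later. The input byte is labelled as codepage 1252 explicitly; Utf8ToAnsi converts to the system ANSI codepage, so the length of the round-trip result depends on DefaultSystemCodePage.

```pascal
program AnsiUtf8RoundTrip;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // code page conversion support on Unix-like systems
  SysUtils;

var
  A, U, Back: RawByteString;
begin
  A := #$E4;                    // 'ä' as a single byte in CP1252
  SetCodePage(A, 1252, False);  // label the byte, no conversion
  U := AnsiToUtf8(A);           // convert to UTF-8
  Back := Utf8ToAnsi(U);        // convert to the system ANSI code page
  WriteLn('ANSI bytes:  ', Length(A));
  WriteLn('UTF-8 bytes: ', Length(U));
  WriteLn('round trip:  ', Length(Back));
end.
```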

Winni

Bart

« Reply #11 on: December 30, 2019, 11:12:40 pm »
Quote from: winni
Do you know those two functions:

You are missing the point here.
My program will load some textfile into a TStringList, perform some operation on the strings, then write to file.
The encoding of the textfile is unknown (well, it's not Unicode/WideString).
(None of the operations on the strings are codepage dependent.)

In 3.0.4 this works like a charm.
Will the new TStrings.Encoding / TStrings.DefaultEncoding mess this up?

Hence my tests like above.
Hence my question: why does it work if I set DefaultSystemCodePage to CP_UTF8 (in the first example)?

Mostly I have these questions because Lazarus / Lazarus applications have to face this too.
The current Lazarus solution for this is not optimal and breaks e.g. ODBC connections that expect data in ANSI encoding.
Without the current solution, however, almost all Lazarus programs that use TStrings.LoadFrom*/SaveTo* are broken with fpc 3.2 fixes and trunk.

Bart

 
