Author Topic: UnicodeString assignment from string literals seem broken in ObjFpc mode

Arioch

  • Sr. Member
  • ****
  • Posts: 421
UPD. https://gitlab.com/freepascal.org/fpc/source/-/issues/39923

Win7 / Laz 2.2.3 / FPC 3.2.3

Default LCL project,  {$mode objfpc}{$H+}     

Code: Pascal
procedure TForm1.FormClick(Sender: TObject);
var
  a: ansistring;
  sa, sc: unicodestring;
begin
  a := 'ABCD БЮЖД ABCD';
  sa := a;                  // ansistring converted to unicodestring at run time
  sc := 'ABCD БЮЖД ABCD';   // literal assigned directly to a unicodestring

  // Prefix each string with its code page and element size
  a := IntToStr(StringCodePage(a)) + ' / ' + IntToStr(StringElementSize(a)) + '  ' + a;
//  ShowMessage(a);
  sa := IntToStr(StringCodePage(sa)) + ' / ' + IntToStr(StringElementSize(sa)) + '  ' + sa;
//  ShowMessage(sa);
  sc := IntToStr(StringCodePage(sc)) + ' / ' + IntToStr(StringElementSize(sc)) + '  ' + sc;
//  ShowMessage(sc);
  ShowMessage(a + #13#10 + sa + #13#10 + sc);
end;

We should see 3 almost identical lines, right?
No!
« Last Edit: September 26, 2022, 10:03:43 pm by Arioch »

Arioch

« Reply #1 on: September 26, 2022, 07:42:09 pm »
And with explicit conversions: mostly the same, but with a small difference in the first one.

Code: Pascal
  sa := '';

  // Case 1: convert to 1251, then relabel as UTF-8 without converting
  a := 'ABCD БЮЖД ABCD';
  SetCodePage(RawByteString(a), 1251, True);
  SetCodePage(RawByteString(a), 65001, False);
  a := IntToStr(StringCodePage(a)) + ' / ' + IntToStr(StringElementSize(a)) + '  ' + a;
  sa := sa + a + #13#10;

  // Case 2: same as case 1, followed by a real conversion back to 1251
  a := 'ABCD БЮЖД ABCD';
  SetCodePage(RawByteString(a), 1251, True);
  SetCodePage(RawByteString(a), 65001, False);
  SetCodePage(RawByteString(a), 1251, True);
  a := IntToStr(StringCodePage(a)) + ' / ' + IntToStr(StringElementSize(a)) + '  ' + a;
  sa := sa + a + #13#10;

  // Case 3: convert to UTF-8, relabel as 1251, then convert to UTF-8 again
  a := 'ABCD БЮЖД ABCD';
  SetCodePage(RawByteString(a), 65001, True);
  SetCodePage(RawByteString(a), 1251, False);
  SetCodePage(RawByteString(a), 65001, True);
  a := IntToStr(StringCodePage(a)) + ' / ' + IntToStr(StringElementSize(a)) + '  ' + a;
  sa := sa + a + #13#10;

  ShowMessage(sa);

ASerge

  • Hero Member
  • *****
  • Posts: 2242
« Reply #2 on: September 26, 2022, 08:36:56 pm »
Quote
We should see 3 almost identical lines, right?
No!
By default, the source code is considered to be in the OS encoding. Therefore, the compiler treats a string constant as being in the OS encoding and stores it in the executable file in this form, although the original file contains it in UTF-8 encoding. During execution, a conversion occurs.
You can change the compiler's "opinion" in either of two ways:
1. Save the file as UTF-8 with BOM (via the context menu)
2. Add an explicit directive {$CODEPAGE UTF8}

Arioch

« Reply #3 on: September 26, 2022, 08:49:43 pm »
Quote
string constant in the OS encoding and stores it in the executable file in this form

1. No. The EXE file has those strings in UTF-8, not in windows-1251.

2. If that were so, then both ansistrings (UTF-8 in LCL) and unicodestrings (UTF-16) would be uniformly damaged in one mode and uniformly fixed in the other.

What we see here is that the compiler either fails to tag string literals with the proper encoding or fails to perform an encoding check when assigning from literals...

Jonas Maebe

  • Hero Member
  • *****
  • Posts: 1059
« Reply #4 on: September 26, 2022, 09:25:06 pm »
Delphi versions with string = unicodestring interpret constant strings in your source file according to the default code page of the machine you are compiling on. If you want to have this behaviour in FPC, add {$modeswitch systemcodepage} (_after_ any {$mode xxx} directives), or use {$mode delphiunicode} (this also changes the string into an alias for unicodestring).
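
A minimal sketch of the directive order described above, assuming a plain console program (the program name is made up for illustration). A mode directive resets the mode switches, which is why the modeswitch has to come after it. The program only prints DefaultSystemCodePage, the RTL's run-time default code page, so the output depends on the machine it runs on.

```pascal
program DirectiveOrder;  // hypothetical name, for illustration only

{$mode objfpc}{$H+}
// The mode directive above resets all mode switches, so the switch goes after it
{$modeswitch systemcodepage}

begin
  // Run-time default code page of this machine; the value is machine-dependent
  WriteLn('DefaultSystemCodePage = ', DefaultSystemCodePage);
end.
```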

Arioch

« Reply #5 on: September 26, 2022, 09:34:21 pm »
Quote
Delphi versions with string = unicodestring interpret constant strings in your source file according to the default code page of the machine you are compiling on.

All Delphi versions do. As long as you did not set source files to be UTF-8.

So does Lazarus.

Otherwise the source editor would be unable to render fonts properly.

Delphi 2009+ then, from what i noticed, would store both constants in the EXE - ansi and unicode. It does not merge string literals across encodings.
Perhaps FPC should do the same.

Quote
add {$modeswitch systemcodepage} (_after_ any {$mode xxx} directives)

Oh, now the strings are uniform indeed - they all are broken :-)

Arioch

« Reply #6 on: September 26, 2022, 09:41:05 pm »
i did not change Laz defaults, so it is UTF-8.

Also, UTF-8 does NOT need a BOM, because it is a byte-oriented format.
A BOM is needed for UCS-2 / UTF-16 and perhaps for UCS-4 / UTF-32.

The sources are in UTF-8, as promised by the IDE.

HMMM... the EXE does contain both literals in FPC too.

But the second one has a double conversion!!!

It mistook the UTF-8 sources for the ANSI codepage, and then applied the wrong conversion: 1251 -> UTF16 instead of the correct UTF8 -> UTF16.

Jonas Maebe

« Reply #7 on: September 26, 2022, 09:49:06 pm »
Quote
Delphi versions with string = unicodestring interpret constant strings in your source file according to the default code page of the machine you are compiling on.

All Delphi versions do. As long as you did not set source files to be UTF-8.
The difference is that previous Delphi versions did not store the code page of the ansistring constant in the binary (because ansistrings did not have a code page field). As a result, regardless of the default code page of the machine on which the source code was compiled, at run time an ansistring would always be interpreted according to the code page of the machine the executable was running on. This changed with Delphi 2009, because it now adds the code page of the compiling machine to the ansistring constants. As a result, the behaviour of an executable no longer depends on which machine it runs on, but on which machine it was compiled on (which I don't think is that great of an improvement).

Quote
So does Lazarus.

Otherwise the source editor would be unable to render fonts properly.
By default, FPC (the compiler used by Lazarus) stores "0" (CP_ACP) as the code page for an ansistring (as your first test shows). This results in behaviour compatible with previous FPC versions and also with Delphi < 2009.

Quote
add {$modeswitch systemcodepage} (_after_ any {$mode xxx} directives)

Oh, now the strings are uniform indeed - they all are broken :-)
I see you already figured this one out in the meantime. See also https://wiki.freepascal.org/Unicode_Support_in_Lazarus (and https://wiki.freepascal.org/FPC_Unicode_support for the FPC side)

Jonas Maebe

« Reply #8 on: September 26, 2022, 09:58:21 pm »
Quote
i did not change Laz defaults, so it is UTF-8.

Also, UTF-8 does NOT need a BOM, because it is a byte-oriented format.
A BOM is needed for UCS-2 / UTF-16 and perhaps for UCS-4 / UTF-32.

The sources are in UTF-8, as promised by the IDE.

HMMM... the EXE does contain both literals in FPC too.

But the second one has a double conversion!!!

It mistook the UTF-8 sources for the ANSI codepage, and then applied the wrong conversion: 1251 -> UTF16 instead of the correct UTF8 -> UTF16.
That's because {$modeswitch systemcodepage} forces the compiler to interpret all constant strings in your source file as being encoded in the system code page (which is presumably 1251 in your case). The UTF-8 encoding gets specified on the command line by Lazarus, but this directive overrides it (since it appears inside the source file, and hence is parsed after the command-line options). So this is working as intended, and it seems something else is going on after all in your initial program (I forgot about Lazarus defaulting to UTF-8 everywhere; I'm not a Lazarus user but a compiler developer).
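
The effect of reading UTF-8 bytes as code page 1251 can be reproduced at run time without any non-ASCII text in the source. This is only a sketch under assumptions: the program name and the HexUnits helper are made up for illustration, and the conversion relies on the platform's widestring manager knowing code page 1251 (hence cwstring on Unix). Under windows-1251 the bytes D0 91 are two separate characters, whereas as UTF-8 they form the single code point U+0411, so the two printed lines should differ.

```pascal
program MislabelDemo;  // hypothetical name, for illustration only

{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif} SysUtils;

// Hypothetical helper: hex dump of the UTF-16 code units of a string
function HexUnits(const S: UnicodeString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
    Result := Result + IntToHex(LongInt(Ord(S[i])), 4) + ' ';
end;

var
  Raw: RawByteString;
  U: UnicodeString;
begin
  // The two UTF-8 bytes of Cyrillic U+0411, written as explicit bytes so the
  // example does not depend on how this source file itself is encoded
  Raw := #$D0#$91;

  // Wrong label: tag the bytes as windows-1251 (Convert = False changes the tag only)
  SetCodePage(Raw, 1251, False);
  U := Raw;  // run-time conversion based on the string's code page tag
  WriteLn('read as 1251:  ', HexUnits(U));

  // Right label: the same bytes tagged as UTF-8
  SetCodePage(Raw, CP_UTF8, False);
  U := Raw;
  WriteLn('read as UTF-8: ', HexUnits(U));
end.
```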

Jonas Maebe

« Reply #9 on: September 26, 2022, 10:03:10 pm »
Code: Pascal
{$mode objfpc}{$h+}

{$codepage utf8}

uses
  {$ifdef unix}cwstring,{$endif}sysutils;

procedure test;
var
  a: ansistring;
  sa, sc: unicodestring;
begin
  a := 'ABCD БЮЖД ABCD';
  sa := a;
  sc := 'ABCD БЮЖД ABCD';

  // Prefix each string with its code page and element size
  a := IntToStr(StringCodePage(a)) + ' / ' + IntToStr(StringElementSize(a)) + '  ' + a;
//  ShowMessage(a);
  sa := IntToStr(StringCodePage(sa)) + ' / ' + IntToStr(StringElementSize(sa)) + '  ' + sa;
//  ShowMessage(sa);
  sc := IntToStr(StringCodePage(sc)) + ' / ' + IntToStr(StringElementSize(sc)) + '  ' + sc;
//  ShowMessage(sc);
  writeln(a);
  writeln(sa);
  writeln(sc);
end;

begin
  test;
end.
This program prints the following for me. So it seems the issue is related to the interaction with the LCL. The string codepage 0 for sa is weird though.

Arioch

« Reply #10 on: September 26, 2022, 10:03:19 pm »
Quote
By default, FPC (the compiler used by Lazarus) stores "0" (CP_ACP) as the code page for an ansistring (as your first test shows).

Frankly, the default LCL setup pretending CP_ACP = CP_UTF8 breaks a number of standard coding patterns.
The last time i was hit by it was while debugging Windows GDI code in Castle Game Engine.
The normal unicodestring := ansistring conversion was broken there.

So i think this LCL choice was never correct and was not even the most practical one. But - what was done is done.

However, for UTF-16 strings the encoding is different (and frankly, it should be ignored as long as "char size = 2" is detected).

It is not a wrong charset tag that is stored - it is the wrong data stream: a wrong byte representation is put in place of the UTF-16 stream.

Arioch

« Reply #11 on: September 26, 2022, 10:11:00 pm »
Quote
This program prints the following for me. So it seems the issue is related to the interaction with the LCL. The string codepage 0 for sa is weird though.

is it? or is it about the back-end? i do not think `strings` would show both UTF-8 and UTF-16 streams in your ELF, but who knows...

Arrrgh!!!

I didn't see it, i am not interested in it.

Arioch

« Reply #12 on: September 26, 2022, 10:11:44 pm »
kludges do not work either

Arioch

« Reply #13 on: September 26, 2022, 10:15:08 pm »
Notice the same UTF-8 bytes, just diluted with zero bytes.

That would be correct for a quick-hack Latin1 to UCS2 conversion, and mostly correct for West-European cp1252 to UTF-16, but NOT for UTF-8.

Jonas Maebe

« Reply #14 on: September 26, 2022, 10:21:22 pm »
Quote
By default, FPC (the compiler used by Lazarus) stores "0" (CP_ACP) as the code page for an ansistring (as your first test shows).

Frankly, the default LCL setup pretending CP_ACP = CP_UTF8 breaks a number of standard coding patterns.
The last time i was hit by it was while debugging Windows GDI code in Castle Game Engine.
The normal unicodestring := ansistring conversion was broken there.

So i think this LCL choice was never correct and was not even the most practical one. But - what was done is done.
It was done because it was the only way to keep backward compatibility with previous code (which Delphi didn't care about).

Quote
However, for UTF-16 strings the encoding is different (and frankly, it should be ignored as long as "char size = 2" is detected).
The code page in a UTF-16 string is ignored by the FPC RTL. However, if you assign a constant string in the source file to a unicodestring, and the source file is not written in UTF-16 (I think we don't even support UTF-16 source files), then the compiler must either
a) store the string constant as it appeared in the source file, and convert it at run time to UTF-16, or
b) convert it to UTF-16 at compile time

The compiler nowadays does b), and it's this part that's going wrong if the code page of the source file is not explicitly specified. See https://wiki.freepascal.org/FPC_Unicode_support#String_constants
Quote
The compiler has to know the code page according to which it should interpret string constants, as it may have to convert them at compile time. Normally, a string constant is interpreted according to the source file codepage. If the source file codepage is CP_ACP, a default is used instead: in that case, during conversions the constant strings are assumed to have code page 28591 (ISO 8859-1 Latin 1; Western European).

To be sure: in what code page is your source file that you are compiling, both in Delphi and in Lazarus?
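
The "UTF-8 diluted with zero bytes" pattern is what one would expect if UTF-8 bytes are converted as ISO 8859-1 (code page 28591), since in that code page every byte value equals its code point. A minimal sketch of that effect, with hypothetical helper names (WidenAsLatin1, HexUnits) and explicit bytes so that the source file's own encoding does not matter; it imitates the conversion by hand and does not call into the compiler's actual conversion path:

```pascal
program WidenDemo;  // hypothetical name, for illustration only

{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: widen each byte to one UTF-16 code unit, which is
// what an ISO 8859-1 to UTF-16 conversion amounts to
function WidenAsLatin1(const Bytes: RawByteString): UnicodeString;
var
  i: Integer;
begin
  Result := '';
  SetLength(Result, Length(Bytes));
  for i := 1 to Length(Bytes) do
    Result[i] := WideChar(Ord(Bytes[i]));
end;

// Hypothetical helper: hex dump of the UTF-16 code units of a string
function HexUnits(const S: UnicodeString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
    Result := Result + IntToHex(LongInt(Ord(S[i])), 4) + ' ';
end;

var
  Utf8Bytes: RawByteString;
begin
  // UTF-8 encoding of Cyrillic U+0411, written as explicit bytes
  Utf8Bytes := #$D0#$91;

  // Each byte becomes its own code unit: 00D0 0091
  WriteLn('widened byte by byte: ', HexUnits(WidenAsLatin1(Utf8Bytes)));

  // A real UTF-8 decode yields the single code unit 0411
  WriteLn('decoded as UTF-8:     ', HexUnits(UTF8Decode(Utf8Bytes)));
end.
```

On a little-endian machine the widened units are stored as D0 00 91 00, that is, the original UTF-8 bytes with a zero byte after each one.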

 
