
UnicodeString assignment from string literals seems broken in ObjFpc mode


Arioch:

--- Quote from: Jonas Maebe on September 26, 2022, 09:25:06 pm ---Delphi versions with string = unicodestring interpret constant strings in your source file according to the default code page of the machine you are compiling on.

--- End quote ---

All Delphi versions do. As long as you did not set source files to be UTF-8.

So does Lazarus.

Otherwise the source editor would be unable to render fonts properly.

Delphi 2009+, from what I noticed, then stores both constants in the EXE, ANSI and Unicode. It does not merge string literals across encodings.
Perhaps FPC should do the same.


--- Quote from: Jonas Maebe on September 26, 2022, 09:25:06 pm --- add {$modeswitch systemcodepage} (_after_ any {$mode xxx} directives)

--- End quote ---

Oh, now the strings are uniform indeed - they all are broken :-)

Arioch:
I did not change the Laz defaults, so it is UTF-8.

Also, UTF-8 does NOT need a BOM, because it is a byte-oriented format.
A BOM is needed for UCS2 / UTF-16 and perhaps for UCS4 and UTF-32.

The sources are UTF-8, as promised by the IDE.

HMMM...  The EXE does contain both literals in FPC too.

But the second one has a double conversion!!!

It mistook the UTF-8 sources for the ANSI codepage, and then applied the wrong conversion, 1251 -> UTF16 instead of the correct UTF8 -> UTF16.






Jonas Maebe:

--- Quote from: Arioch on September 26, 2022, 09:34:21 pm ---
--- Quote from: Jonas Maebe on September 26, 2022, 09:25:06 pm ---Delphi versions with string = unicodestring interpret constant strings in your source file according to the default code page of the machine you are compiling on.

--- End quote ---

All Delphi versions do. As long as you did not set source files to be UTF-8.

--- End quote ---
The difference is that previous Delphi versions did not store the code page of the ansistring constant in the binary (because ansistrings did not have a code page field). As a result, regardless of the default code page of the machine on which the source code was compiled, at run time an ansistring would always be interpreted according to the code page of the machine the executable was running on. This changed with Delphi 2009, because it now adds the code page of the compiling machine to the ansistring constants. As a result, the behaviour of an executable no longer depends on which machine it runs on, but on which machine it was compiled on (which I don't think is that great of an improvement).


--- Quote ---So does Lazarus.

Otherwise the source editor would be unable to render fonts properly.

--- End quote ---
By default, FPC (the compiler used by Lazarus) stores "0" (CP_ACP) as the code page for an ansistring (as your first test shows). This results in behaviour compatible with previous FPC versions and also with Delphi < 2009.
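The code page field mentioned above can be inspected directly with StringCodePage. A minimal sketch, assuming FPC 3.x; the program name, the type name TCp1251String and the variable names are made up for illustration and do not come from the thread:

```pascal
program CodePageFieldDemo; { hypothetical name }
{$mode objfpc}{$h+}

type
  { An ansistring type with a declared code page (illustrative name). }
  TCp1251String = type AnsiString(1251);

var
  plain: AnsiString;     { plain ansistring, no declared code page }
  typed: TCp1251String;  { declared as code page 1251 }
  u8: UTF8String;        { declared as code page 65001 }
begin
  plain := 'ABCD';
  typed := 'ABCD';
  u8 := 'ABCD';

  { StringCodePage returns the code page stored in the string header. }
  WriteLn('AnsiString:    ', StringCodePage(plain));
  WriteLn('TCp1251String: ', StringCodePage(typed));
  WriteLn('UTF8String:    ', StringCodePage(u8));
end.
```

With default settings the first line would be expected to show 0 (CP_ACP), matching the first test discussed here, while the other two should report the code page declared for their type.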


--- Quote ---
--- Quote from: Jonas Maebe on September 26, 2022, 09:25:06 pm --- add {$modeswitch systemcodepage} (_after_ any {$mode xxx} directives)

--- End quote ---

Oh, now the strings are uniform indeed - they all are broken :-)

--- End quote ---
I saw you figured this one out already in the meantime. See also https://wiki.freepascal.org/Unicode_Support_in_Lazarus (and https://wiki.freepascal.org/FPC_Unicode_support for the FPC side)

Jonas Maebe:

--- Quote from: Arioch on September 26, 2022, 09:41:05 pm ---I did not change the Laz defaults, so it is UTF-8.

Also, UTF-8 does NOT need a BOM, because it is a byte-oriented format.
A BOM is needed for UCS2 / UTF-16 and perhaps for UCS4 and UTF-32.

The sources are UTF-8, as promised by the IDE.

HMMM...  The EXE does contain both literals in FPC too.

But the second one has a double conversion!!!

It mistook the UTF-8 sources for the ANSI codepage, and then applied the wrong conversion, 1251 -> UTF16 instead of the correct UTF8 -> UTF16.

--- End quote ---
That's because {$modeswitch systemcodepage} forces the compiler to interpret all constant strings in your source file as being encoded in the system code page (which is presumably 1251 in your case). The utf8 encoding gets specified on the command line by lazarus, but this directive overrides it (since it appears inside the source file, and hence is parsed after the command line options). So this is working as intended, and it seems something else is going on after all in your initial program (I forgot about Lazarus defaulting to utf-8 everywhere, I'm not a Lazarus user but a compiler developer).
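The double conversion described above can also be reproduced at run time. A minimal sketch, assuming FPC 3.x, a source file saved as UTF-8, and a widestring manager that knows code page 1251 (cwstring on Unix); the program name, the Dump helper and the variable names are made up for illustration. It relabels the UTF-8 bytes of a literal as code page 1251 without converting them, then converts to UTF-16 both ways:

```pascal
program DoubleConversionDemo; { hypothetical name }
{$mode objfpc}{$h+}
{$codepage utf8}

uses
  {$ifdef unix}cwstring,{$endif} sysutils;

{ Print the UTF-16 code units as hex, so the result does not depend
  on the console encoding. }
procedure Dump(const title: string; const s: UnicodeString);
var
  i: Integer;
begin
  Write(title, ' (', Length(s), ' code units):');
  for i := 1 to Length(s) do
    Write(' ', IntToHex(Integer(Ord(s[i])), 4));
  WriteLn;
end;

var
  src: UTF8String;
  mislabeled: RawByteString;
begin
  src := 'БЮЖД';

  { Same bytes as src, but tagged as code page 1251.
    Convert = False means the bytes themselves are not touched. }
  mislabeled := src;
  SetCodePage(mislabeled, 1251, False);

  { Correct path: the bytes are decoded as UTF-8. }
  Dump('UTF8 -> UTF16', UTF8Decode(src));

  { Wrong path: the same bytes are decoded as code page 1251. }
  Dump('1251 -> UTF16', UnicodeString(mislabeled));
end.
```

Each Cyrillic letter takes two bytes in UTF-8, so the mislabeled string should come out with twice as many code units as the correctly decoded one, which is the kind of garbage a literal turns into when UTF-8 source bytes are read as the system code page.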

Jonas Maebe:

--- Code: Pascal ---
{$mode objfpc}{$h+}
{$codepage utf8}

uses
  {$ifdef unix}cwstring,{$endif}sysutils;

procedure test;
var
  a: ansistring;
  sa, sc: unicodestring;
begin
  a := 'ABCD БЮЖД ABCD';
  sa := a;
  sc := 'ABCD БЮЖД ABCD';

  a := IntToStr(StringCodePage( a )) + ' / ' + IntToStr(StringElementSize(a)) + '  ' + a;
//  ShowMessage(a);
  sa := IntToStr(StringCodePage( sa )) + ' / ' + IntToStr(StringElementSize(sa)) + '  ' + sa;
//  ShowMessage(sa);
  sc := IntToStr(StringCodePage( sc )) + ' / ' + IntToStr(StringElementSize(sc)) + '  ' + sc;
//  ShowMessage(sc);

  writeln(a);
  writeln(sa);
  writeln(sc);
end;

begin
  test;
end.
--- End code ---

This program prints the following for me. So it seems the issue is related to the interaction with the LCL. The string codepage 0 for sa is weird though.
