
Author Topic: UnicodeString assignment from string literals seem broken in ObjFpc mode  (Read 767 times)

Arioch

  • Sr. Member
  • ****
  • Posts: 414
Re: UnicodeString assignment from string literals seem broken in ObjFpc mode
« Reply #15 on: September 26, 2022, 10:27:57 pm »
both default

Delphi - 1251

Lazarus - UTF-8

Arioch

  • Sr. Member
  • ****
  • Posts: 414
Re: UnicodeString assignment from string literals seem broken in ObjFpc mode
« Reply #16 on: September 26, 2022, 10:29:50 pm »
Delphi's Messages window shows the compiler command line; Lazarus does not, so I can't tell you the flags...

/AFK for an hour or until tomorrow

Jonas Maebe

  • Hero Member
  • *****
  • Posts: 1018
Re: UnicodeString assignment from string literals seem broken in ObjFpc mode
« Reply #17 on: September 26, 2022, 10:32:47 pm »
See https://wiki.freepascal.org/Unicode_Support_in_Lazarus#String_Literals for a more detailed explanation of what is happening, and what works/doesn't work depending on whether or not you explicitly tell the compiler that the source file is encoded in utf-8. It's indeed a mess and unfortunate there is no way to have a mode where everything works, but this was impossible to create while also preserving backward compatibility with older code (or at least I didn't figure out a way how to do it).
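
A minimal sketch of what "explicitly telling the compiler that the source file is encoded in UTF-8" can look like, using the `{$codepage utf8}` directive. The program and variable names are only illustrative, and the sketch assumes the file really is saved as UTF-8:

```pascal
program CodepageDemo;
{$mode objfpc}{$H+}
{$codepage utf8}  // declare that this source file is UTF-8 encoded

var
  u: UnicodeString;
begin
  // With the directive, the compiler decodes the literal from UTF-8
  // and stores four UTF-16 code units in the executable.
  u := 'БЮЖД';
  WriteLn(Length(u));             // 4 when the file is really saved as UTF-8
  WriteLn(HexStr(Ord(u[1]), 4));  // 0411, the code point of the first letter
end.
```

Removing the directive while keeping the file as UTF-8 without a BOM reproduces the situation this thread is about.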

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 10380
  • FPC developer.
Re: UnicodeString assignment from string literals seem broken in ObjFpc mode
« Reply #18 on: September 26, 2022, 10:42:05 pm »
The easiest solution might be going to "Project Options -> Application" and then enabling manifests and UTF-8 as the default encoding.

But note the warning there: it only applies to Windows 10 from a certain date onward.

Most supported versions should be newer than that date, though.


Arioch

  • Sr. Member
  • ****
  • Posts: 414
Re: UnicodeString assignment from string literals seem broken in ObjFpc mode
« Reply #19 on: September 26, 2022, 11:14:17 pm »
marcov, this should not work.

You propose to change runtime variables (of Windows). I already did that, by issuing `chcp` (outside the program) and `SetConsoleOutputCP` (inside).

You seem to assume that the EXE file is correct and that it is the RTL that damages correct data. It is the opposite: the RTL correctly interprets the data, but the data in the EXE was already broken during compilation.

Even if you could trigger some reverse bug in the RTL that would damage the data in a complementary way, that would be fragile.

Arioch

  • Sr. Member
  • ****
  • Posts: 414
Re: UnicodeString assignment from string literals seem broken in ObjFpc mode
« Reply #20 on: September 26, 2022, 11:34:18 pm »
Quote from: Jonas Maebe on September 26, 2022, 10:32:47 pm
It's indeed a mess and unfortunate there is no way to have a mode where everything works, but this was impossible to create while also preserving backward compatibility with older code (or at least I didn't figure out a way how to do it).

I don't see any technical problem here, only an organisational one.

  • FPC has one set of defaults
  • Lazarus has different set of defaults
  • Lazarus could explicitly spell out its defaults and demand FPC to obediently adapt to them.
  • (more complicated but still possible), FPC could explicitly lay out defaults and demand Lazarus to abide

But neither of the last two seems to happen.

I just did a test with Delphi: a console program with two units, one ANSI-encoded and the other UTF-8-encoded.

By default everything works like a charm. Delphi decides that a source file is encoded in ANSI unless there is a BOM. If there is a BOM, then it is Unicode.
I found no other charset marks in the project files.
I stripped the BOM, rebuilt - and the compiler, as expected, destroyed the data.

As soon as I broke the communication between IDE and compiler (here: the BOM), it went wrong.

So, from a purely technical standpoint there are obvious solutions.

1. Unicode without a BOM can be discouraged, just as it is in Delphi. Lazarus could create default projects with a BOM, and if an old project without BOMs is opened, it could add one.

I just changed the encoding in Lazarus to UTF8bom - and FPC immediately caught on.


Arioch

  • Sr. Member
  • ****
  • Posts: 414
Re: UnicodeString assignment from string literals seem broken in ObjFpc mode
« Reply #21 on: September 27, 2022, 12:26:42 am »
2. Lazarus saves the information about the chosen charsets (per unit) in project1.lpi and/or project1.lps.

FPC picks up those files implicitly (or is explicitly commanded to compile one of those files instead of project1.lpr) and learns the encoding.

3. Both Lazarus and FPC extract their charset-related heuristics into a separate package, and both use it.
But that is fragile.



I saved the LCL unit in the 1251 charset and compiled it - and FPC made things even worse. Notice how it still "converts" by just adding a zero byte.
Basically, FPC's conversion de facto becomes a loop of "read byte, write word".

A Delphi EXE shows the AnsiString with letters in the $C0 - $DF byte range, and the UTF-16 string in the $0410 - $042F word range.
FPC instead, when fed ANSI (1251) encoded sources, saves the MBCS stream the same way, but saves the UTF-16 stream in a broken $00C0 - $00DF word range.

It seems to me FPC assumes that an "unmarked" source file is Latin-1 or Windows-1250 or something. If FPC queried the platform's charset instead, it would perhaps get this right.
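
To make the two word ranges concrete, here is a small sketch that contrasts "read byte, write word" with a real Windows-1251 decode. It is not FPC's actual conversion code; the function names are hypothetical, and only the $C0 - $FF letter range is handled.

```pascal
program WidenDemo;
{$mode objfpc}{$H+}

// Naive widening: the byte value is used directly as the UTF-16 code unit.
// That is only correct when the source byte is Latin-1 (ISO-8859-1).
function WidenAsLatin1(b: Byte): Word;
begin
  Result := b;
end;

// Windows-1251 maps $C0..$FF linearly onto U+0410..U+044F.
// Other bytes are out of scope for this sketch and are returned unchanged.
function WidenAsCp1251(b: Byte): Word;
begin
  if b >= $C0 then
    Result := Word(b) - $C0 + $0410
  else
    Result := b;
end;

const
  // cp1251 bytes for the letters U+0410, U+0411 and U+042F
  Sample: array[0..2] of Byte = ($C0, $C1, $DF);
var
  i: Integer;
begin
  for i := Low(Sample) to High(Sample) do
    WriteLn('$', HexStr(Sample[i], 2),
            ' -> naive $', HexStr(WidenAsLatin1(Sample[i]), 4),
            ', cp1251 $', HexStr(WidenAsCp1251(Sample[i]), 4));
end.
```

The "naive" column stays in $00C0 - $00DF, while the cp1251 column lands in $0410 - $042F.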

I speculate (though it is not hard to check) that when loading sources without a BOM, Lazarus first tries to decode them as UTF-8, and if that fails it assumes the platform's current non-Unicode charset, just as it does for TStringList.LoadFromFile and TIniFile.

This heuristic seems reasonable and perhaps FPC could adopt it; I think it is based in the FCL, which is shared by both FPC and Laz.
But a better choice, I believe, would be for Laz to explicitly store the charset info, and for FPC to use it.
Unless that would lead to problems with git or svn.

------

Is it reliable? Can Lazarus be tricked into mis-detecting the charset? Probably it can, but only in some very fringe cases that no one cares about.

Windows extended-ASCII letters start at $C0 and go up to $FF.
In UTF-8 those letters are mapped to byte pairs like $D0 $A0 or $D1 $80 - but a $Cx byte cannot be a "continuation byte", so two such letters in a row are a decoding error.
So, while telling different MBCS codepages from one another is really problematic, telling UTF-8 from non-UTF is not hard.
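
A rough sketch of such a "does it decode as UTF-8" test. This is not the code Lazarus actually uses; `LooksLikeUtf8` is a hypothetical name, and the check is deliberately simplified (it looks only at lead/continuation byte structure and does not reject overlong forms or surrogates).

```pascal
program Utf8Check;
{$mode objfpc}{$H+}

// Returns True if the bytes follow the UTF-8 lead/continuation structure.
function LooksLikeUtf8(const buf: array of Byte): Boolean;
var
  i, need: Integer;
begin
  Result := False;
  i := 0;
  while i <= High(buf) do
  begin
    if buf[i] < $80 then need := 0                  // plain ASCII
    else if (buf[i] and $E0) = $C0 then need := 1   // 110xxxxx: 2-byte sequence
    else if (buf[i] and $F0) = $E0 then need := 2   // 1110xxxx: 3-byte sequence
    else if (buf[i] and $F8) = $F0 then need := 3   // 11110xxx: 4-byte sequence
    else Exit;                                      // stray continuation or invalid lead
    Inc(i);
    while need > 0 do
    begin
      // every continuation byte must look like 10xxxxxx
      if (i > High(buf)) or ((buf[i] and $C0) <> $80) then Exit;
      Inc(i);
      Dec(need);
    end;
  end;
  Result := True;
end;

begin
  WriteLn(LooksLikeUtf8([$D0, $91, $D0, $AE]));  // UTF-8 bytes of U+0411 U+042E
  WriteLn(LooksLikeUtf8([$C1, $DE]));            // the same two letters in cp1251
end.
```

The first call passes and the second fails, because $DE after the lead byte $C1 is not a continuation byte.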

But then what is the platform-default non-Unicode charset? I remember from pre-Unicode Linux that it was not all that easy. There even were special libraries back then to facilitate it :-)

So, if FPC and Lazarus do not want to come together and make an explicit specification of charsets in project files, I can suggest the following heuristics, for every source file:

1.1. Check for a BOM. If it is there, assume an "authoritative" charset.
1.2. Try to transcode from UTF-8 or UTF-7 to UTF-16 or anything. If that happens without errors, assume UTF-8 (or UTF-7) as an "authoritative" charset.
1.3. Try to ask the platform HAL whether there is a non-Unicode default charset. If there is, assume an "unreliable" charset.

The above are evaluated short-circuit style: the first one that is true exits.
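
Step 1.1 can be sketched as follows. The type and function names are hypothetical, and only the UTF-8 and UTF-16 byte order marks are recognised here.

```pascal
program BomCheck;
{$mode objfpc}{$H+}

type
  TBomKind = (bkNone, bkUtf8, bkUtf16LE, bkUtf16BE);

// Inspects the first bytes of a file buffer for a byte order mark.
function DetectBom(const buf: array of Byte): TBomKind;
begin
  Result := bkNone;
  if (Length(buf) >= 3) and (buf[0] = $EF) and (buf[1] = $BB) and (buf[2] = $BF) then
    Result := bkUtf8
  else if (Length(buf) >= 2) and (buf[0] = $FF) and (buf[1] = $FE) then
    Result := bkUtf16LE
  else if (Length(buf) >= 2) and (buf[0] = $FE) and (buf[1] = $FF) then
    Result := bkUtf16BE;
end;

begin
  WriteLn(DetectBom([$EF, $BB, $BF, $41]) = bkUtf8);  // UTF-8 BOM followed by 'A'
  WriteLn(DetectBom([$41, $42]) = bkNone);            // plain ASCII, no BOM
end.
```

Only when this returns `bkNone` would the detection fall through to steps 1.2 and 1.3.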

2.0. Parse the .LPI and .LPS files for charset information. Currently there is nothing there, so this is a "placeholder for a future hook".
2.1. Look for {$codepage } or {$modeswitch systemcodepage} or other related pragmas. If found, assume one more "authoritative" charset.
2.2. Check the FPC command line for charset-related options. It is a bad place, because it cannot specify different per-file charsets, only one global switch. If found, assume an "unreliable" charset.

The above are all evaluated.

3.1. If all the detected charsets match: great, happy compiling.
3.2. If all "authoritative" charsets (1 or more) match, but the "unreliable" (speculative) ones do not: emit a low-priority warning and compile the unit for the "authoritative" one.
3.3. If no "authoritative" charset is found but the "unreliable" ones agree: emit an Information or a low-priority Warning and compile.
3.4. If no charset is found at all: emit an error and stop, but make this error overridable in fpc.cfg. Or the opposite: make it a high-profile warning that can be raised to an error in the cfg.
3.5. If no "authoritative" charset is found and the "unreliable" ones disagree: emit an error and stop.
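
Rules 3.1 - 3.5 could be sketched like this. Everything here is hypothetical (names, types, string charset labels). Two points the list leaves open are decided by assumption: disagreeing "authoritative" charsets are treated as an error, and rule 3.3 takes precedence over 3.1 when there is no "authoritative" charset at all.

```pascal
program CharsetVote;
{$mode objfpc}{$H+}

type
  // Hypothetical outcome names for rules 3.1 - 3.5.
  TVerdict = (vOk, vWarn, vInfo, vNoCharset, vConflict);

// True if all entries name the same charset (or the list is empty).
function AllSame(const list: array of string): Boolean;
var
  i: Integer;
begin
  Result := True;
  for i := 1 to High(list) do
    if list[i] <> list[0] then
      Exit(False);
end;

function Decide(const auth, unreliable: array of string): TVerdict;
var
  i: Integer;
begin
  if (Length(auth) = 0) and (Length(unreliable) = 0) then
    Exit(vNoCharset);                 // 3.4: nothing detected at all
  if Length(auth) = 0 then
  begin
    if AllSame(unreliable) then
      Exit(vInfo)                     // 3.3: only unreliable guesses, all agree
    else
      Exit(vConflict);                // 3.5: only unreliable guesses, disagreeing
  end;
  if not AllSame(auth) then
    Exit(vConflict);                  // not covered by the list; assumed to be an error
  for i := 0 to High(unreliable) do
    if unreliable[i] <> auth[0] then
      Exit(vWarn);                    // 3.2: compile for auth[0], but warn
  Result := vOk;                      // 3.1: everything matches
end;

begin
  WriteLn(Decide(['utf8'], ['utf8']) = vOk);
  WriteLn(Decide(['utf8'], ['cp1251']) = vWarn);
  WriteLn(Decide(['utf8', 'cp1251'], ['utf8']) = vConflict);
end.
```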
« Last Edit: September 27, 2022, 12:29:06 am by Arioch »

paweld

  • Hero Member
  • *****
  • Posts: 512
Re: UnicodeString assignment from string literals seem broken in ObjFpc mode
« Reply #22 on: September 27, 2022, 03:40:53 pm »
The project files are encoded in UTF-8, so all string literals assigned to variables are UTF-8. On Windows, UnicodeString is UTF-16, so to assign such a string to a UnicodeString variable, you must first convert it to UTF-16 (UTF8ToUTF16).

When changing the code page of a string (SetCodePage), you must set Convert to True, because cp1251 <> utf8.
Code: Pascal
uses
  LazUTF8;  // in addition to the units already in the form unit's uses clause

procedure TForm1.FormCreate(Sender: TObject);
var
  a: ansistring;
  sa, sc: unicodestring;
begin
  a := 'ABCD БЮЖД ABCD';
  sa := a;                              // implicit AnsiString -> UnicodeString conversion
  sc := UTF8ToUTF16('ABCD БЮЖД ABCD');  // explicit UTF-8 -> UTF-16 conversion

  // Prefix each string with its code page and element size
  a := IntToStr(StringCodePage(a)) + ' / ' + IntToStr(StringElementSize(a)) + '  ' + a;
  sa := IntToStr(StringCodePage(sa)) + ' / ' + IntToStr(StringElementSize(sa)) + '  ' + sa;
  sc := IntToStr(StringCodePage(sc)) + ' / ' + IntToStr(StringElementSize(sc)) + '  ' + sc;
  ShowMessage(a + #13#10 + sa + #13#10 + sc);

  sa := '';

  // Convert to cp1251 and back to UTF-8
  a := 'ABCD БЮЖД ABCD';
  SetCodePage(RawByteString(a), 1251, True);
  SetCodePage(RawByteString(a), 65001, True);
  a := IntToStr(StringCodePage(a)) + ' / ' + IntToStr(StringElementSize(a)) + '  ' + a;
  sa := sa + a + #13#10;

  // Convert to cp1251, to UTF-8, and to cp1251 again
  a := 'ABCD БЮЖД ABCD';
  SetCodePage(RawByteString(a), 1251, True);
  SetCodePage(RawByteString(a), 65001, True);
  SetCodePage(RawByteString(a), 1251, True);
  a := IntToStr(StringCodePage(a)) + ' / ' + IntToStr(StringElementSize(a)) + '  ' + a;
  sa := sa + a + #13#10;

  // First only tag the data as UTF-8 (Convert = False), then convert as above
  a := 'ABCD БЮЖД ABCD';
  SetCodePage(RawByteString(a), 65001, False);
  SetCodePage(RawByteString(a), 1251, True);
  SetCodePage(RawByteString(a), 65001, True);
  a := IntToStr(StringCodePage(a)) + ' / ' + IntToStr(StringElementSize(a)) + '  ' + a;
  sa := sa + a + #13#10;

  ShowMessage(sa);
end;
Best regards
paweld

Arioch

  • Sr. Member
  • ****
  • Posts: 414
Re: UnicodeString assignment from string literals seem broken in ObjFpc mode
« Reply #23 on: September 27, 2022, 04:31:36 pm »
Quote from: paweld on September 27, 2022, 03:40:53 pm
When changing the code page (SetCodePage) for a string, you must set conversion to True, because cp1251 <> utf8.

I was purposely setting False there, to create exactly the combinations of 'raw data' and 'codepage tag' I wanted to test.

this whole topic, actually, grew out of https://forum.lazarus.freepascal.org/index.php/topic,60688.0.html

 

paweld

  • Hero Member
  • *****
  • Posts: 512
Re: UnicodeString assignment from string literals seem broken in ObjFpc mode
« Reply #24 on: September 27, 2022, 04:37:16 pm »
For changing a string's code page I prefer the LConvEncoding unit: https://lazarus-ccr.sourceforge.io/docs/lazutils/lconvencoding/index-5.html
Best regards
paweld

 
