
UnicodeString assignment from string literals seems broken in ObjFpc mode


Arioch:
Update: https://gitlab.com/freepascal.org/fpc/source/-/issues/39923

Win7 / Laz 2.2.3 / FPC 3.2.3

Default LCL project,  {$mode objfpc}{$H+}     


--- Code: Pascal ---
procedure TForm1.FormClick(Sender: TObject);
var
  a: ansistring;
  sa, sc: unicodestring;
begin
  a := 'ABCD БЮЖД ABCD';
  sa := a;
  sc := 'ABCD БЮЖД ABCD';

  a := IntToStr(StringCodePage( a )) + ' / ' + IntToStr(StringElementSize(a)) + '  ' + a;
//  ShowMessage(a);
  sa := IntToStr(StringCodePage( sa )) + ' / ' + IntToStr(StringElementSize(sa)) + '  ' + sa;
//  ShowMessage(sa);
  sc := IntToStr(StringCodePage( sc )) + ' / ' + IntToStr(StringElementSize(sc)) + '  ' + sc;
//  ShowMessage(sc);
  ShowMessage(a + #13#10 + sa + #13#10 + sc);
end;
--- End code ---
We should see 3 almost identical lines, right?
No!

Arioch:
And the same check with explicit code page conversions: the results are mostly the same, but there is a small difference in the first one.


--- Code: Pascal ---
  sa := '';

  a := 'ABCD БЮЖД ABCD';
  SetCodePage(RawByteString(a), 1251, True);
  SetCodePage(RawByteString(a), 65001, False);
  a := IntToStr(StringCodePage( a )) + ' / ' + IntToStr(StringElementSize(a)) + '  ' + a;
  sa := sa + a + #13#10;

  a := 'ABCD БЮЖД ABCD';
  SetCodePage(RawByteString(a), 1251, True);
  SetCodePage(RawByteString(a), 65001, False);
  SetCodePage(RawByteString(a), 1251, True);
  a := IntToStr(StringCodePage( a )) + ' / ' + IntToStr(StringElementSize(a)) + '  ' + a;
  sa := sa + a + #13#10;

  a := 'ABCD БЮЖД ABCD';
  SetCodePage(RawByteString(a), 65001, True);
  SetCodePage(RawByteString(a), 1251, False);
  SetCodePage(RawByteString(a), 65001, True);
  a := IntToStr(StringCodePage( a )) + ' / ' + IntToStr(StringElementSize(a)) + '  ' + a;
  sa := sa + a + #13#10;

  ShowMessage(sa);
--- End code ---

ASerge:

--- Quote from: Arioch on September 26, 2022, 07:29:59 pm ---
We should see 3 almost identical lines, right?
No!

--- End quote ---
By default, the source code is assumed to be in the OS encoding. Therefore, the compiler considers a string constant in the OS encoding and stores it in the executable file in this form, although the original file actually contains it in UTF-8. A conversion then takes place at run time.
You can change the compiler's "opinion" in either of two ways:
1. Save the file as UTF-8 with BOM (via the context menu)
2. Add an explicit {$CODEPAGE UTF8} directive
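
As a rough illustration of option 2, here is a minimal console sketch (not the original LCL project; the program name is made up for this example, and it assumes the source file is saved as UTF-8). It prints the same diagnostics as the test above, without the message box.

```pascal
program CodePageDirectiveDemo; // hypothetical name, for illustration only

{$mode objfpc}{$H+}
{$codepage UTF8} // tell the compiler that this source file is UTF-8 encoded

var
  sc: unicodestring;
begin
  // Same literal as in the original test, assigned directly to a UnicodeString
  sc := 'ABCD БЮЖД ABCD';

  // The literal has 14 characters, so a correctly decoded string
  // should report a length of 14 UTF-16 code units
  WriteLn('Length:       ', Length(sc));
  WriteLn('Code page:    ', StringCodePage(sc));
  WriteLn('Element size: ', StringElementSize(sc));
end.
```

Compiling the same file with and without the directive and comparing the reported length is one way to see whether the literal was decoded as intended.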

Arioch:

--- Quote from: ASerge on September 26, 2022, 08:36:56 pm ---
string constant in the OS encoding and stores it in the executable file in this form

--- End quote ---

1. No. The EXE file has those strings in UTF-8, not in windows-1251.

2. If that were so, both ansistring (UTF-8 in LCL) and unicodestring (UTF-16) would be uniformly damaged in one mode and uniformly fixed in the other.

What we see here is that the compiler either fails to properly mark string literals with their encoding or fails to perform an encoding check when assigning literals...

Jonas Maebe:
Delphi versions with string = unicodestring interpret constant strings in your source file according to the default code page of the machine you are compiling on. If you want this behaviour in FPC, add {$modeswitch systemcodepage} (_after_ any {$mode xxx} directives), or use {$mode delphiunicode} (this also makes string an alias for unicodestring).
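
A minimal sketch of where the modeswitch would go, assuming a plain console program (the program name is hypothetical). What it prints depends on the code page of the machine doing the compiling and on how the source file is saved, so no expected output is given.

```pascal
program SystemCodePageDemo; // hypothetical name, for illustration only

{$mode objfpc}{$H+}
{$modeswitch systemcodepage} // placed after the {$mode ...} directive, as described above

var
  sc: unicodestring;
begin
  // The literal is now interpreted according to the default code page
  // of the compiling machine, so the file must be saved in that encoding
  sc := 'ABCD БЮЖД ABCD';

  WriteLn('Length:       ', Length(sc));
  WriteLn('Code page:    ', StringCodePage(sc));
  WriteLn('Element size: ', StringElementSize(sc));
end.
```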
