
Topic: Bug in string concatenation?

Avinash

« on: December 11, 2024, 05:16:16 pm »

Version: FPC 3.2.2 Win32     OS: Windows 7 32-bit

Code: Pascal
var
  S: String;
begin

  //DefaultSystemCodePage := CP_OEMCP;

  S := ShortString(UTF8Encode(#255)) + #0#0#0#0;
  WriteLn(  HexStr(PLongInt(@S[1])^, 8)  );

  S := UTF8Encode(#255) + #0#0#0#0;           // just no ShortString() typecast
  WriteLn(  HexStr(PLongInt(@S[1])^, 8)  );

end.

These two WriteLn calls give different results for me.

The output also differs between FPC 3.2.2 and FPC 3.0.4: the only correct output comes from FPC 3.0.4 with the ShortString() typecast, while FPC 3.2.2 gives something weird in both cases.

On my system I get the following results:

FPC 3.0.4 output:
Code: Text
00008FD1    <=  the only case with a correct result (with the ShortString typecast)
000000FF        ...FF at least has some relation to the #255 argument...

FPC 3.2.2 output:
Code: Text
0000BFC3        ?
00000079        ???

I will also note that setting DefaultSystemCodePage correctly affects the result in FPC 3.0.4, while in FPC 3.2.2 nothing changes.

Fibonacci

« Reply #1 on: December 11, 2024, 07:16:31 pm »
The #255 code point encoded as UTF-8 is C3 BF, so FPC 3.2.2 is correct.

https://gchq.github.io/CyberChef/#recipe=From_Hex('Auto')Encode_text('UTF-8%20(65001)')To_Hex('Space',0)&input=ZmY&oenc=65001

UTF8Encode() creates a UTF8String with codepage 65001. You then concatenate it with the codepage 0 string "#0#0#0#0", so a conversion takes place, resulting in $79. Casting to ShortString resets the codepage to 1250, so no conversion happens. You can call SetCodePage() after UTF8Encode and before the concatenation to avoid the conversion.



Or, to avoid conversion:

Code: Pascal
DefaultSystemCodePage := CP_UTF8;

so that all strings have the same codepage as the result of UTF8Encode.
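
For illustration, the SetCodePage() route mentioned in this reply could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested recipe for every FPC version: the program name and the variables U, R and S are made up here, and it assumes {$H+} so that the #0#0#0#0 constant is an AnsiString labelled with the system code page. Passing False as the third argument asks SetCodePage to relabel the string without converting its bytes.

```pascal
program RelabelBeforeConcat;
{$mode objfpc}{$H+}

var
  U: UnicodeString;
  R: RawByteString;
  S: AnsiString;
  i: Integer;
begin
  // State the code point explicitly as a WideChar (U+00FF), so the result
  // does not depend on how the compiler interprets the literal #255.
  U := WideChar($00FF);

  // UTF8Encode returns a RawByteString; per the documentation quoted in this
  // thread, its code page is CP_UTF8.
  R := UTF8Encode(U);

  // Relabel only (Convert = False): the bytes are left untouched, but the
  // string now carries the system code page, like the constant below.
  SetCodePage(R, DefaultSystemCodePage, False);

  // Assumption: with matching code pages on both operands, the concatenation
  // has nothing to convert.
  S := R + #0#0#0#0;

  // Print every byte instead of reading a LongInt through a pointer.
  for i := 1 to Length(S) do
    Write(HexStr(Ord(S[i]), 2), ' ');
  WriteLn;
end.
```

Relabelling is only safe while the string is treated as a container of bytes; anything that later interprets S as text in the system code page will see the UTF-8 bytes as individual characters.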
« Last Edit: December 11, 2024, 07:21:22 pm by Fibonacci »

tetrastes

« Reply #2 on: December 11, 2024, 08:33:30 pm »
Quote
#255 codepoint after encoding to UTF-8 is c3 bf, FPC 3.2.2 is correct.

It depends on how the compiler interprets #255. It seems that FPC 3.0.4 treats it as AnsiChar, and that the OP's CP is 1251: D1 8F is 'я' in UTF-8, FF is 'я' in cp1251.
FPC 3.2.2 treats #255 as UnicodeChar: 'ÿ' is 00 FF in UTF-16 and C3 BF in UTF-8.

Then there is the concatenation. It doesn't change ShortStrings, as they do not carry CP information. But for AnsiStrings, concatenation results in an AnsiString with the system CP (cp1251 here), and FPC tries to convert the UTF-8 to it, so 'ÿ' -> 'y' ($79). More often, though, this results in '?' ($3F).  :D



Avinash

« Reply #3 on: December 11, 2024, 10:05:35 pm »
Quote
It seems that fpc 3.0.4 treats it as AnsiChar, and that OP's CP is 1251: D1 8F is 'я' in utf-8, FF is 'я' in cp1251.
FPC 3.2.2 treats #255 as UnicodeChar: 'ÿ' is 00 FF in UTF-16 and C3 BF in UTF-8.

Indeed. When I add a UnicodeChar(#255) typecast, the FPC 3.0.4 result becomes exactly the same as in 3.2.2.

However, none of these typecasts (Char(#255), AnsiChar, ShortString, AnsiString) changes the result in FPC 3.2.2.

Quote
and fpc tries to convert utf-8 to it, so 'ÿ' -> 'y'($79)

Is the compiler right to do what it does here? Are these two expressions really supposed to give the same result (as they do now)?
Then the quote from the documentation «UTF8Encode convert a widestring or unicodestring to an UTF-8 encoded ansistring» becomes vague. What does «UTF-8 encoded ansistring» mean?

Code: Pascal
var
  S: String;
begin

  S := UTF8Encode(UnicodeChar(#255)) + #0#0#0#0;
  WriteLn(  HexStr(PLongInt(@S[1])^, 8)  );       //  00000079

  S := UnicodeChar(#255) + #0#0#0#0;
  WriteLn(  HexStr(PLongInt(@S[1])^, 8)  );       //  00000079

end.

tetrastes

« Reply #4 on: December 11, 2024, 10:34:58 pm »
Quote
However, none of the listed typecasts: Char(#255), AnsiChar, ShortString, AnsiString changes the result of FPC 3.2.2
Use {$codepage cp1251}

Quote
Is compiler doing right in trying to do what he does? Are these two expressions really supposed to give the same result (as they do now)?
Which two expressions do you mean?

Quote
What does it mean, «UTF-8 encoded ansistring»?
An AnsiString has a field holding its CP information, so this is one whose CP is UTF-8:
Code: Pascal
S := UTF8Encode(#255);
writeln(StringCodePage(S));      // 65001
WriteLn(  HexStr(PLongInt(@S[1])^, 8)  );     // 0000BFC3

S := S + #0#0#0#0;
writeln(StringCodePage(S));     // 1251
WriteLn(  HexStr(PLongInt(@S[1])^, 8) );     // 00000079




Avinash

« Reply #5 on: December 12, 2024, 01:51:00 am »
OK, thanks to tetrastes I now understand how it all works, and I'm ready to reformulate why it's a bug.

Let's look at this expression again from the point of view of all these codepages:
Code: Pascal
S := UTF8Encode(#255) + #0#0#0#0;
The documentation states that UTF8Encode returns RawByteString, in other words AnsiString(CP_NONE). If I understand it correctly, when it is concatenated with #0#0#0#0 it should not be converted to anything like the system CP, and the result should be the same as with the ShortString() typecast:
Code: Pascal
S := ShortString(UTF8Encode(#255)) + #0#0#0#0;

But here, instead of AnsiString(CP_NONE), we get AnsiString(CP_UTF8). It seems that the RawByteString type simply doesn't work properly.

Quote
Use {$codepage cp1251}

That looks like a workaround. It's beyond my understanding why the String <=> WideString conversion happens in some situations but not in others:

Code: Pascal
var
  S: ShortString;
  U: UnicodeString;
begin

  S := #$FF;

  U := S;                               // type conversion happens
  WriteLn(  HexStr(Ord(U[1]), 4)  );    // 044F

  S := U;                               // type conversion happens
  WriteLn(  HexStr(Ord(S[1]), 4)  );    // FF


  S := UnicodeString(#$044F);           // type conversion happens
  WriteLn(  HexStr(Ord(S[1]), 4)  );    // FF

  U := ShortString(#$FF);               // NO type conversion happens
  WriteLn(  HexStr(Ord(U[1]), 4)  );    // FF

end.

In summary, these things should work much more intuitively than they actually do.
« Last Edit: December 12, 2024, 01:59:14 am by Avinash »

Thaddy

« Reply #6 on: December 12, 2024, 06:44:14 am »
No, simply do not try to out-smart a compiler.
If I smell bad code it usually is bad code and that includes my own code.

tetrastes

« Reply #7 on: December 12, 2024, 10:05:41 am »
Quote
But here, instead of AnsiString(CP_NONE), we get AnsiString(CP_UTF8). It seems that the RawByteString type simply doesn't work properly.

https://www.freepascal.org/docs-html/current/rtl/system/utf8encode.html:
Quote
The resulting string has code page CP_UTF8.
And this is clearly seen:
Code: Pascal
function UTF8Encode(const s : WideString) : RawByteString;
  var
    i : SizeInt;
    hs : UTF8String;
  begin
    result:='';
    if s='' then
      exit;
    SetLength(hs,length(s)*3);
    i:=UnicodeToUtf8(pchar(hs),length(hs)+1,PWideChar(s),length(s));
    if i>0 then
      begin
        SetLength(hs,i-1);
        result:=hs;
      end;
  end;

I think that https://www.freepascal.org/docs-html/current/rtl/system/rawbytestring.html is very unclear and maybe even incorrect:
Quote
RawByteString is a single-byte character string which does not have any codepage associated with it.
. . .
the codepage of the destination is simply set to the codepage of the rawbytestring
%)

Quote
It's beyond my understanding why String <=> WideString conversion happens in some situations, but not in others:

It's useful to look at the asm code to understand the difference:
Code: ASM
# [14] S := #$FF;
   movw   $65281,U_$P$PROGRAM_$$_S(%rip)
# [16] U := S;                               // type conversion happens
   leaq   U_$P$PROGRAM_$$_S(%rip),%rax
   leaq   U_$P$PROGRAM_$$_U(%rip),%rcx
   movq   %rax,%rdx
   call   fpc_shortstr_to_unicodestr

. . .

# [26] U := ShortString(#$FF);               // NO type conversion happens
   leaq   U_$P$PROGRAM_$$_U(%rip),%rcx
   leaq   .Ld2(%rip),%rdx
   call   fpc_unicodestr_assign     # simple assignment of a constant, the same as U := #$FF;
                # because ShortString(#$FF) is an ordinary constant and FPC converts it to a UnicodeString constant at compile time
. . .

.section .rodata.n_.Ld2,"d"
   .balign 8
.Ld2$strlab:
   .short   1200,2
   .long   0
   .quad   -1,1
.Ld2:
   .short   255,0
« Last Edit: December 12, 2024, 10:11:07 am by tetrastes »

Thaddy

« Reply #8 on: December 12, 2024, 11:57:47 am »
Did it not occur to you that concatenating shortstrings beyond 255 is silly? Or another word starting with S... >:D

tetrastes

« Reply #9 on: December 12, 2024, 02:50:27 pm »
Where do you see shortstrings beyond 255?  You really like that Red Evil...  :D

Thaddy

« Reply #10 on: December 12, 2024, 06:08:04 pm »
Well it started here:
Code: Pascal
var
  S: String;
begin

  //DefaultSystemCodePage := CP_OEMCP;

  S := ShortString(UTF8Encode(#255)) + #0#0#0#0;
Now explain to me how that is sane code. About the most stupid code I have seen in 2024. It deserves a medal.


tetrastes

« Reply #11 on: December 12, 2024, 06:44:00 pm »
Ah, you mean the zeroes, not the length?
Well, maybe the OP needs it for something, or maybe it is just for the example.
If we change the zeroes to, for example, twos, what difference does it make for this issue?

Avinash

« Reply #12 on: December 12, 2024, 07:36:50 pm »
Quote
Now explain to me that is sane code.
The original task was to get the #128..#255 characters of the local codepage as UTF-8 sequences. Since WriteLn then outputs the first four bytes of the string, and UTF8Encode will return 2-3 characters (or even 0, who knows), I append four zero bytes to ensure that the output is the result of UTF8Encode and not possible garbage.
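
As an aside, one way to inspect the bytes without padding the string (and so without any concatenation of strings with different code pages) is to walk over the result byte by byte. The following is only a minimal sketch: the program name and the BytesToHex helper are hypothetical and not part of the RTL, and the input is stated explicitly as the WideChar U+00FF rather than as #255, whose interpretation was shown earlier in the thread to depend on the compiler version.

```pascal
program DumpUtf8Bytes;
{$mode objfpc}{$H+}

// Hypothetical helper (not from the thread or the RTL): returns the bytes of
// R as hex digits. The string is indexed byte by byte, and a RawByteString
// parameter accepts any AnsiString without converting it.
function BytesToHex(const R: RawByteString): AnsiString;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(R) do
    Result := Result + HexStr(Ord(R[i]), 2) + ' ';
end;

var
  U: UnicodeString;
  R: RawByteString;
begin
  U := WideChar($00FF);   // U+00FF, which is C3 BF in UTF-8
  R := UTF8Encode(U);
  // Length(R) reports how many bytes UTF8Encode produced, so no zero padding
  // is needed to keep garbage out of the dump.
  WriteLn(Length(R), ' byte(s): ', BytesToHex(R));
end.
```

Because the length is printed alongside the bytes, an empty result is visible as such instead of being hidden behind the padding.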

Quote
The resulting string has code page CP_UTF8.

Yes, I was wrong here. I did not read the description of UTF8Encode carefully, assuming in the old-fashioned way that if a function declares a certain return type in its header, then it returns that type and not something else.

This is not a bug in the literal sense; it is a mess on the scale of a bug.

Let's go back to the original example:
Code: Pascal
var
  S: ShortString;
begin

  S := ShortString(UTF8Encode(#255)) + #0#0#0#0;
  WriteLn(  HexStr(PLongInt(@S[1])^, 8)  );      // 0000BFC3

  S := UTF8Encode(#255) + #0#0#0#0;
  WriteLn(  HexStr(PLongInt(@S[1])^, 8)  );      // 00000079

end.

You explained it:
Quote
Then there is concatenation. It doesn't change the ShortStrings, as they do not have CP information. But for AnsiStrings concatenation results in AnsiString with system CP (cp1251 here), and fpc tries to convert utf-8 to it, so 'ÿ' -> 'y'($79).

But what is the difference between ShortString and AnsiString(CP_ACP) other than the mechanics of data/memory management? How else is the ShortString codepage interpreted, if not as the system one or the one specified in DefaultSystemCodePage? Why then does typecasting UTF8 to AnsiString result in $79, but to ShortString in $BFC3?
Next, if typecasting to ShortString works as getting the byte sequence of a UTF8 string, then what typecast is needed to get this sequence of bytes as an AnsiString, i.e. without the length limit of 255 characters?

Okay, let's say #0#0#0#0 has CP_ACP encoding and therefore concatenation results in $79.
Then let's try typecasting it to UTF8String:

Code: Pascal
var
  S: ShortString;
begin

  S := UTF8Encode(#255) + UTF8String(#0#0#0#0);    // 0000BFC3  :)
  WriteLn(  HexStr(PLongInt(@S[1])^, 8)  );

end.

It seems to work. But why then does the following give such a strange result?
Code: Pascal
S := UTF8String(#255) + UTF8String(#0#0#0#0);      // BFC283C3 ??? :(

If we replace UTF8String with UTF8Encode in the second term, then everything is fine again:
Code: Pascal
S := UTF8String(#255) + UTF8Encode(#0#0#0#0);      // 0000BFC3

Quote
It's useful to look at asm code to understand the difference

We only see that the compiler doesn't do a conversion (that was obvious anyway), not why. Why wouldn't it call fpc_shortstr_to_unicodestr before fpc_unicodestr_assign? It does so in the case of
Code: Pascal
S := UnicodeString(#$044F);
How does a two-byte constant differ from a one-byte one?
« Last Edit: December 12, 2024, 07:42:13 pm by Avinash »

Remy Lebeau

« Reply #13 on: December 13, 2024, 12:34:31 am »
Quote
UTF8Encode() creates a UTF8String with codepage 65001

The encoding of the characters indeed follows codepage 65001, but the code shown earlier for UTF8Encode() clearly doesn't call SetCodePage() (or equivalent) to set the stored codepage of the returned RawByteString to codepage 65001, which will affect runtime conversions when mixing the RawByteString with other string types.
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

Fibonacci

« Reply #14 on: December 13, 2024, 12:47:23 am »
Quote
UTF8Encode() creates a UTF8String with codepage 65001

The encoding of the characters indeed follows codepage 65001, but the code shown earlier for UTF8Encode() clearly doesn't call SetCodePage() (or equivalent) to set the stored codepage of the returned RawByteString to codepage 65001

It doesn't have to. It has a local variable of type UTF8String, i.e. AnsiString(CP_UTF8), which is then assigned to the result of type RawByteString, and the CP is preserved.
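
The point about the code page surviving the assignment can be checked with a small sketch along these lines. The program and variable names are made up for illustration, and the sketch only mirrors the hs/result pattern from the UTF8Encode source quoted earlier in the thread; it is not taken from the RTL.

```pascal
program CodePagePreserved;
{$mode objfpc}{$H+}

var
  hs: UTF8String;      // AnsiString(CP_UTF8), like the local hs in UTF8Encode
  r: RawByteString;
begin
  hs := 'abc';
  r := hs;             // plain assignment, no SetCodePage call anywhere

  // StringCodePage reports the code page stored with the string data.
  // If the assignment preserves it, both lines print the same value.
  WriteLn('hs: ', StringCodePage(hs));
  WriteLn('r:  ', StringCodePage(r));
end.
```

This is why the declared return type RawByteString says little about what StringCodePage will report for the returned value.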

 
