The code point #255 encoded as UTF-8 is C3 BF, so FPC 3.2.2 is correct.
It seems that FPC 3.0.4 treats it as AnsiChar, and that the OP's code page is 1251: D1 8F is 'я' in UTF-8, FF is 'я' in cp1251.
FPC 3.2.2 treats #255 as UnicodeChar: 'ÿ' is 00 FF in UTF-16 and C3 BF in UTF-8.
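The byte values quoted above can be checked without depending on how the compiler reads the literal #255. A minimal sketch, assuming FPC 3.x where UTF8Encode takes a UnicodeString; the helper name Utf8BytesHex is made up for this illustration, and the code points are built with WideChar(...) so the source file encoding does not matter:

```pascal
program utf8bytes;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of the RTL): hex dump of UTF8Encode(u).
function Utf8BytesHex(const u: UnicodeString): string;
var
  r: RawByteString;
  i: Integer;
begin
  Result := '';
  r := UTF8Encode(u);
  for i := 1 to Length(r) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + HexStr(Ord(r[i]), 2);
  end;
end;

var
  u: UnicodeString;
begin
  u := WideChar($00FF);       // U+00FF, 'ÿ'
  WriteLn(Utf8BytesHex(u));   // C3 BF
  u := WideChar($044F);       // U+044F, 'я'
  WriteLn(Utf8BytesHex(u));   // D1 8F
end.
```

So C3 BF means #255 was taken as U+00FF, while D1 8F means byte FF was first interpreted as cp1251 'я' and only then encoded.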
and fpc tries to convert utf-8 to it, so 'ÿ' -> 'y'($79)
However, none of the listed typecasts: Char(#255), AnsiChar, ShortString, AnsiString changes the result of FPC 3.2.2
Use {$codepage cp1251}
Is the compiler right in trying to do what it does? Are these two expressions really supposed to give the same result (as they do now)?
Which two expressions?
What does it mean, «UTF-8 encoded ansistring»?
An AnsiString has a field with code page information, so this is one with UTF-8:
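For readers who have not met the code page field before, here is a small sketch of how it can be read and relabelled. It assumes FPC 3.x with code page aware strings; Relabel is a hypothetical helper written for this illustration, while StringCodePage and SetCodePage are the RTL routines:

```pascal
program cpfield;
{$mode objfpc}{$H+}

// Hypothetical helper: returns a copy of src whose code page field is set
// to cp. Convert = False means the bytes are left untouched, only the
// label changes.
function Relabel(const src: RawByteString; cp: TSystemCodePage): RawByteString;
begin
  Result := src;
  SetCodePage(Result, cp, False);
end;

var
  u: UnicodeString;
  s, t: RawByteString;
begin
  u := WideChar($00FF);
  s := UTF8Encode(u);
  WriteLn('s: code page ', StringCodePage(s), ', length ', Length(s));
  t := Relabel(s, 1251);
  WriteLn('t: code page ', StringCodePage(t), ', length ', Length(t));
end.
```

The bytes are the same in both strings. Only the label differs, and that label is what the compiler looks at when it decides whether a conversion is needed.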
Use {$codepage cp1251}
But here, instead of AnsiString(CP_NONE), we get AnsiString(CP_UTF8). It seems that the RawByteString type simply doesn't work properly.
The resulting string has code page CP_UTF8.
And this is clearly seen:
RawByteString is a single-byte character string which does not have any codepage associated with it.
%)
. . .
the codepage of the destination is simply set to the codepage of the rawbytestring
It's beyond my understanding why String <=> WideString conversion happens in some situations, but not in others:
Now explain to me how that is sane code.
The original task was to get the #128..#255 characters of the local code page as UTF-8 sequences. Since WriteLn then outputs the first four bytes of the string, and UTF8Encode will return 2-3 characters (or even 0, who knows), I zero at least four bytes to ensure that the output is the result of UTF8Encode and not possible garbage.
The resulting string has code page CP_UTF8.
Then there is concatenation. It doesn't change the ShortStrings, as they do not have CP information. But for AnsiStrings concatenation results in AnsiString with system CP (cp1251 here), and fpc tries to convert utf-8 to it, so 'ÿ' -> 'y'($79).
It's useful to look at asm code to understand the difference
UTF8Encode() creates a UTF8String with codepage 65001
The encoding of the characters indeed follows codepage 65001, but the code shown earlier for UTF8Encode() clearly doesn't call SetCodePage() (or equivalent) to set the stored codepage of the returned RawByteString to codepage 65001.
It doesn't have to. It has a local variable of type UTF8String, which is AnsiString(CP_UTF8); that variable is then assigned to the result of type RawByteString, and the CP is preserved.
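The same pattern can be tried in isolation. This is only a sketch of the mechanism described here, not the RTL source of UTF8Encode; MakeUtf8 is a made-up function, and it assumes FPC 3.x, where SetLength on a UTF8String allocates the string with the declared code page:

```pascal
program rawcp;
{$mode objfpc}{$H+}

// Hypothetical stand-in for the pattern: fill a local UTF8String, then
// assign it to a RawByteString result without calling SetCodePage.
function MakeUtf8: RawByteString;
var
  tmp: UTF8String;            // AnsiString(CP_UTF8)
begin
  SetLength(tmp, 2);
  tmp[1] := AnsiChar($C3);    // UTF-8 bytes of U+00FF
  tmp[2] := AnsiChar($BF);
  Result := tmp;              // assignment to RawByteString: no conversion
end;

var
  r: RawByteString;
begin
  r := MakeUtf8;
  WriteLn('code page of result: ', StringCodePage(r));
  WriteLn('CP_UTF8 constant:    ', CP_UTF8);
end.
```

If the two numbers match, the code page came along with the string data, which is the point being made: RawByteString as a declared type carries no code page of its own, but the string it holds still does.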