Now explain to me how that is sane code.
The original task was to get the UTF-8 sequences for characters #128..#255 of the local codepage. Since WriteLn then outputs the first four bytes of the string, and UTF8Encode will return 2-3 bytes (or even 0, who knows), I zero at least four bytes to make sure the output is the result of UTF8Encode and not possible garbage. The resulting string has code page CP_UTF8.
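To make the intent concrete, here is a minimal sketch of what the task amounts to (assuming FPC 3.x with default settings; the loop and variable names are just my illustration, and which UTF8Encode overload gets picked here is part of the very mess under discussion):

var
  B: Byte;
  A: AnsiString;
  S: ShortString;
begin
  for B := 128 to 255 do
  begin
    A := Chr(B);                                  // one character of the local codepage
    S := ShortString(UTF8Encode(A)) + #0#0#0#0;   // pad so the first four bytes are always defined
    WriteLn( HexStr(PLongInt(@S[1])^, 8) );       // print those four bytes as hex
  end;
end.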
Yes, I was wrong here. I did not look carefully at the description of UTF8Encode, assuming in the old-fashioned way that if a function declares a certain return type in its header, then it returns that type and nothing else.
This is not a bug in the literal sense; it is a mess on the scale of a bug.
Let's go back to the original example:
var
  S: ShortString;
begin
  S := ShortString(UTF8Encode(#255)) + #0#0#0#0;
  WriteLn( HexStr(PLongInt(@S[1])^, 8) ); // 0000BFC3
  S := UTF8Encode(#255) + #0#0#0#0;
  WriteLn( HexStr(PLongInt(@S[1])^, 8) ); // 00000079
end.
You explained it:
Then there is concatenation. It doesn't change ShortStrings, as they do not have CP information. But for AnsiStrings, concatenation results in an AnsiString with the system CP (cp1251 here), and fpc tries to convert the UTF-8 to it, so 'ÿ' -> 'y' ($79).
But what is the difference between ShortString and AnsiString(CP_ACP) other than the mechanics of data/memory management? How else is the ShortString codepage interpreted, other than as the system one or the one specified in DefaultSystemCodePage? Why, then, does typecasting UTF-8 to AnsiString result in $79, but to ShortString in $BFC3?
Next, if typecasting to ShortString works as taking the byte sequence of a UTF-8 string, then what typecast is needed to get this byte sequence as an AnsiString, i.e. without the 255-character length limit?
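One guess at a workaround (a sketch only, not an answer; I am not sure this is the intended way): keep the bytes in a RawByteString, which accepts any code page without converting, or relabel them in place with SetCodePage and Convert=False:

var
  R: RawByteString;
  I: Integer;
begin
  R := UTF8Encode(#255);           // RawByteString keeps the CP_UTF8 result as-is
  SetCodePage(R, CP_ACP, False);   // relabel only; the bytes are not converted
  for I := 1 to Length(R) do
    Write( HexStr(Ord(R[I]), 2), ' ' );
  WriteLn;
end.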
Okay, let's say #0#0#0#0 has CP_ACP encoding and therefore the concatenation results in $79.
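If it helps, the code page each string actually carries can be checked with StringCodePage (a minimal check, assuming FPC 3.x; no claim about what the middle line prints, that is the question):

var
  U, R: RawByteString;
begin
  U := UTF8Encode(#255);
  WriteLn( 'operand CP: ', StringCodePage(U) );      // expected 65001 (CP_UTF8)
  R := U + #0#0#0#0;
  WriteLn( 'result  CP: ', StringCodePage(R) );      // code page attached to the concatenation
  WriteLn( 'system  CP: ', DefaultSystemCodePage );  // 1251 here
end.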
Then let's try typecasting it to UTF8String:
var
  S: ShortString;
begin
  S := UTF8Encode(#255) + UTF8String(#0#0#0#0); // 0000BFC3 :)
  WriteLn( HexStr(PLongInt(@S[1])^, 8) );
end.
It seems to work. But why, then, does the following give a strange result?
S := UTF8String(#255) + UTF8String(#0#0#0#0); // BFC283C3 ??? :(
If we replace UTF8String with UTF8Encode in the second term, then everything is fine again:
S := UTF8String(#255) + UTF8Encode(#0#0#0#0); // 0000BFC3
It's useful to look at the asm code to understand the difference.
We only see that the compiler doesn't typecast (that was obvious anyway), not why. Why wouldn't it call fpc_shortstr_to_unicodestr before fpc_unicodestr_assign? It does that in the case of
S := UnicodeString(#$044F);
How does a two-byte constant differ from a one-byte one?
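For reference, this is the minimal pair I am comparing (my own snippet, not the earlier listing; compile with -al and read the generated .s file, since whether the helper gets called is exactly what differs):

var
  S: UnicodeString;
begin
  S := UnicodeString(#255);    // one-byte char constant: is fpc_shortstr_to_unicodestr called here too?
  S := UnicodeString(#$044F);  // two-byte char constant: the helper is called before fpc_unicodestr_assign
end.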