Lazarus

Free Pascal => General => Topic started by: wylton on April 02, 2021, 09:34:16 am

Title: FPC string is not utf8string?
Post by: wylton on April 02, 2021, 09:34:16 am
str: string;
BytesOf(str) gives a different result than TEncoding.UTF8.GetBytes(str).
Title: Re: FPC string is not utf8string?
Post by: Bart on April 02, 2021, 10:48:15 am
By default the type String uses the system codepage (CP_ACP).
On Windows this is some ANSI codepage; in the common single-byte ones each character is exactly 1 byte (and therefore only 256 characters exist).

Bart
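
To make the codepage point concrete, here is a small language-neutral sketch in Python (not FPC code). It assumes the system codepage is Windows-1252, which is only one possible CP_ACP value; the helper name fits_in_codepage is made up for illustration.

```python
# Illustration only: Python codecs stand in for FPC's codepage handling.
# Assumption: the system codepage (CP_ACP) is Windows-1252.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

ansi_bytes = text.encode("cp1252")  # what a single-byte ANSI string would hold
utf8_bytes = text.encode("utf-8")   # what a UTF-8 encoder would produce

print(ansi_bytes.hex())  # one byte
print(utf8_bytes.hex())  # two bytes


def fits_in_codepage(s: str, codepage: str) -> bool:
    """Hypothetical helper: True if every character of s exists in codepage."""
    try:
        s.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# A single-byte codepage has room for at most 256 characters,
# so most of Unicode cannot be represented in it at all.
print(fits_in_codepage("\u00e9", "cp1252"))  # e-acute is part of cp1252
print(fits_in_codepage("\u03a9", "cp1252"))  # Greek capital omega is not
```

The same character therefore has different bytes depending on the encoding, which is one plausible reason the two calls in the question disagree.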
Title: Re: FPC string is not utf8string?
Post by: jamie on April 02, 2021, 12:44:24 pm
Quote from: wylton
str: string;
BytesOf(str) gives a different result than TEncoding.UTF8.GetBytes(str).

You are most likely looking at console output, which by default does not use UTF-8 on Windows. But I think if you convert the string to WideString / UnicodeString you may see a different result.

Recent editions of Delphi use UnicodeString, so this seems to work on Windows.

Of course, the console also needs to be set up for wide strings, so it may be all for naught!  :o
Title: Re: FPC string is not utf8string?
Post by: Bi0T1N on April 02, 2021, 02:50:55 pm
Quote from: wylton
str: string;
BytesOf(str) gives a different result than TEncoding.UTF8.GetBytes(str).

It also depends on the mode you use. E.g. in {$mode delphi} str will be of type AnsiString, while in {$mode delphiunicode} it'll be a UnicodeString. Thus one character is 1 byte or 2 bytes long.
Title: Re: FPC string is not utf8string?
Post by: Thaddy on May 18, 2021, 10:36:41 am
Quote from: Bi0T1N
It also depends on the mode you use. E.g. in {$mode delphi} str will be of type AnsiString, while in {$mode delphiunicode} it'll be a UnicodeString. Thus one character is 1 byte or 2 bytes long.

That is wrong. Per character: UnicodeString (UTF-16) is 2 or 4 bytes long, UTF8String is 1..4 bytes long, AnsiString is 1 byte long, and only UCS-2 (the precursor to UTF-16) is strictly 2 bytes long.
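
The per-character sizes for UTF-8 and UTF-16 can be checked with a short sketch. This is Python rather than Pascal, used only because its codecs make the byte counts easy to see; bytes_per_char is a made-up helper name.

```python
def bytes_per_char(ch: str) -> dict:
    """Hypothetical helper: encoded size in bytes of one character."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # "utf-16-le" is used so that no byte order mark is counted
        "utf-16": len(ch.encode("utf-16-le")),
    }


samples = {
    "A": "A",                         # U+0041, ASCII
    "e-acute": "\u00e9",              # U+00E9
    "euro": "\u20ac",                 # U+20AC
    "grinning face": "\U0001F600",    # U+1F600, outside the BMP
}

for name, ch in samples.items():
    print(name, bytes_per_char(ch))
```

Characters in the Basic Multilingual Plane take one 16-bit code unit in UTF-16, and anything above it takes a surrogate pair, so the UTF-16 sizes come out as either 2 or 4 bytes.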
Title: Re: FPC string is not utf8string?
Post by: Remy Lebeau on May 18, 2021, 06:05:27 pm
Quote from: Thaddy
... AnsiString is 1 byte long ...

AnsiString supports MBCS characters, so it is not limited to 1-byte characters either (as is evident from the fact that you can store a UTF-8 string in an AnsiString).
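
As a rough illustration of the MBCS point, again in Python rather than Pascal: a byte string is just a container, and one character can occupy several of its bytes. The choice of GBK (the codepage 936 family) as the example ANSI codepage is an assumption made for this sketch.

```python
ch = "\u4e2d"  # CJK ideograph U+4E2D

# Assumption: the ANSI codepage is a double-byte one such as GBK (cp936 family).
gbk_bytes = ch.encode("gbk")
utf8_bytes = ch.encode("utf-8")

# Both results are plain byte strings, like the payload of an AnsiString.
print(len(gbk_bytes), gbk_bytes.hex())
print(len(utf8_bytes), utf8_bytes.hex())
```

In both cases the single character needs more than one byte, so the length in bytes and the number of characters differ.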
Title: Re: FPC string is not utf8string?
Post by: Remy Lebeau on May 18, 2021, 06:12:23 pm
Quote from: wylton
str: string;
BytesOf(str) gives a different result than TEncoding.UTF8.GetBytes(str).

BytesOf(str) will return the raw bytes of str's characters as-is.

If string is AnsiString, then TEncoding.UTF8.GetBytes(str) will first have to convert str to Unicode using DefaultSystemCodePage, which will corrupt data if DefaultSystemCodePage is set to a different encoding than str is using. And then GetBytes() will convert the resulting Unicode to UTF-8.

If string is UnicodeString, then that DefaultSystemCodePage conversion is skipped, and GetBytes() will convert str as-is straight to UTF-8.
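
The two paths described above can be modelled in a few lines. This is a Python sketch of the idea, not the FPC or Delphi implementation; the functions bytes_of and utf8_get_bytes and the default_codepage parameter are hypothetical stand-ins for BytesOf, TEncoding.UTF8.GetBytes and DefaultSystemCodePage.

```python
def bytes_of(raw: bytes) -> bytes:
    """Model of BytesOf on an AnsiString: the stored bytes, unchanged."""
    return raw


def utf8_get_bytes(raw: bytes, default_codepage: str) -> bytes:
    """Model of TEncoding.UTF8.GetBytes on an AnsiString:
    first decode to Unicode using the default codepage, then encode as UTF-8."""
    return raw.decode(default_codepage).encode("utf-8")


# Case 1: the string holds cp1252 data and the default codepage is cp1252.
# The results differ, but both are correct for their encoding.
stored_ansi = "\u00e9".encode("cp1252")
case1_raw = bytes_of(stored_ansi)
case1_utf8 = utf8_get_bytes(stored_ansi, "cp1252")

# Case 2: the string actually holds UTF-8, but the default codepage is cp1252.
# The decode step misreads each byte as its own character, corrupting the data.
stored_utf8 = "\u00e9".encode("utf-8")
case2_ok = utf8_get_bytes(stored_utf8, "utf-8")
case2_bad = utf8_get_bytes(stored_utf8, "cp1252")

print(case1_raw.hex(), case1_utf8.hex())
print(case2_ok.hex(), case2_bad.hex())
```

Case 1 matches the observation in the original question: the raw bytes and the UTF-8 bytes legitimately differ. Case 2 is the corruption scenario, where the bytes were UTF-8 all along and get encoded a second time.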