Author Topic: FPC string is not utf8string?  (Read 3126 times)

wylton

  • Jr. Member
  • **
  • Posts: 50
FPC string is not utf8string?
« on: April 02, 2021, 09:34:16 am »
str: string;
BytesOf(str) is different from TEncoding.UTF8.GetBytes(str).

Bart

  • Hero Member
  • *****
  • Posts: 5275
    • Bart en Mariska's Webstek
Re: FPC string is not utf8string?
« Reply #1 on: April 02, 2021, 10:48:15 am »
By default the type String uses the system codepage (CP_ACP).
On Windows this is some ANSI codepage; in the common single-byte codepages (such as Windows-1252) each character is exactly 1 byte, and therefore only 256 characters exist.
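
To see which codepage is actually in play on a given machine, a minimal sketch follows. It assumes FPC 3.0 or later, where DefaultSystemCodePage and StringCodePage live in the System unit; the program name and the sample text are made up for illustration, and the values printed depend on the platform and its settings.

```pascal
program ShowCodePage;
{$mode objfpc}{$H+}

var
  S: string; // an AnsiString with declared codepage CP_ACP under {$H+}
begin
  S := 'abc';
  // Codepage the RTL assumes for strings declared as CP_ACP
  WriteLn('DefaultSystemCodePage = ', DefaultSystemCodePage);
  // Codepage recorded in the header of this particular string
  WriteLn('StringCodePage(S)     = ', StringCodePage(S));
end.
```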

Bart

jamie

  • Hero Member
  • *****
  • Posts: 6090
Re: FPC string is not utf8string?
« Reply #2 on: April 02, 2021, 12:44:24 pm »
str: string;
BytesOf(str) is different from TEncoding.UTF8.GetBytes(str).

You are most likely using console output, which by default does not use UTF-8 on Windows. But I think if you convert that to WideString / UnicodeString you may see different results.

Recent editions of Delphi use UnicodeString, so this seems to work on Windows.

Of course, the console also needs to be set to wide strings, so it may be all for naught!  :o
The only true wisdom is knowing you know nothing

Bi0T1N

  • Jr. Member
  • **
  • Posts: 85
Re: FPC string is not utf8string?
« Reply #3 on: April 02, 2021, 02:50:55 pm »
str: string;
BytesOf(str) is different from TEncoding.UTF8.GetBytes(str).
It also depends on the mode you use. E.g. in {$mode delphi} str will be of type AnsiString, while in {$mode delphiunicode} it'll be a UnicodeString. Thus one character is 1 byte or 2 bytes long.
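
A quick way to check what string maps to in a given mode is to print the size of Char, which is the size of one code unit. This is only a sketch, assuming FPC 3.0 or later; the program name is illustrative.

```pascal
program StringKind;
{$mode delphiunicode} // change to {$mode delphi} to compare

var
  S: string;
begin
  S := 'a';
  // Size of one code unit: 2 when string = UnicodeString,
  // 1 when string = AnsiString
  WriteLn('SizeOf(Char) = ', SizeOf(Char));
  // Total bytes of character data held by S
  WriteLn('Bytes in S   = ', Length(S) * SizeOf(Char));
end.
```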

Thaddy

  • Hero Member
  • *****
  • Posts: 14205
  • Probably until I exterminate Putin.
Re: FPC string is not utf8string?
« Reply #4 on: May 18, 2021, 10:36:41 am »
It also depends on the mode you use. E.g. in {$mode delphi} str will be of type AnsiString, while in {$mode delphiunicode} it'll be a UnicodeString. Thus one character is 1 byte or 2 bytes long.
That is wrong. Per character: UnicodeString (UTF-16) is 2 or 4 bytes long, UTF8String is 1..4 bytes long, AnsiString is 1 byte long, and only UCS-2, the precursor to UTF-16, is strictly 2 bytes long.
« Last Edit: May 18, 2021, 10:44:06 am by Thaddy »
Specialize a type, not a var.

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1312
    • Lebeau Software
Re: FPC string is not utf8string?
« Reply #5 on: May 18, 2021, 06:05:27 pm »
... AnsiString is 1 byte long ...

AnsiString supports MBCS characters, so it is not limited to 1-byte characters either (as is evident from the fact that you can store a UTF-8 string in an AnsiString).
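
The per-character sizes discussed here can be checked with a short sketch. It assumes FPC 3.0 or later and builds the strings from code point values so that the source file encoding does not matter; the program and variable names are illustrative. Note that Length counts code units (WideChars or bytes), not characters.

```pascal
program CodeUnits;
{$mode objfpc}{$H+}

var
  U16: UnicodeString;
  U8: UTF8String; // declared as AnsiString(CP_UTF8) in FPC
begin
  // U+20AC EURO SIGN: one UTF-16 code unit (2 bytes), three UTF-8 bytes
  U16 := WideChar($20AC);
  U8 := UTF8Encode(U16);
  WriteLn('U+20AC : UTF-16 units = ', Length(U16), ', UTF-8 bytes = ', Length(U8));

  // U+1F600: a surrogate pair in UTF-16 (4 bytes), four UTF-8 bytes
  SetLength(U16, 2);
  U16[1] := WideChar($D83D); // high surrogate
  U16[2] := WideChar($DE00); // low surrogate
  U8 := UTF8Encode(U16);
  WriteLn('U+1F600: UTF-16 units = ', Length(U16), ', UTF-8 bytes = ', Length(U8));
end.
```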
« Last Edit: May 18, 2021, 06:13:06 pm by Remy Lebeau »
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1312
    • Lebeau Software
Re: FPC string is not utf8string?
« Reply #6 on: May 18, 2021, 06:12:23 pm »
str: string;
BytesOf(str) is different from TEncoding.UTF8.GetBytes(str).

BytesOf(str) will return the raw bytes of the str's characters as-is.

If string is AnsiString, then TEncoding.UTF8.GetBytes(str) will first have to convert str to Unicode using DefaultSystemCodePage, which will corrupt data if DefaultSystemCodePage is set to a different encoding than str is using. And then GetBytes() will convert the resulting Unicode to UTF-8.

If string is UnicodeString, then that DefaultSystemCodePage conversion is skipped, and GetBytes() will convert str as-is straight to UTF-8.
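
The two paths can be compared side by side. The sketch below assumes a recent FPC 3.x with SysUtils; Dump is a hypothetical helper written for this example. It stores the UTF-8 bytes of U+00E9 in a RawByteString, so BytesOf returns them unchanged, and then decodes them explicitly with UTF8Decode before calling TEncoding.UTF8.GetBytes, so the result does not depend on DefaultSystemCodePage.

```pascal
program BytesDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: prints a byte array as hex
procedure Dump(const Caption: string; const B: TBytes);
var
  I: Integer;
begin
  Write(Caption, ':');
  for I := 0 to High(B) do
    Write(' ', IntToHex(B[I], 2));
  WriteLn;
end;

var
  Raw: RawByteString;
  Wide: UnicodeString;
begin
  // UTF-8 encoding of U+00E9, written byte by byte
  SetLength(Raw, 2);
  Raw[1] := AnsiChar($C3);
  Raw[2] := AnsiChar($A9);
  // Label the bytes as UTF-8 without converting them
  SetCodePage(Raw, CP_UTF8, False);

  // Raw bytes of the string, as stored
  Dump('BytesOf', BytesOf(Raw));

  // Explicit decode with the encoding the bytes really use,
  // then re-encode the resulting UTF-16 text as UTF-8
  Wide := UTF8Decode(Raw);
  Dump('UTF8.GetBytes', TEncoding.UTF8.GetBytes(Wide));
end.
```

If the decode step used a codepage other than the one the bytes were written in, the intermediate UnicodeString would already hold the wrong characters, and GetBytes would faithfully encode that wrong text.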
« Last Edit: May 18, 2021, 06:13:57 pm by Remy Lebeau »
