[Solved] Defining Unicode Character Constants

PascalDragon:

--- Quote from: ArminLinder on January 12, 2024, 11:54:00 am ---If so, why then do the characters display properly, if I use $codepage UTF-8, and btw, isn't UTF-8 a character encoding scheme and not a character mapping table (codepage)? I assumed, that $codepage UTF-8 could be a naming mispick, meaning "expect character literals to be specified using UTF-8 encoding", having nothing to do with codepages.
--- End quote ---

It's named "code page" because Windows also treats it as a code page (namely CP_UTF8), though there it counts as a so-called “multi-byte code page”.

Fun fact: UTF-16 is also considered a code page in Windows (CP_UTF16 and CP_UTF16BE).
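
To make the "encoding as code page" idea concrete, here is a rough sketch in Python rather than Pascal, purely for illustration. It assumes the usual Windows code page identifiers (65001 for UTF-8, 1200 for UTF-16 little endian, 1201 for UTF-16 big endian), which as far as I know are also the values behind CP_UTF8, CP_UTF16 and CP_UTF16BE. The table and function names are made up for this example.

```python
# Windows code page identifiers for the Unicode encodings mentioned above,
# mapped to the matching Python codec names.
# WINDOWS_UNICODE_CODE_PAGES and encode_for_code_page are hypothetical names.
WINDOWS_UNICODE_CODE_PAGES = {
    65001: "utf-8",      # CP_UTF8
    1200: "utf-16-le",   # UTF-16, little endian
    1201: "utf-16-be",   # UTF-16, big endian
}


def encode_for_code_page(text, code_page):
    """Encode text with the codec that corresponds to a code page number."""
    return text.encode(WINDOWS_UNICODE_CODE_PAGES[code_page])


if __name__ == "__main__":
    # Same character, three different byte sequences.
    for cp in WINDOWS_UNICODE_CODE_PAGES:
        print(cp, encode_for_code_page("\u00e9", cp).hex(" "))
```

From the caller's point of view the number only selects a byte representation, just like a classic single-byte code page would.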

Thaddy:
Isn't CP_UTF16BE just UCS2? The precursor of Unicode?
Strictly two bytes and without expanding codepoints to max 4 bytes?
That is not Unicode as we know it.

Zoran:

--- Quote from: Thaddy on January 12, 2024, 06:11:41 pm ---Isn't CP_UTF16BE just UCS2? The precursor of Unicode?
Strictly two bytes and without expanding codepoints to max 4 bytes?
That is not Unicode as we know it.

--- End quote ---

I don't think so. Like UCS2 and unlike UTF8, UTF16 still suffers from endianness ambiguity. Although in UTF16 a character can take either one or two 16-bit words (byte pairs), each word can be stored either big or little endian.

UTF8 doesn't have this problem, as a UTF8 encoded string is an array of 8-bit bytes, whereas a UTF16 encoded string is an array of 16-bit words.

So, there are still two variants of UTF16 encoding - big and little endian.
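
A small sketch of the difference, in Python for illustration only (the helper name is hypothetical). It encodes the same text in UTF-8 and in both UTF-16 byte orders, so the byte swap inside each 16-bit word becomes visible.

```python
def encoded_bytes(text):
    """Return the bytes of text in UTF-8 and in both UTF-16 byte orders."""
    codecs = ("utf-8", "utf-16-le", "utf-16-be")
    return {codec: text.encode(codec) for codec in codecs}


if __name__ == "__main__":
    # "A" needs one 16-bit word in UTF-16; U+1F600 lies outside the
    # Basic Multilingual Plane and needs two words (a surrogate pair).
    for sample in ("A", "\U0001F600"):
        for codec, data in encoded_bytes(sample).items():
            print(f"{codec:10} {data.hex(' ')}")
```

The two UTF-16 results contain the same words in the same order; only the two bytes inside each word are swapped. The UTF-8 result has no such variant.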

PascalDragon:

--- Quote from: Thaddy on January 12, 2024, 06:11:41 pm ---Isn't CP_UTF16BE just UCS2? The precursor of Unicode?
Strictly two bytes and without expanding codepoints to max 4 bytes?
That is not Unicode as we know it.

--- End quote ---

It changed from UCS2 to full UTF-16 once Windows gained full UTF-16 support with Windows 2000.
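
What "full UTF-16" adds over UCS2 is the surrogate pair mechanism for code points above U+FFFF. A minimal sketch of that arithmetic, in Python for illustration only and with a hypothetical function name:

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0xFFFF:
        # Basic Multilingual Plane: a single unit, identical to UCS2.
        return [code_point]
    # Above U+FFFF: subtract 0x10000, then split the remaining 20 bits
    # into a high surrogate and a low surrogate of 10 bits each.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x1F600):
        units = to_utf16_units(cp)
        print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))
```

A strict UCS2 consumer only handles the first branch and sees the two surrogates as two separate, meaningless 16-bit values.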