
Author Topic: Distinguish between UTF-8, UTF-16, UTF-32  (Read 19320 times)

Tommi

  • Full Member
  • ***
  • Posts: 213
Distinguish between UTF-8, UTF-16, UTF-32
« on: October 28, 2016, 03:11:27 pm »
Is there a way to distinguish between UTF-8, UTF-16 and UTF-32 without implementing these encodings at a low level?

For example, in UTF-8 every byte after the first byte of a multi-byte character starts with the bit pattern 10. Is there a ready-made library to check this?
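What I mean is a hand-rolled scan like this (just a sketch I made up to illustrate the idea; it tests the 10xxxxxx continuation pattern but does no overlong-sequence or code-point range checks):

Code: Pascal
// assumes {$mode objfpc} (or Delphi mode) for Exit(False)
function LooksLikeUTF8(const Buf: array of Byte): Boolean;
var
  i, Extra: Integer;
  b: Byte;
begin
  Result := True;
  i := 0;
  while i <= High(Buf) do
  begin
    b := Buf[i];
    if b < $80 then Extra := 0                  // 0xxxxxxx: plain ASCII
    else if (b and $E0) = $C0 then Extra := 1   // 110xxxxx: 2-byte sequence
    else if (b and $F0) = $E0 then Extra := 2   // 1110xxxx: 3-byte sequence
    else if (b and $F8) = $F0 then Extra := 3   // 11110xxx: 4-byte sequence
    else Exit(False);                           // not a valid lead byte
    Inc(i);
    while Extra > 0 do
    begin
      // every continuation byte must match 10xxxxxx
      if (i > High(Buf)) or ((Buf[i] and $C0) <> $80) then
        Exit(False);
      Inc(i);
      Dec(Extra);
    end;
  end;
end;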

Thank you

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #1 on: October 28, 2016, 03:46:30 pm »
You should know what the character width is, either from a BOM or some other "header". Since UTF-8 (AFAIK) will never contain any null bytes, you could search for null bytes in order to detect UTF16 encoding. UTF16 will (AFAIK) never contain any null words, so if null words are present, UTF32 must be assumed. Searching for null bytes / null words to detect the character width is not reliable, though!
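For the BOM case, something like this rough sketch would do (my own illustration, not library code; note that a BOM is optional, so deUnknown will be a common result):

Code: Pascal
type
  TDetectedEncoding = (deUnknown, deUTF8, deUTF16LE, deUTF16BE, deUTF32LE, deUTF32BE);

// assumes {$mode objfpc} for Exit(value)
function DetectByBOM(const Buf: array of Byte): TDetectedEncoding;
begin
  Result := deUnknown;
  // test UTF-32 first: the UTF-32 LE BOM (FF FE 00 00) starts with
  // the UTF-16 LE BOM (FF FE)
  if (Length(Buf) >= 4) and (Buf[0] = $FF) and (Buf[1] = $FE) and
     (Buf[2] = $00) and (Buf[3] = $00) then Exit(deUTF32LE);
  if (Length(Buf) >= 4) and (Buf[0] = $00) and (Buf[1] = $00) and
     (Buf[2] = $FE) and (Buf[3] = $FF) then Exit(deUTF32BE);
  if (Length(Buf) >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and
     (Buf[2] = $BF) then Exit(deUTF8);
  if (Length(Buf) >= 2) and (Buf[0] = $FF) and (Buf[1] = $FE) then Exit(deUTF16LE);
  if (Length(Buf) >= 2) and (Buf[0] = $FE) and (Buf[1] = $FF) then Exit(deUTF16BE);
end;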

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #2 on: October 28, 2016, 06:43:40 pm »
In LazUtf8 there is a function that reliably tells whether a given string is UTF8.

Bart

wp

  • Hero Member
  • *****
  • Posts: 11923
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #3 on: October 28, 2016, 07:03:12 pm »
Bart, you seem to refer to "ValidUTF8String()". It contains this code:

Code: Pascal
function ValidUTF8String(const s: String): String;
begin
  // .... //
  Result := '';
  cur := p;
  while cur^ <> #0 do
  begin
    l := UTF8CharacterLength(cur);
    if (l = 1) and (cur^ < #32) then
      Result := Result + '#' + IntToStr(Ord(cur^))
    else
  // ... //

This code terminates scanning the string if it finds a #0. Therefore the input string cannot contain a #0. Try this; it will display only 'abc', instead of 'abc'#0'123':

Code: Pascal
  ShowMessage(ValidUTF8String('abc'#0'123'));

I know - C strings have the zero at their end, but Pascal strings don't. Wouldn't it be better to scan the string up to its Length, which is well known in the case of a Pascal string?
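Something like this is what I have in mind (only a sketch; UTF8CharacterLength is the same LazUtf8 function as in the snippet above):

Code: Pascal
uses LazUTF8;

procedure ScanWholeString(const s: String);
var
  cur, strEnd: PChar;
begin
  cur := PChar(s);
  strEnd := cur + Length(s);   // Length(s) is stored with a Pascal string, O(1)
  while cur < strEnd do
  begin
    // ... process one code point of UTF8CharacterLength(cur) bytes here,
    // embedded #0 included ...
    Inc(cur, UTF8CharacterLength(cur));
  end;
end;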

Or does the term "UTF8" automatically imply "no embedded zeros"?
« Last Edit: October 28, 2016, 07:10:34 pm by wp »

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #4 on: October 28, 2016, 07:38:00 pm »
Or does the term "UTF8" automatically imply "no embedded zeros"?

#0 is a non-visual control character that indicates end of string. It cannot exist in a valid UTF8 string.

wp

  • Hero Member
  • *****
  • Posts: 11923
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #5 on: October 28, 2016, 07:56:28 pm »
Or does the term "UTF8" automatically imply "no embedded zeros"?

#0 is a non-visual control character that indicates end of string. It cannot exist in a valid UTF8 string.

I found this here: https://en.wikipedia.org/wiki/UTF-8
Quote
In Modified UTF-8 (MUTF-8), the null character (U+0000) uses the two-byte overlong encoding 11000000 10000000 (hexadecimal C0 80), instead of 00000000 (hexadecimal 00). Modified UTF-8 strings never contain any actual null bytes but can contain all Unicode code points including U+0000,[28] which allows such strings (with a null byte appended) to be processed by traditional null-terminated string functions.
It says that "Modified UTF-8" strings never contain embedded zeros, for sure. But is Lazarus/FPC working with "modified" or "standard" UTF-8? If it supports modified UTF-8, then ValidUTF8String() should replace the zero byte by the two-byte overlong encoding of U+0000 (C0 80). If it does not, it should replace it by '#0' like the other control characters. But it should never truncate the string. This looks like a bug to me.
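Just to illustrate what the "modified" scheme does with an embedded zero (my own sketch, not LazUtf8 code; it ignores the other MUTF-8 peculiarity, the special handling of supplementary characters):

Code: Pascal
// each #0 becomes the overlong two-byte sequence $C0 $80,
// all other bytes pass through unchanged
function ToModifiedUTF8(const s: String): String;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
    if s[i] = #0 then
      Result := Result + #$C0#$80   // overlong encoding of U+0000
    else
      Result := Result + s[i];
end;

ToModifiedUTF8('abc'#0'123') then contains no null byte at all, yet still represents all seven characters.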

[EDIT]
BTW, why does ValidUTF8String() convert a control character to '#' + IntToStr(Ord(character))? This is a massive modification of the input string. What is the purpose of ValidUTF8String()?
« Last Edit: October 28, 2016, 08:21:28 pm by wp »

Tommi

  • Full Member
  • ***
  • Posts: 213
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #6 on: October 28, 2016, 08:03:32 pm »
OK, you put me on the right track.

Thank you guys

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #7 on: October 28, 2016, 08:06:34 pm »
The question in this thread is how to determine whether (I assume) an untyped buffer is UTF8, UTF16 or UTF32. A null byte (or #0) cannot exist in a valid UTF8 string - if #0 is present, it will be encoded to two bytes, of which neither is null. Therefore a null byte in the untyped buffer must indicate that the buffer contains UTF16 or UTF32. The same goes for a null word, which cannot exist in a valid UTF16 string. So the presence of null bytes and null words can be used to determine the encoding used - even if it is not safe to do so (UTF32 can have a null byte, and UTF16 can have no null bytes at all).
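In code, the idea looks roughly like this (again: a heuristic only, not a reliable detection):

Code: Pascal
type
  TGuess = (gUTF8, gUTF16, gUTF32);

// looks for null words only at even offsets, where they would sit
// in a real UTF-16/UTF-32 buffer
function GuessEncoding(const Buf: array of Byte): TGuess;
var
  i: Integer;
  HasNullByte, HasNullWord: Boolean;
begin
  HasNullByte := False;
  HasNullWord := False;
  for i := 0 to High(Buf) do
    if Buf[i] = 0 then
      HasNullByte := True;
  i := 0;
  while i + 1 <= High(Buf) do
  begin
    if (Buf[i] = 0) and (Buf[i + 1] = 0) then
      HasNullWord := True;
    Inc(i, 2);
  end;
  if HasNullWord then
    Result := gUTF32         // null words should not appear in UTF16 text
  else if HasNullByte then
    Result := gUTF16         // null bytes should not appear in UTF8 text
  else
    Result := gUTF8;
end;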

wp

  • Hero Member
  • *****
  • Posts: 11923
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #8 on: October 28, 2016, 08:38:18 pm »
if #0 is present, it will be encoded to two bytes, of which neither is null.
Not sure... I'm not a UTF-8 specialist, forgive me if this is nonsense: but the Wikipedia article above says this is true for "Modified UTF-8". From this I conclude that there is also a "standard UTF-8" in which the zero is allowed as a control character within the string. In Pascal this does not cause trouble with the string length, because the length is stored in a descriptor in front of the characters, not indirectly as an appended null as in C strings.

Let me say it in another way: if I create a string in Lazarus - which uses UTF-8 - as 'abc'#0'äöüß', this is certainly a UTF8 string, but it is recognized by your algorithm as UTF16 or UTF32.
« Last Edit: October 28, 2016, 08:43:31 pm by wp »

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #9 on: October 28, 2016, 08:50:32 pm »
A null-byte cannot exist in a valid UTF8 encoded string (period!). The standard format does not allow it at all and the modified format encodes the null-byte to two non-null bytes. I'm not 100% sure, but the modified flavour is a Java thingy: http://docs.oracle.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8

Let me say it in another way: if I create a string in Lazarus - which uses UTF-8 - as 'abc'#0'äöüß', this is certainly a UTF8 string, but it is recognized by your algorithm as UTF16 or UTF32.

Please don't mistake a textual representation for the binary representation. Setting a string to 'abc'#0'äöüß' will result in a UTF8 string of value 'abc', since the #0 is invalid.
« Last Edit: October 28, 2016, 08:56:50 pm by Fungus »

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1314
    • Lebeau Software
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #10 on: October 28, 2016, 10:05:42 pm »
the wikipedia article above says this is true for "modified utf-8". From this I conclude that there is also a "standard utf-8"

Yes, there is: https://tools.ietf.org/html/rfc3629

"Modified UTF-8" is only used by Java, and even then only for string serialization in the DataInput and DataOutput classes.  Java also supports standard UTF-8 for normal string handling.  Everything and everyone else uses standard UTF-8.

in which the zero is allowed as a control character within the string.

That is absolutely true in standard UTF-8, which does not restrict any control characters, including null (and neither does any other standard UTF, for that matter).
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1314
    • Lebeau Software
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #11 on: October 28, 2016, 10:13:34 pm »
A null-byte cannot exist in a valid UTF8 encoded string (period!).

That is absolutely wrong.  Standard UTF-8 does not restrict use of null in any way.  It is a perfectly valid and acceptable character to encode.  The official RFC that defines UTF-8, RFC 3629, even says so, pointing out several times that U+0000 is acceptable and is encoded as a 0x00 byte.

Now, it may be that UTF-8 *when used in the context of something else* might not allow nulls, but that would be a restriction of that "something else", UTF-8 itself does not restrict it.

The standard format does not allow it at all

Yes, it does.

and the modified format encodes the null-byte to two non-null bytes.

That is why it is "modified".  But even then, the fact that it encodes null characters means that input strings are allowed to have null characters to begin with; they are simply not encoded using null bytes.

I'm not 100% sure, but the modified flavour is a Java thingy

Yes, it is.  Nobody else uses it.

Please don't mistake a textual representation with the binary representation. Setting a string to 'abc'#0'äöüß' will result in a UTF8 string of value 'abc' since the #0 is invalid.

#0 is a perfectly valid string character.  It is only C-style strings that treat #0 specially.  Other languages, including Pascal, don't have that restriction.
« Last Edit: October 28, 2016, 10:16:09 pm by Remy Lebeau »
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

wp

  • Hero Member
  • *****
  • Posts: 11923
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #12 on: October 28, 2016, 11:45:47 pm »
Thanks, Remy, for this clarification.

So, you would agree that the function ValidUTF8String() in LazUtf8, which truncates a string at an embedded null, is faulty? If yes, I'll write a bug report. And an eye should be kept on the other string conversion/processing/parsing routines, because I've seen the check for #0 in almost every other routine in LazUtf8.

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1314
    • Lebeau Software
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #13 on: October 29, 2016, 02:28:31 am »
So, you would agree that the function ValidUTF8String() in LazUtf8 which truncates a string at an embedded null is faulty?

I don't want to speculate on that, as I don't know the intended purpose of ValidUTF8String().  Truncating on #0 may be faulty or valid.  *Logically*, I would lean towards faulty, but on the other hand I don't know why it is altering the string to convert all control characters to '#XX' syntax, either.  What is "invalid" about control characters like line breaks and tabs?  That seems odd to me.
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

Thaddy

  • Hero Member
  • *****
  • Posts: 14382
  • Censorship about opinions does not belong here.
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #14 on: October 29, 2016, 11:04:11 am »
Or does the term "UTF8" automatically imply "no embedded zeros"?

#0 is a non-visual control character that indicates end of string. It cannot exist in a valid UTF8 string.

No. Only in C. In Pascal the end of a string is determined by the string descriptor at a negative offset from the payload - and is hence more efficient.
A Pascal string type can contain embedded zeros all the time; a C string type cannot. And since UTF8String is a Pascal type, it can contain zeros. If not, that's a bug.
Unless it is part of the UTF-8 specification, which seems to say nothing of the sort.

Demo that #0 is just not printable:
Code: Pascal
program untitled;

var a: UTF8String = 'Can it contain'#0' or not';
begin
  writeln(a);
end.

Outputs:
Code:
pi@raspberrypi:~ $ ./testme
Can it contain or not
Which is correct for Pascal but not for C.
Point taken?  >:D
« Last Edit: October 29, 2016, 11:16:02 am by Thaddy »
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

 
