
Author Topic: Distinguish between UTF-8, UTF-16, UTF-32  (Read 19320 times)

Tommi

  • Full Member
  • ***
  • Posts: 213
Distinguish between UTF-8, UTF-16, UTF-32
« on: October 28, 2016, 03:11:27 pm »
Is there a way to distinguish between UTF-8, UTF-16 and UTF-32 without implementing these encodings at a low level?

For example, in UTF-8 every byte after the first byte of a multi-byte character starts with the bit pattern 10. Is there a ready-made library to check this?
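What I mean is a hand-rolled scan like this (just a sketch I made up to illustrate the idea; it tests the 10xxxxxx continuation pattern but does no overlong-sequence or code-point range checks):

Code: Pascal
// assumes {$mode objfpc} (or Delphi mode) for Exit(False)
function LooksLikeUTF8(const Buf: array of Byte): Boolean;
var
  i, Extra: Integer;
  b: Byte;
begin
  Result := True;
  i := 0;
  while i <= High(Buf) do
  begin
    b := Buf[i];
    if b < $80 then Extra := 0                  // 0xxxxxxx: plain ASCII
    else if (b and $E0) = $C0 then Extra := 1   // 110xxxxx: 2-byte sequence
    else if (b and $F0) = $E0 then Extra := 2   // 1110xxxx: 3-byte sequence
    else if (b and $F8) = $F0 then Extra := 3   // 11110xxx: 4-byte sequence
    else Exit(False);                           // not a valid lead byte
    Inc(i);
    while Extra > 0 do
    begin
      // every continuation byte must match 10xxxxxx
      if (i > High(Buf)) or ((Buf[i] and $C0) <> $80) then
        Exit(False);
      Inc(i);
      Dec(Extra);
    end;
  end;
end;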

Thank you

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #1 on: October 28, 2016, 03:46:30 pm »
You should know what the character width is, either from a BOM or some other "header". Since UTF-8 (AFAIK) will never contain any null bytes, you could search for null bytes in order to detect UTF16 encoding. UTF16 will (AFAIK) never contain any null words, so if null words are present, UTF32 must be assumed. Searching for null bytes / null words to detect the character width is not reliable, though!
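For the BOM case, something like this rough sketch would do (my own illustration, not library code; note that a BOM is optional, so deUnknown will be a common result):

Code: Pascal
type
  TDetectedEncoding = (deUnknown, deUTF8, deUTF16LE, deUTF16BE, deUTF32LE, deUTF32BE);

// assumes {$mode objfpc} for Exit(value)
function DetectByBOM(const Buf: array of Byte): TDetectedEncoding;
begin
  Result := deUnknown;
  // test UTF-32 first: the UTF-32 LE BOM (FF FE 00 00) starts with
  // the UTF-16 LE BOM (FF FE)
  if (Length(Buf) >= 4) and (Buf[0] = $FF) and (Buf[1] = $FE) and
     (Buf[2] = $00) and (Buf[3] = $00) then Exit(deUTF32LE);
  if (Length(Buf) >= 4) and (Buf[0] = $00) and (Buf[1] = $00) and
     (Buf[2] = $FE) and (Buf[3] = $FF) then Exit(deUTF32BE);
  if (Length(Buf) >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and
     (Buf[2] = $BF) then Exit(deUTF8);
  if (Length(Buf) >= 2) and (Buf[0] = $FF) and (Buf[1] = $FE) then Exit(deUTF16LE);
  if (Length(Buf) >= 2) and (Buf[0] = $FE) and (Buf[1] = $FF) then Exit(deUTF16BE);
end;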

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #2 on: October 28, 2016, 06:43:40 pm »
In LazUtf8 there is a function that reliably tells whether a given string is UTF8.

Bart

wp

  • Hero Member
  • *****
  • Posts: 11923
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #3 on: October 28, 2016, 07:03:12 pm »
Bart, you seem to refer to "ValidUTF8String()". It contains this code:

Code: Pascal
function ValidUTF8String(const s: String): String;
begin
  // .... //
  Result := '';
  cur := p;
  while cur^ <> #0 do
  begin
    l := UTF8CharacterLength(cur);
    if (l = 1) and (cur^ < #32) then
      Result := Result + '#' + IntToStr(Ord(cur^))
    else
  // ... //

This code terminates scanning the string if it finds a #0. Therefore the input string cannot contain a #0. Try this; it will display only 'abc', instead of 'abc'#0'123':

Code: Pascal
  ShowMessage(ValidUTF8String('abc'#0'123'));

I know - C strings have the zero at their end, but Pascal strings don't. Wouldn't it be better to scan the string up to its Length, which is well known in the case of a Pascal string?
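Something like this is what I have in mind (only a sketch; UTF8CharacterLength is the same LazUtf8 function as in the snippet above):

Code: Pascal
uses LazUTF8;

procedure ScanWholeString(const s: String);
var
  cur, strEnd: PChar;
begin
  cur := PChar(s);
  strEnd := cur + Length(s);   // Length(s) is stored with a Pascal string, O(1)
  while cur < strEnd do
  begin
    // ... process one code point of UTF8CharacterLength(cur) bytes here,
    // embedded #0 included ...
    Inc(cur, UTF8CharacterLength(cur));
  end;
end;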

Or does the term "UTF8" automatically imply "no embedded zeros"?
« Last Edit: October 28, 2016, 07:10:34 pm by wp »

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #4 on: October 28, 2016, 07:38:00 pm »
Or does the term "UTF8" automatically imply "no embedded zeros"?

#0 is a non-visual control character that indicates end of string. It cannot exist in a valid UTF8 string.

wp

  • Hero Member
  • *****
  • Posts: 11923
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #5 on: October 28, 2016, 07:56:28 pm »
Or does the term "UTF8" automatically imply "no embedded zeros"?

#0 is a non-visual control character that indicates end of string. It cannot exist in a valid UTF8 string.

I found this here: https://en.wikipedia.org/wiki/UTF-8
Quote
In Modified UTF-8 (MUTF-8), the null character (U+0000) uses the two-byte overlong encoding 11000000 10000000 (hexadecimal C0 80), instead of 00000000 (hexadecimal 00). Modified UTF-8 strings never contain any actual null bytes but can contain all Unicode code points including U+0000,[28] which allows such strings (with a null byte appended) to be processed by traditional null-terminated string functions.
It says that "Modified UTF-8" strings never contain embedded zeros, for sure. But is Lazarus/FPC working with "modified" or "standard" UTF-8? If it supports modified UTF-8, then ValidUTF8String() should replace the zero byte by the two-byte overlong encoding of U+0000 (C0 80). If it does not, it should replace it by '#0' like the other control characters. But it should never truncate the string. This looks like a bug to me.
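Just to illustrate what the "modified" scheme does with an embedded zero (my own sketch, not LazUtf8 code; it ignores the other MUTF-8 peculiarity, the special handling of supplementary characters):

Code: Pascal
// each #0 becomes the overlong two-byte sequence $C0 $80,
// all other bytes pass through unchanged
function ToModifiedUTF8(const s: String): String;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
    if s[i] = #0 then
      Result := Result + #$C0#$80   // overlong encoding of U+0000
    else
      Result := Result + s[i];
end;

ToModifiedUTF8('abc'#0'123') then contains no null byte at all, yet still represents all seven characters.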

[EDIT]
BTW, why does ValidUTF8String() convert a control character to '#' + IntToStr(Ord(character))? This is a massive modification of the input string. What is the purpose of ValidUTF8String()?
« Last Edit: October 28, 2016, 08:21:28 pm by wp »

Tommi

  • Full Member
  • ***
  • Posts: 213
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #6 on: October 28, 2016, 08:03:32 pm »
OK, you put me on the right track.

Thank you guys

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #7 on: October 28, 2016, 08:06:34 pm »
The question in this thread is how to determine whether (I assume) an untyped buffer is UTF8, UTF16 or UTF32. A null byte (or #0) cannot exist in a valid UTF8 string - if #0 is present, it will be encoded to two bytes, of which neither is null. Therefore a null byte in the untyped buffer must indicate that the buffer contains UTF16 or UTF32. The same goes for a null word, which cannot exist in a valid UTF16 string. So the presence of null bytes and null words can be used to determine the encoding used - even if it is not safe to do so (UTF32 can have a null byte, and UTF16 can have no null bytes at all).
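In code, the idea looks roughly like this (again: a heuristic only, not a reliable detection):

Code: Pascal
type
  TGuess = (gUTF8, gUTF16, gUTF32);

// looks for null words only at even offsets, where they would sit
// in a real UTF-16/UTF-32 buffer
function GuessEncoding(const Buf: array of Byte): TGuess;
var
  i: Integer;
  HasNullByte, HasNullWord: Boolean;
begin
  HasNullByte := False;
  HasNullWord := False;
  for i := 0 to High(Buf) do
    if Buf[i] = 0 then
      HasNullByte := True;
  i := 0;
  while i + 1 <= High(Buf) do
  begin
    if (Buf[i] = 0) and (Buf[i + 1] = 0) then
      HasNullWord := True;
    Inc(i, 2);
  end;
  if HasNullWord then
    Result := gUTF32         // null words should not appear in UTF16 text
  else if HasNullByte then
    Result := gUTF16         // null bytes should not appear in UTF8 text
  else
    Result := gUTF8;
end;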

wp

  • Hero Member
  • *****
  • Posts: 11923
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #8 on: October 28, 2016, 08:38:18 pm »
if #0 is present, it will be encoded to two bytes, of which neither is null.
Not sure... I'm not a UTF-8 specialist, forgive me if this is nonsense: but the Wikipedia article above says this is true for "Modified UTF-8". From this I conclude that there is also a "standard UTF-8" in which the zero is allowed as a control character within the string. In Pascal this does not cause trouble with the string length, because the length is stored in a descriptor in front of the characters, not indirectly as an appended null as in C strings.

Let me say it in another way: if I create a string in Lazarus - which uses UTF-8 - as 'abc'#0'äöüß', this is certainly a UTF8 string, but it is recognized by your algorithm as UTF16 or UTF32.
« Last Edit: October 28, 2016, 08:43:31 pm by wp »

Fungus

  • Sr. Member
  • ****
  • Posts: 353
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #9 on: October 28, 2016, 08:50:32 pm »
A null-byte cannot exist in a valid UTF8 encoded string (period!). The standard format does not allow it at all and the modified format encodes the null-byte to two non-null bytes. I'm not 100% sure, but the modified flavour is a Java thingy: http://docs.oracle.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8

Let me say it in another way: if I create a string in Lazarus - which uses UTF-8 - as 'abc'#0'äöüß', this is certainly a UTF8 string, but it is recognized by your algorithm as UTF16 or UTF32.

Please don't mistake a textual representation for the binary representation. Setting a string to 'abc'#0'äöüß' will result in a UTF8 string of value 'abc', since the #0 is invalid.
« Last Edit: October 28, 2016, 08:56:50 pm by Fungus »

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1314
    • Lebeau Software
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #10 on: October 28, 2016, 10:05:42 pm »
the wikipedia article above says this is true for "modified utf-8". From this I conclude that there is also a "standard utf-8"

Yes, there is: https://tools.ietf.org/html/rfc3629

"Modified UTF-8" is only used by Java, and even then only for string serialization in the DataInput and DataOutput classes.  Java also supports standard UTF-8 for normal string handling.  Everything and everyone else uses standard UTF-8.

in which the zero is allowed as a control character within the string.

That is absolutely true in standard UTF-8, which does not restrict any control characters, including null (and neither does any other standard UTF, for that matter).
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1314
    • Lebeau Software
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #11 on: October 28, 2016, 10:13:34 pm »
A null-byte cannot exist in a valid UTF8 encoded string (period!).

That is absolutely wrong.  Standard UTF-8 does not restrict use of null in any way.  It is a perfectly valid and acceptable character to encode.  The official RFC that defines UTF-8, RFC 3629, even says so, pointing out several times that U+0000 is acceptable and is encoded as a 0x00 byte.

Now, it may be that UTF-8 *when used in the context of something else* might not allow nulls, but that would be a restriction of that "something else", UTF-8 itself does not restrict it.

The standard format does not allow it at all

Yes, it does.

and the modified format encodes the null-byte to two non-null bytes.

That is why it is "modified".  But even then, the fact that it encodes null characters means that input strings are allowed to have null characters to begin with; they are simply not encoded using null bytes.

I'm not 100% sure, but the modified flavour is a Java thingy

Yes, it is.  Nobody else uses it.

Please don't mistake a textual representation with the binary representation. Setting a string to 'abc'#0'äöüß' will result in a UTF8 string of value 'abc' since the #0 is invalid.

#0 is a perfectly valid string character.  It is only C-style strings that treat #0 specially.  Other languages, including Pascal, don't have that restriction.
« Last Edit: October 28, 2016, 10:16:09 pm by Remy Lebeau »
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

wp

  • Hero Member
  • *****
  • Posts: 11923
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #12 on: October 28, 2016, 11:45:47 pm »
Thanks, Remy, for this clarification.

So, you would agree that the function ValidUTF8String() in LazUtf8, which truncates a string at an embedded null, is faulty? If yes, I'll write a bug report. And an eye should be kept on the other string conversion/processing/parsing routines, because I've seen the check for #0 in almost every other routine in LazUtf8.

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1314
    • Lebeau Software
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #13 on: October 29, 2016, 02:28:31 am »
So, you would agree that the function ValidUTF8String() in LazUtf8 which truncates a string at an embedded null is faulty?

I don't want to speculate on that, as I don't know the intended purpose of ValidUTF8String().  Truncating on #0 may be faulty or valid.  *Logically*, I would lean towards faulty, but on the other hand I don't know why it is altering the string to convert all control characters to '#XX' syntax, either.  What is "invalid" about control characters like line breaks and tabs?  That seems odd to me.
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

Thaddy

  • Hero Member
  • *****
  • Posts: 14382
  • Censorship about opinions does not belong here.
Re: Distinguish between UTF-8, UTF-16, UTF-32
« Reply #14 on: October 29, 2016, 11:04:11 am »
Or does the term "UTF8" automatically imply "no embedded zeros"?

#0 is a non-visual control character that indicates end of string. It cannot exist in a valid UTF8 string.

No. Only in C. In Pascal the end of a string is determined by the string descriptor at a negative offset from the payload - and is hence more efficient.
A Pascal string type can contain embedded zeros all the time; a C string type cannot. And since UTF8String is a Pascal type, it can contain zeros. If not, that's a bug.
Unless it is part of the UTF-8 specification, which seems to say nothing of the sort.

Demo that #0 is just not printable:
Code: Pascal
program untitled;

var a: UTF8String = 'Can it contain'#0' or not';
begin
  writeln(a);
end.

Outputs:
Code:
pi@raspberrypi:~ $ ./testme
Can it contain or not
Which is correct for Pascal but not for C.
Point taken?  >:D
« Last Edit: October 29, 2016, 11:16:02 am by Thaddy »
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

 
