
Author Topic: Moving several units from UTF-16 to UTF-8  (Read 8955 times)

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: Moving several units from UTF-16 to UTF-8
« Reply #15 on: August 23, 2017, 09:27:45 pm »
The following code is taken from a unit called ATStringProc

Code: Pascal
type
  atString = UnicodeString;
  atChar   = WideChar;
  PatChar  = PWideChar;

function SCharUpper(ch: atChar): atChar;
function SCharLower(ch: atChar): atChar;
function SCaseTitle(const S, SWordChars: atString): atString;
function SCaseInvert(const S: atString): atString;
function SCaseSentence(const S, SWordChars: atString): atString;


I would need to switch atChar and atString from WideChar and UnicodeString to String.

Why not just update the atChar functions to work with UTF-16 strings instead? 

I'm not following here, Remy. atChar is an alias for WideChar, as you can see from the type definition above. It's not a function. I suppose you mean something like this:
Code: Pascal
type
{$IfDef UsingUTF8}
  atChar   = String;
  atString = String;
{$Else}
  atChar   = WideChar;
  atString = UnicodeString;
{$EndIf}


Quote
And then you can update the call sites to take UTF-16 surrogates into account when creating substrings to pass to the functions.  That is going to be far less work than switching everything to UTF-8.  You would have to change the functions to use strings anyway to handle 2-4 byte UTF-8 sequences, so you may as well do the same work to handle 2- and 4-byte UTF-16 sequences instead.

My parser is UTF-8, which is why I would rather do everything in Unicode. Since Lazarus supports the full Unicode space using UTF-8, I'd rather stick to that.

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1314
    • Lebeau Software
Re: Moving several units from UTF-16 to UTF-8
« Reply #16 on: August 23, 2017, 09:54:53 pm »
I'm not following here, Remy. atChar is an alias for WideChar

I know. What I'm saying is that when dealing with UTF-16, you sometimes have to deal with UTF-16 surrogates (2 WideChars acting together), so it is better to change the single-WideChar functions (in this case, SCharUpper() and SCharLower()) to NOT operate on a single WideChar anymore, but to operate on a String instead, so it can hold 1 or 2 WideChars depending on the value of the codepoint being operated on.  That way, you can take surrogates into account when needed to get the right output.
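A minimal sketch of what such a change could look like, assuming FPC 3.x's Character unit (whose string overloads are generated from the Unicode tables); SCodePointUpper is a hypothetical name, not part of ATStringProc:

```pascal
{$mode objfpc}{$H+}
uses
  Character; // RTL Unicode classification/mapping unit; availability assumed

{ Hypothetical replacement for SCharUpper: takes a string holding one
  codepoint (1 or 2 WideChars) instead of a single WideChar. }
function SCodePointUpper(const CP: UnicodeString): UnicodeString;
begin
  // TCharacter.ToUpper on a string upper-cases codepoints, handling
  // surrogate pairs where the underlying tables provide a mapping.
  Result := TCharacter.ToUpper(CP);
end;
```

A BMP character still passes through as a 1-WideChar string, so call sites that already hold single WideChars only need a trivial conversion.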

I suppose you mean something like this:

That is not what I said, not even close.

My parser is UTF-8, which is why I would rather do everything in Unicode.

UTF-16 is also Unicode.  As is UTF-32.  And any other UTF, for that matter.

Since Lazarus supports the full Unicode space using UTF-8, I'd rather stick to that.

Then you have your work cut out for you, because processing UTF-8 is more difficult than processing UTF-16 when dealing with non-ASCII characters.  In UTF-8, only 7-bit ASCII codepoints use 1 codeunit (i.e., a single AnsiChar); non-ASCII codepoints use 2-4 codeunits (multiple AnsiChars).  In UTF-16, Unicode codepoints in the BMP use only 1 codeunit (a single WideChar); codepoints outside the BMP use 2 codeunits (2 WideChars).  This is why most Unicode-based programming languages and frameworks are based on UTF-16: it strikes a nicer balance between ease of use and memory usage than other UTFs do.  UTF-8 is generally better for reducing memory usage of Unicode data (not always, depending on the language being encoded), but at the cost of more complex processing.  UTF-32 is generally better for processing Unicode data, but at the cost of higher memory usage.

If your source Unicode data is already in UTF-16 (which most Unicode APIs are), it is best to leave it in UTF-16 and process it as-is instead of converting it to a less efficient format for processing.
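To make the codeunit counts concrete, here is a small sketch using LazUtils; the function names UnicodeToUTF8 and UTF8ToUTF16 are assumed to come from the LazUTF8 unit:

```pascal
program CodeUnitCounts;
{$mode objfpc}{$H+}
uses
  LazUTF8; // LazUtils: UnicodeToUTF8 / UTF8ToUTF16 assumed from here

var
  u8: string;          // UTF-8 encoded
  u16: UnicodeString;  // UTF-16 encoded
begin
  // U+00E9 (e with acute) is inside the BMP:
  u8  := UnicodeToUTF8($E9);
  u16 := UTF8ToUTF16(u8);
  WriteLn(Length(u8), ' ', Length(u16)); // 2 UTF-8 bytes, 1 WideChar

  // U+1F377 (wine glass) is outside the BMP:
  u8  := UnicodeToUTF8($1F377);
  u16 := UTF8ToUTF16(u8);
  WriteLn(Length(u8), ' ', Length(u16)); // 4 UTF-8 bytes, 2 WideChars (a surrogate pair)
end.
```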
« Last Edit: August 23, 2017, 09:57:08 pm by Remy Lebeau »
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Moving several units from UTF-16 to UTF-8
« Reply #17 on: August 24, 2017, 01:19:01 am »
taazz, you should really learn even the basics of Unicode before writing instructions.  %)

I thought I did. No? Well, let's learn something then.

1) encoding is the way the characters are represented in a string.
No, only codepoints are encoded. A "character" can mean many things. Details below.

I disagree: a character is a very specific thing, it can't mean a lot of things. The letter A is a character, a Chinese ideogram is a character; the tab key (ASCII 09) is not a character, it's a control "character", and a character only by association, not functionality. I.e. it happens to be part of the character set, so we call it a character for simplicity; it never really was one.

Quote
2) code point is a type/variable etc with the smallest possible length a variable length encoding can have so a code point in utf8 is 1 byte long in utf16 is 2 bytes long. So a code point on utf8 is a byte and in utf16 is a word.
No! You just explained a codeunit. It is the smallest "atom" in Unicode.

No, a code unit is a word or byte or dword. A code point has the size and type of the code unit but the value of the character in the table. But hey, let's go with your definition (your as in all of you, not you specifically).

Quote
3) character it has a minimum size of 1 code point and a maximum based on the encoding. In the case of a utf8 character it can have a size of 1 up to 6 code points (if memory serves me right), in the case of a utf16 character it has a size of 1 up to 2 code points.
Now you explained a codepoint. In a variable length encoding a codepoint consists of one or more codeunits. A "character" is a fuzzy term and can mean many things.

No, sorry. I define the size of a character; a code point does not have variable length.

Quote
4) length of a unicode string I will only use length to refer to the size of a string in code points so a length of 10 for a utf16 string can have from 5 to 10 characters with a memory size of 20 bytes, on a utf8 string it can have from 2 to 10 characters with a memory size of 10 bytes.
No.

Quote
Since a code point in utf16 is 2 bytes long this is to be expected, any random access to a string, accesses code points and not characters.
No, it accesses codeunits.

if you say so.

Quote
The only reliable way to access characters in any variable length encoding is to use a sequential access that would make most processing a bit slow though. I think that converting a random code point to a character is way easier in utf16 than it is in utf8.
Also codepoints require sequential access. It is not any easier in UTF-16 than it is in UTF-8 because they are both variable width encodings. For UCS-2 it would be easier but UCS-2 is obsolete now. More than half of codepoints are already outside BMP and the number grows as Unicode is extended. Even MS Windows has supported full Unicode for almost 18 years now.
I disagree: it's far easier to determine whether a code point is the start, the end, or neither in utf16; it is far more convoluted in utf8, although from a logic point of view utf8 only repeats a couple of steps a few more times.
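For what it's worth, the "does this codeunit start a codepoint?" test can be sketched in a few lines for both encodings (illustrative helper names, not from any unit mentioned in this thread):

```pascal
{ In UTF-8, continuation codeunits have the bit pattern 10xxxxxx;
  any other byte begins a codepoint. }
function Utf8StartsCodepoint(b: Byte): Boolean;
begin
  Result := (b and $C0) <> $80;
end;

{ In UTF-16, only a low (trail) surrogate in $DC00..$DFFF does NOT
  begin a codepoint. }
function Utf16StartsCodepoint(w: Word): Boolean;
begin
  Result := (w < $DC00) or (w > $DFFF);
end;
```

So both encodings are self-synchronizing; UTF-8 just has more byte classes to distinguish once you also want the sequence length.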

Quote
I have no idea; LCL 1.6.0 made some pretty aggressive changes to the UnicodeString data type that started a conversion of the affected application to C# at work, so I never had the chance to look closely at the underlying code. You might not need to convert at all. In any case I'll take a closer look at your test case, probably at the weekend.
Again totally false information. How is this possible?
LazUtils in Lazarus 1.6.0 made aggressive changes on AnsiString. UnicodeString is not affected.
I have no idea how or why; as already mentioned, I never looked at the problem closely enough, and I'm not inclined to look now either. I have enough problems finding time to work on Pascal as it is; I'd rather spend it on creating instead of correcting.
I have improved the wiki page that explains it. Please take a look:
 http://wiki.freepascal.org/Unicode_Support_in_Lazarus
The solution turned out to be amazingly compatible with Delphi at source level when a few simple rules are followed.

LazUtils package also has unit LazUnicode which allows writing encoding agnostic code. Such code works 100% in Delphi and in Lazarus, using both UTF-16 and UTF-8 encodings. Please take a look.

Thanks. I'll take a close look when I write the SQL editor for TurboBird; atsynedit sounds like a proper fit, and your unit will be a godsend to extend the support if the need arises.

---

This is copied from my post in Lazarus mailing list.
The word "character" can mean the following things when people communicate about encodings and Unicode:

1. CodeUnit — Represented by Pascal type "Char".

2. CodePoint — all the arguments about one encoding's supremacy over
another deal with CodePoints. Yes, UTF-8, UTF-16, UTF-32 etc. all only
encode CodePoints.

Sorry, I see no real difference between code point and code unit. For me they are equivalent.


3. Abstract Unicode character — like 'WINE GLASS'.
(There should have been the actual wineglass symbol but this forum SW does not support Unicode and I had to remove it.)

That is a character I agree.

4. Coded Unicode character — "U" + a unique number, like U+1F377. This
is what "character" means in Unicode Standard.

This should not be in this list at all; it is only an input/definition method of a character, and it is only relevant for parsers, the same way HTML encodes %charcode%, or the same way the two characters 1 and 5 represent the number fifteen in code.


5. User-perceived character — Whatever the end user thinks of as a character.
This is language dependent. For instance, ‘ch’ is two letters in
English but one letter in Czech and Slovak.
Many more complexities are involved here, including decomposed codepoints.

Those are two characters which are read as a single letter in Czech and Slovak. Are those characters also used alone? Do they occupy the same space as a single character or as two (visually; I'm mostly curious, it does not make any real difference)?

6. Grapheme cluster

Ok, this is unknown to me. Are you talking about the same thing that engkin posted a couple of posts back about compound letters?


7. Glyph — related to fonts.
Erm, are you talking about the visual representation of the character here? E.g. gothic letters, or Times, Roman etc.? If yes, those are not part of the encoding; let's not make things more complicated, for now at least.
So, number 4. is the official Unicode "character".
Otherwise the most useful meanings are 1. "CodeUnit" for programmers
and 5. "User-perceived character" for everybody else.
Yes and no. The character is definitely number 5; this is the target. The goal of the encoding is to define the ID of its visual character, and what each font is expected to show for that ID, with some leeway: e.g. a capital U should be recognizable as the letter capital U, a wine glass should be recognizable as a wine glass; you can use any wine glass you can think of in your font, but you should not use a beer mug.
Everything else is the encoding of that information.

There seems to be a bit of confusion about what is a character and what is a letter, which is understandable; after all, characters started their lives as representations of letters.

At this point I would really like to ask to keep the number of definitions as low as possible, but I have a feeling that I'm alone in this, so I'll stick to my guns for now, and I really hope you'll manage to change my mind (it means I learned something new; that is always fun).
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Moving several units from UTF-16 to UTF-8
« Reply #18 on: August 24, 2017, 01:26:29 am »
Some definitions to help us communicate better.
1) encoding is the way the characters are represented in a string.
2) code point is a type/variable etc with the smallest possible length a variable length encoding can have so a code point in utf8 is 1 byte long in utf16 is 2 bytes long. So a code point on utf8 is a byte and in utf16 is a word.
A code point is represented with a hex number preceded by U+.
Like: U+0301  COMBINING ACUTE ACCENT
It could occupy any number of bytes (not just one byte in utf8).

3) character it has a minimum size of 1 code point and a maximum based on the encoding. In the case of a utf8 character it can have a size of 1 up to 6 code points (if memory serves me right), in the case of a utf16 character it has a size of 1 up to 2 code points.
A character is one or more code points. Here is an example I had posted somewhere else before:


"U+1E09  LATIN SMALL LETTER C WITH CEDILLA AND ACUTE"

can be represented by 2 codepoints:

ḉ = ç + ́

"U+00E7  LATIN SMALL LETTER C WITH CEDILLA + U+0301  COMBINING ACUTE ACCENT"

or a different 2 codepoints:

ḉ = ć + ̧

"U+0107  LATIN SMALL LETTER C WITH ACUTE + U+0327  COMBINING CEDILLA"

Or maybe by 3 codepoints:

"U+0063  LATIN SMALL LETTER C + U+0327  COMBINING CEDILLA + U+0301  COMBINING ACUTE ACCENT"

The order could be different:

"U+0063  LATIN SMALL LETTER C + U+0301  COMBINING ACUTE ACCENT + U+0327  COMBINING CEDILLA"

Fortunately, 3 is not the maximum number for codepoints.

You are confusing characters with letters. Although I recognize the problems that arise from the above example and I do sympathize, that is a 2- or 3-character-long letter, not a character.

4) length of a unicode string I will only use length to refer to the size of a string in code points so a length of 10 for a utf16 string can have from 5 to 10 characters with a memory size of 20 bytes, on a utf8 string it can have from 2 to 10 characters with a memory size of 10 bytes.
Based on the previous correction, 20 bytes could represent 1 to 10 utf16 characters.
Do you have any example I could look at? That is most interesting.
« Last Edit: August 24, 2017, 01:41:54 am by taazz »
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Moving several units from UTF-16 to UTF-8
« Reply #19 on: August 24, 2017, 01:54:43 am »
Then you have your work cut out for you, because processing UTF-8 is more difficult than processing UTF-16 when dealing with non-ASCII characters.
That is irrelevant because he wants to support full Unicode. For that processing UTF-8 is not more difficult.

Quote
In UTF-16, Unicode codepoints in the BMP use only 1 codeunit (a single WideChar), codepoints outside the BMP use 2 codeunits (2 WideChars).  This is why most Unicode-based programming languages and frameworks are based on UTF-16.  It strikes a nicer balance between ease of use and memory usage than other UTFs do.
You also promote buggy programming that supports only BMP. How is this possible?  :(
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Moving several units from UTF-16 to UTF-8
« Reply #20 on: August 24, 2017, 02:38:50 am »
Ok, it just hit me. When you say code point, do you guys mean the "ID" of the character? E.g. 65 is the code point of capital A in ASCII? And code unit is the size in bytes, e.g. 1 byte for ASCII?
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: Moving several units from UTF-16 to UTF-8
« Reply #21 on: August 24, 2017, 07:24:00 am »
Man, I feel like Cypher in the Matrix: I wish I had taken the blue pill and stayed blissfully oblivious of the Unicode complexities  ;)

But here I am and here's the deal:

  • I've done quite a bit of reading on UTF-8 versus UTF-16, and I could summarize it as: the world versus Microsoft. It strikes me as another Emacs versus vi, or OS/2 versus Windows, or *nix versus Windows, etc. This conversation is sterile and leads nowhere.
  • Fact: The implementation of UTF-8 in Lazarus is complete. It may have some bugs, but at least it implements Unicode.
  • Fact: The implementation of Unicode in ATSynEdit is not: it's a subset of Unicode. Alexey states this fact quite clearly on his wiki and is not hiding it: "Not supported Unicode code points >0xFFFF, caret pos incorrect." I read it, did not understand it, and did the stupid thing: I ignored it.
  • Fact: TSynEdit's architecture is inscrutable. Please see the attached UML for TSynCustomFoldHighlighter: that small component of TSynEdit is more complex than all of TATSynEdit. The logic behind its highlighter is very difficult to follow, so for anyone trying to build a new syntax, this component is a killer.

Now that I understand a bit better what the issue is, I'd like to find a way forward. One of the blockers is that I can't seem to find a unit like Character but for UTF-8. So, how does one determine what a letter or a blank is in UTF-8? I gather the digits are safe :)

That's not a very difficult problem to solve: there ought to be a table of all known Unicode code points with some associated metadata that could say: is letter or is punctuation, and whether it is upper, lower, neutral, or some such thing.

I've looked around and couldn't find it. I'll keep looking. And JuhaManninen, I re-read both wiki pages you provided and did not find any reference to this issue.

This sucks. It's a distraction from where I wanted to be by now but I'm glad I found out about this early on.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Moving several units from UTF-16 to UTF-8
« Reply #22 on: August 24, 2017, 08:01:28 am »
So, how does one determine what a letter or a blank is in UTF-8?

UnicodeData.txt from ftp://www.unicode.org/Public/UCD/latest/ucd/
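One hedged way to use that data from Free Pascal without parsing UnicodeData.txt yourself is to decode the UTF-8 codepoint to UTF-16 and ask the RTL's Character unit, whose tables are generated from that file. UTF8ToUTF16 is assumed to come from LazUtils' LazUTF8 unit; Utf8CodePointIsLetter is a hypothetical helper name:

```pascal
uses
  LazUTF8,   // UTF8ToUTF16, assumed from LazUtils
  Character; // TCharacter.IsLetter etc., built from UnicodeData.txt

{ Tests whether the UTF-8 string CP, holding exactly one codepoint,
  is a letter. }
function Utf8CodePointIsLetter(const CP: string): Boolean;
var
  u16: UnicodeString;
begin
  u16 := UTF8ToUTF16(CP);  // one codepoint becomes 1 or 2 WideChars
  Result := (u16 <> '') and TCharacter.IsLetter(u16, 1);
end;
```

The same pattern should work for IsWhiteSpace, IsDigit and the other classification functions.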

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Moving several units from UTF-16 to UTF-8
« Reply #23 on: August 24, 2017, 09:05:16 am »
You are confusing characters with letters. Although I recognize the problem that arise from the above example and I do sympathize, that is a 2 or 3 character long letter, not a character.
Let me help you with this. A character could represent a letter, a number, an emoji... etc.

Do you have any example I could look at? that is most interesting.
Hold a second.... Here you go:
Code: Pascal
uses
  ..., Windows, LazUTF8; // LazUTF8 (LazUtils) provides UnicodeToUTF8

var
  i: integer;
  us: UnicodeString;
  s: string;
...
  s := UnicodeToUTF8($0041);      // 'A'
  for i := $301 to $301+8 do
    s := s + UnicodeToUTF8(i);    // append nine combining marks

  us := s;                        // UTF-8 to UTF-16 conversion
  MessageBoxW(0, @us[1], 'Test', 0);


Here is the result:

Á̂̃̄̅̆̇̈̉

It looks ugly, I know. It's meant to be ugly. Try to copy it. Can you copy the "letter" A alone?

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Moving several units from UTF-16 to UTF-8
« Reply #24 on: August 24, 2017, 11:13:04 am »
Ok, it just hit me. When you say code point, do you guys mean the "ID" of the character? E.g. 65 is the code point of capital A in ASCII?
"ID" is the "U" + a unique number, like U+1F377. I called it "character" in the list earlier, but that is confusing again. Maybe it should be called "codepoint", and then the implemented encoded things should be called "encoded codepoints".
Anyway, only codepoints are encoded, and their width varies when either UTF-8 or UTF-16 is used. This is very unambiguous.

Quote
and Code unit is the size in bytes ee 1 byte for ascii?
No, in practical terms a codeunit is the Pascal type Char, either AnsiChar or WideChar depending on compiler mode. When you iterate over a string in a for-loop and do:
Code: Pascal
  ch := S[i];
then you get a codeunit. It is very useful also with variable width encodings. For example, code like this works with both UTF-8 and UTF-16:
Code: Pascal
procedure ParseAscii(Txt: string);
var
  i: Integer;
begin
  for i := 1 to Length(Txt) do
    case Txt[i] of
      '(': PushOpenBracketPos(i);
      ')': HandleBracketText(i);
    end;
end;
For the same reason most XML and HTML parsers continue to work. All tags are ASCII and the data between them is just copied as-is.

engkin gave you some good examples.
BTW, if you call a codepoint a "character", then what do you call the combining codepoints?

From the earlier post:
Quote
Sorry I see no real difference between code point and code unit. For me they are equivalent.
A codeunit has a fixed width (Char).
An encoded codepoint has variable width when either UTF-8 or UTF-16 is used. A codepoint is then composed of one or more codeunits. With UTF-16 the 2-codeunit case is called a "surrogate pair".
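The codeunit/codepoint distinction is easy to see in code; a sketch assuming UnicodeToUTF8 and UTF8Length from LazUtils' LazUTF8 unit:

```pascal
program UnitsVsPoints;
{$mode objfpc}{$H+}
uses
  LazUTF8; // UnicodeToUTF8 / UTF8Length assumed from LazUtils

var
  s: string;
begin
  s := UnicodeToUTF8($1F377); // one codepoint, outside the BMP
  WriteLn(Length(s));         // 4 codeunits (UTF-8 bytes)
  WriteLn(UTF8Length(s));     // 1 codepoint
end.
```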

Quote
Yes and no. The character is definitely number 5 ("User-perceived character"); this is the target. The goal of the encoding is to define the ID of its visual character, and what each font is expected to show for that ID, with some leeway: e.g. a capital U should be recognizable as the letter capital U, a wine glass should be recognizable as a wine glass; you can use any wine glass you can think of in your font, but you should not use a beer mug.
Everything else is the encoding of that information.
You are clearly ignorant of the complexities of Unicode. Only a codepoint is encoded, not a "User-perceived character".
A "User-perceived character" can be composed of many codepoints using complex rules. The combining codepoints for accents are only a part of those rules.

The funny thing is that encodings are just a small part of Unicode and only relevant with codepoints.
They are easy to get right regardless of encoding. The complexity of Unicode is elsewhere.
Yet, people continue to argue about encodings like it was something important. Why?

When I listed the 7 possible meanings for "character" I had 2 purposes.
1. To show how ambiguous the word is with Unicode.
2. To show how small a role the encodings play.

Getting the codepoints right should be considered the minimum requirement for Unicode support.
Still people even in this forum encourage writing buggy code that supports only BMP, thus ignoring ~ half of codepoints.
Do you guys have the same attitude for other bugs? For example is a bookkeeping app good if it calculates the sums wrong only sometimes?   %)
« Last Edit: August 25, 2017, 12:28:20 am by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

 
