
Author Topic: Moving several units from UTF-16 to UTF-8  (Read 8935 times)

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Moving several units from UTF-16 to UTF-8
« on: August 22, 2017, 05:14:29 pm »
I am leveraging ATSynEdit as the editor in an app I'm writing.
ATSynEdit relies on UnicodeString as the main vehicle for strings, but it only handles Unicode characters that fit in exactly two bytes (a single UTF-16 code unit). Therefore, it is not a complete solution for Unicode.

I'm faced with a few options:
  • Leverage TSynEdit: Very difficult to write a highlighter from scratch with TSynEdit. TATSynEdit has a clearner architecture and is easier to handle. I have another thread within TSynEdit where I tried to understand how this works and how to write a highlighter and it is not straightfoward or easy.
  • Edit in UTF-16 and transform to UTF-8 whenever I need to parse/compile or highlight: It's workable, but it feels kludgy, and I'll need to understand the performance impact (see the conversion sketch after this list).
  • Write my own editor in UTF-8: That's the nuclear option that I'd rather avoid. I would prefer to leverage what already exists.
  • Refactor TATSynEdit to work with UTF-8: I need to determine the level of effort required to do so. In this regard, does anyone have a list of steps I would need to follow to convert a unit from UTF-16 to UTF-8? Better yet, does anyone have an automated converter that can get me partially there?
  • Any other options I'm not thinking of?
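
For the second option, the conversion itself is just the standard RTL calls; roughly like this (a sketch only, the procedure name is made up):

Code: Pascal
// Hypothetical glue between the UTF-16 editor buffer and a UTF-8 parser/highlighter.
// UTF8Encode/UTF8Decode live in the System unit.
procedure ParseEditorText(const TextU16: UnicodeString);
var
  TextU8: UTF8String;
begin
  TextU8 := UTF8Encode(TextU16);   // UTF-16 -> UTF-8 for the parser
  // ... run the UTF-8 based parser/highlighter on TextU8 ...
  // Anything that has to go back into the editor takes the inverse route:
  // EditorText := UTF8Decode(TextU8);
end;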


taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Moving several units from UTF-16 to UTF-8
« Reply #1 on: August 22, 2017, 05:44:56 pm »
I am leveraging ATSynEdit as the editor in an app I'm writing.
ATSynEdit relies on UnicodeString as the main vehicle for strings, but it only handles Unicode characters that fit in exactly two bytes (a single UTF-16 code unit). Therefore, it is not a complete solution for Unicode.
You lost me here. Do you have any code fragments to help me understand what you are talking about?
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: Moving several units from UTF-16 to UTF-8
« Reply #2 on: August 22, 2017, 06:44:22 pm »
Sure thing Taazz:

The following code is taken from a unit called ATStringProc
Code: Pascal
type
  atString = UnicodeString;
  atChar = WideChar;
  PatChar = PWideChar;

function SCharUpper(ch: atChar): atChar;
function SCharLower(ch: atChar): atChar;
function SCaseTitle(const S, SWordChars: atString): atString;
function SCaseInvert(const S: atString): atString;
function SCaseSentence(const S, SWordChars: atString): atString;

I would need to switch atChar and atString from WideChar and UnicodeString to Char and String. Then I would need to go through each of these functions and determine what I need to change to make them work with UTF-8.

For instance: SCharUpper would be changed from
Code: Pascal
function SCharUpper(ch: atChar): atChar;
begin
  Result := UnicodeUpperCase(ch)[1];
end;

To
Code: Pascal
function SCharUpper(ch: atChar): atChar;
begin
  Result := UpperCase(ch)[1];
end;

That's a simple example, of course. I need to scan each of the units, determine where and what gets impacted, and refactor accordingly.
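
A more realistic UTF-8 version would probably have to take a whole codepoint as a string rather than a single Char, e.g. with UTF8UpperCase from the LazUTF8 unit (just a sketch, untested):

Code: Pascal
uses LazUTF8;

// ch is expected to hold one complete UTF-8 codepoint (1 to 4 bytes)
function SCharUpper(const ch: String): String;
begin
  Result := UTF8UpperCase(ch);
end;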

Does this make sense or am I missing something here?

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Moving several units from UTF-16 to UTF-8
« Reply #3 on: August 22, 2017, 06:52:44 pm »
Well, I was asking for the code fragments that led you to the conclusion that ATSynEdit does not support full Unicode. My thinking is that it might be easier to extend it to add support for the missing characters/planes; and if it is not, you will at least have a starting point for noting which parts of the code need to be converted to UTF-8, and with the existing UTF-8 support in Lazarus you might be able to convert it easily. In any case, I'll have to take a closer look at ATSynEdit in the future.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: Moving several units from UTF-16 to UTF-8
« Reply #4 on: August 22, 2017, 07:38:05 pm »
I did confirm with Alexey that ATSynEdit does not support full Unicode.

The trouble for me is that the rest of my framework is UTF-8 including the highlighter and the parser. That's why I was asking the question: what would it take to convert from UTF-16 to UTF-8? It's not so much whether TATSynEdit supports full unicode or not but rather that UTF-8 is where Lazarus is.

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Moving several units from UTF-16 to UTF-8
« Reply #5 on: August 22, 2017, 08:58:18 pm »
I did confirm with Alexey that ATSynEdit does not support full Unicode.

The trouble for me is that the rest of my framework is UTF-8 including the highlighter and the parser. That's why I was asking the question: what would it take to convert from UTF-16 to UTF-8? It's not so much whether TATSynEdit supports full unicode or not but rather that UTF-8 is where Lazarus is.
There are no general guidelines one can follow to convert a library from one encoding to another; each case is unique, as far as I know. In this case I would ask Alexey what the key points in his code are that need to be changed for UTF-8. Hopefully they are not many; in any other case you will have to analyze the code yourself and find out. Since I'm on the opposite side from yours, i.e. my framework is UTF-16 and I simply dislike UTF-8, I'll take a closer look at ATSynEdit in the near future. I might be able to answer the question then, or even convert it for you as a learning experience; I just can't answer the "when" question with any certainty.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: Moving several units from UTF-16 to UTF-8
« Reply #6 on: August 22, 2017, 11:54:03 pm »
Thanks for the feedback, Taazz. This makes sense.
By the way, Alexey indicated that when accessing a random character in a string or when copying, he is assuming that every character is a wide-char. Also, when traversing a string, he is using a for loop which assumes every character occupies exactly two bytes.

In the meantime, I've written a quick test harness to determine what the penalty would be to convert between UTF-8 and UTF-16. I've attached it to this post in case it is of use to others. It uses both SynEdit and ATSynEdit.
It basically creates 5,000 random UTF-16 strings, translates them 50 times to UTF-8, and displays them inside the TSynEdit component. It also does the inverse.
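
The measuring loop boils down to something like this (a simplified sketch; the attached project uses random Unicode text and also pushes the results into the editors):

Code: Pascal
program ConvBench;
{$mode objfpc}{$H+}
uses SysUtils;
var
  U16: array of UnicodeString;
  U8: UTF8String;
  Back: UnicodeString;
  i, j: Integer;
  T0: QWord;
begin
  SetLength(U16, 5000);
  for i := 0 to High(U16) do
    U16[i] := 'Sample line #' + IntToStr(i);   // the real harness uses random UTF-16 strings

  T0 := GetTickCount64;
  for j := 1 to 50 do
    for i := 0 to High(U16) do
    begin
      U8   := UTF8Encode(U16[i]);   // UTF-16 -> UTF-8
      Back := UTF8Decode(U8);       // and back again
    end;
  WriteLn('Average round trip: ', (GetTickCount64 - T0) / 50 : 0 : 2, ' ms');
end.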

I ran each translation loop 50 times to average out the duration. Basically it amounts to 45 ms for a round-trip translation, which is about 0.0045 ms per string (on average).

This penalty will be negligible in most cases. 

Bottom line: the option of translating is still the most expedient until you, I, or someone else ports ATSynEdit over to UTF-8.

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Moving several units from UTF-16 to UTF-8
« Reply #7 on: August 23, 2017, 02:06:54 am »
Some definitions to help us communicate better.
1) encoding is the way the characters are represented in a string.
2) code point is a type/variable etc with the smallest possible length a variable length encoding can have so a code point in utf8 is 1 byte long in utf16 is 2 bytes long. So a code point on utf8 is a byte and in utf16 is a word.
3) character it has a minimum size of 1 code point and a maximum based on the encoding. In the case of a utf8 character it can have a size of 1 up to 6 code points (if memory serves me right), in the case of a utf16 character it has a size of 1 up to 2 code points.
4) length of a unicode string I will only use length to refer to the size of a string in code points so a length of 10 for a utf16 string can have from 5 to 10 characters with a memory size of 20 bytes, on a utf8 string it can have from 2 to 10 characters with a memory size of 10 bytes.
Thanks for the feedback, Taazz. This makes sense.
By the way, Alexey indicated that when accessing a random character in a string or when copying, he is assuming that every character is a wide-char. Also, when traversing a string, he is using a for loop which assumes every character occupies exactly two bytes.
Since a code point in utf16 is 2 bytes long this is to be expected, any random access to a string, accesses code points and not characters. The only reliable way to access characters in any variable length encoding is to use a sequential access that would make most processing a bit slow though. I think that converting a random code point to a character is way easier in utf16 than it is in utf8.

In the meantime, I've written a quick test harness to determine what the penalty would be to convert between UTF-8 and UTF-16. I've attached it to this post in case it is of use to others. It uses both SynEdit and ATSynEdit.
It basically creates 5,000 random UTF-16 strings, translates them 50 times to UTF-8, and displays them inside the TSynEdit component. It also does the inverse.

I ran each translation loop 50 times to average out the duration. Basically it amounts to 45 ms for a round-trip translation, which is about 0.0045 ms per string (on average).

This penalty will be negligible in most cases. 

Bottom line: the option of translating is still the most expedient until you, I, or someone else ports ATSynEdit over to UTF-8.
I have no idea; LCL 1.6.0 made some pretty aggressive changes on the UnicodeString data type that started a conversion of the affected application to C# at work, so I never had the chance to look closely at the underlying code. You might not need to convert at all. In any case I'll take a closer look at your test case, probably at the weekend.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1312
    • Lebeau Software
Re: Moving several units from UTF-16 to UTF-8
« Reply #8 on: August 23, 2017, 02:47:14 am »
The following code is taken from a unit called ATStringProc

Code: Pascal
type
  atString = UnicodeString;
  atChar = WideChar;
  PatChar = PWideChar;

function SCharUpper(ch: atChar): atChar;
function SCharLower(ch: atChar): atChar;
function SCaseTitle(const S, SWordChars: atString): atString;
function SCaseInvert(const S: atString): atString;
function SCaseSentence(const S, SWordChars: atString): atString;

I would need to switch atChar and atString from WideChar and UnicodeString to Char and String.

Why not just update the atChar functions to work with UTF-16 strings instead?  And then you can update the call sites to take UTF-16 surrogates into account when creating substrings to pass to the functions.  That is going to be far less work than switching everything to UTF-8.  You would have to change the functions to use strings anyway to handle 2-4 byte UTF-8 sequences, so you may as well do the same work to handle 2- and 4- byte UTF-16 sequences instead.

For instance: SCharUpper would be changed from
Code: Pascal
function SCharUpper(ch: atChar): atChar;
begin
  Result := UnicodeUpperCase(ch)[1];
end;

To
Code: Pascal
function SCharUpper(ch: atChar): atChar;
begin
  Result := UpperCase(ch)[1];
end;

Or simply:

Code: Pascal
function SCharUpper(S: atString): atString;
begin
  Result := UnicodeUpperCase(S);
end;

And then pass in a complete UTF-16 string for a single "character".
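
For example, the call site could grab the complete sequence with a small helper along these lines (just a sketch; the helper name is made up and it is not from ATSynEdit):

Code: Pascal
// Sketch: returns the whole UTF-16 sequence (1 or 2 codeunits) starting at codeunit index i
function CopyCharAt(const S: UnicodeString; i: Integer): UnicodeString;
begin
  if (Ord(S[i]) >= $D800) and (Ord(S[i]) <= $DBFF) and (i < Length(S)) then
    Result := Copy(S, i, 2)    // high surrogate: include its low surrogate
  else
    Result := Copy(S, i, 1);   // BMP codepoint: a single codeunit
end;

Calling SCharUpper(CopyCharAt(Line, i)) and then advancing i by the length of the returned string keeps the loop surrogate-safe.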
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Moving several units from UTF-16 to UTF-8
« Reply #9 on: August 23, 2017, 03:51:06 am »
Some definitions to help us communicate better.
1) encoding is the way the characters are represented in a string.
2) code point is a type/variable etc with the smallest possible length a variable length encoding can have so a code point in utf8 is 1 byte long in utf16 is 2 bytes long. So a code point on utf8 is a byte and in utf16 is a word.
A code point is represented with a hex number preceded by U+.
Like: U+0301  COMBINING ACUTE ACCENT
It could occupy any number of bytes (not just one byte in utf8).

3) character it has a minimum size of 1 code point and a maximum based on the encoding. In the case of a utf8 character it can have a size of 1 up to 6 code points (if memory serves me right), in the case of a utf16 character it has a size of 1 up to 2 code points.
A character is one or more code points. Here is an example I had posted somewhere else before:


"U+1E09  LATIN SMALL LETTER C WITH CEDILLA AND ACUTE"

can be represented by 2 codepoints:

ḉ = ç + ́

"U+00E7  LATIN SMALL LETTER C WITH CEDILLA + U+0301  COMBINING ACUTE ACCENT"

or a different 2 codepoints:

ḉ = ć + ̧

"U+0107  LATIN SMALL LETTER C WITH ACUTE + U+0327  COMBINING CEDILLA"

Or maybe by 3 codepoints:

"U+0063  LATIN SMALL LETTER C + U+0327  COMBINING CEDILLA + U+0301  COMBINING ACUTE ACCENT"

The order could be different:

"U+0063  LATIN SMALL LETTER C + U+0301  COMBINING ACUTE ACCENT + U+0327  COMBINING CEDILLA"

Fortunately, 3 is not the maximum number for codepoints.

4) length of a unicode string I will only use length to refer to the size of a string in code points so a length of 10 for a utf16 string can have from 5 to 10 characters with a memory size of 20 bytes, on a utf8 string it can have from 2 to 10 characters with a memory size of 10 bytes.
Based on the previous correction, 20 bytes could represent 1 to 10 utf16 characters.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Moving several units from UTF-16 to UTF-8
« Reply #10 on: August 23, 2017, 04:24:29 am »
Why not just update the atChar functions to work with UTF-16 strings instead?  And then you can update the call sites to take UTF-16 surrogates into account when creating substrings to pass to the functions.  That is going to be far less work than switching everything to UTF-8.  You would have to change the functions to use strings anyway to handle 2-4 byte UTF-8 sequences, so you may as well do the same work to handle 2- and 4- byte UTF-16 sequences instead.

Would not switching to UTF-32 be easier?

Thaddy

  • Hero Member
  • *****
  • Posts: 14205
  • Probably until I exterminate Putin.
Re: Moving several units from UTF-16 to UTF-8
« Reply #11 on: August 23, 2017, 09:09:21 am »
Why not just update the atChar functions to work with UTF-16 strings instead?  And then you can update the call sites to take UTF-16 surrogates into account when creating substrings to pass to the functions.  That is going to be far less work than switching everything to UTF-8.  You would have to change the functions to use strings anyway to handle 2-4 byte UTF-8 sequences, so you may as well do the same work to handle 2- and 4- byte UTF-16 sequences instead.

Would not switching to UTF-32 be easier?
Yup. That's what I thought....or stick to UCS2

He wants to transfer from a kludge (after 1992) to an even bigger kludge..
Maybe his platform is too fast? >:( :D
« Last Edit: August 23, 2017, 09:11:01 am by Thaddy »
Specialize a type, not a var.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: Moving several units from UTF-16 to UTF-8
« Reply #12 on: August 23, 2017, 10:35:35 am »
taazz, you should really learn even the basics of Unicode before writing instructions.  %)

1) encoding is the way the characters are represented in a string.
No, only codepoints are encoded. A "character" can mean many things. Details below.

Quote
2) code point is a type/variable etc with the smallest possible length a variable length encoding can have so a code point in utf8 is 1 byte long in utf16 is 2 bytes long. So a code point on utf8 is a byte and in utf16 is a word.
No! You just explained a codeunit. It is the smallest "atom" in Unicode.

Quote
3) character it has a minimum size of 1 code point and a maximum based on the encoding. In the case of a utf8 character it can have a size of 1 up to 6 code points (if memory serves me right), in the case of a utf16 character it has a size of 1 up to 2 code points.
Now you explained a codepoint. In a variable length encoding a codepoint consists of one or more codeunits. A "character" is a fuzzy term and can mean many things.

Quote
4) length of a unicode string I will only use length to refer to the size of a string in code points so a length of 10 for a utf16 string can have from 5 to 10 characters with a memory size of 20 bytes, on a utf8 string it can have from 2 to 10 characters with a memory size of 10 bytes.
No.

Quote
Since a code point in utf16 is 2 bytes long this is to be expected, any random access to a string, accesses code points and not characters.
No, it accesses codeunits.

Quote
The only reliable way to access characters in any variable length encoding is to use a sequential access that would make most processing a bit slow though. I think that converting a random code point to a character is way easier in utf16 than it is in utf8.
Also codepoints require sequential access. It is not any easier in UTF-16 than it is in UTF-8 because they are both variable width encodings. For UCS-2 it would be easier but UCS-2 is obsolete now. More than half of codepoints are already outside BMP and the number grows as Unicode is extended. Even MS Windows has supported full Unicode for almost 18 years now.

Quote
I have no idea; LCL 1.6.0 made some pretty aggressive changes on the UnicodeString data type that started a conversion of the affected application to C# at work, so I never had the chance to look closely at the underlying code. You might not need to convert at all. In any case I'll take a closer look at your test case, probably at the weekend.
Again totally false information. How is this possible?
LazUtils in Lazarus 1.6.0 made aggressive changes on AnsiString. UnicodeString is not affected.
I have improved the wiki page that explains it. Please take a look:
 http://wiki.freepascal.org/Unicode_Support_in_Lazarus
The solution turned out to be amazingly compatible with Delphi at source level when a few simple rules are followed.

LazUtils package also has unit LazUnicode which allows writing encoding agnostic code. Such code works 100% in Delphi and in Lazarus, using both UTF-16 and UTF-8 encodings. Please take a look.
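
For example, something like the following iterates codepoints regardless of the underlying encoding (a quick sketch from memory; see the unit and the wiki page for the exact names):

Code: Pascal
uses LazUnicode;

procedure ShowCodePoints(const s: String);
var
  cp: String;
begin
  for cp in s do                               // the LazUnicode enumerator yields codepoints
    WriteLn(cp);
  WriteLn('Codepoints: ', CodePointLength(s)); // function name from memory
end;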

---

This is copied from my post in Lazarus mailing list.
The word "character" can mean the following things when people communicate about encodings and Unicode:

1. CodeUnit — Represented by Pascal type "Char".

2. CodePoint — all the arguments about one encoding's supremacy over another deal with CodePoints. Yes, UTF-8, UTF-16, UTF-32 etc. all only encode CodePoints.

3. Abstract Unicode character — like 'WINE GLASS'.
(There should have been the actual wineglass symbol but this forum SW does not support Unicode and I had to remove it.)

4. Coded Unicode character — "U" + a unique number, like U+1F377. This is what "character" means in Unicode Standard.

5. User-perceived character — Whatever the end user thinks of as a character. This is language dependent. For instance, ‘ch’ is two letters in English but one letter in Czech and Slovak. Many more complexities are involved here, including decomposed codepoints.

6. Grapheme cluster

7. Glyph — related to fonts.

So, number 4. is the official Unicode "character".
Otherwise the most useful meanings are 1. "CodeUnit" for programmers and 5. "User-perceived character" for everybody else.
« Last Edit: August 23, 2017, 10:42:51 am by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: Moving several units from UTF-16 to UTF-8
« Reply #13 on: August 23, 2017, 01:00:42 pm »
... my framework is UTF-16 and I simply dislike UTF-8
I wonder why. They are equally easy or difficult to use because they are both variable width encodings. I like them both when they are implemented correctly.
The problem is that many people support only UCS-2 but then falsely claim they support UTF-16. That is just plain wrong!
There are many programs that are advertised as Unicode aware but are not. The reason is always this UTF-16 / UCS-2 issue, either by ignorance or by laziness.
It is now so common that it is really annoying!
There is no improvement in sight. The wrong, buggy usage of UTF-16 is widely advertised and encouraged around the internet, and even on this forum.

ATSynEdit was discussed here. It may not be widely used but this Simple Machines Forum SW is widely used. It does not support Unicode! In my last post I tried to include an emoji outside BMP but could not.

I would like to advertise my LazUnicode unit which supports both UTF-8 and UTF-16.
Then you must always get codepoints right, because multi-byte codepoints are so common in UTF-8.
There are many levels of Unicode support because it is a complex standard, but doing codepoints right should be considered the minimum requirement for Unicode support.
The next level is combining codepoints (accents etc.).
« Last Edit: August 23, 2017, 01:15:32 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Moving several units from UTF-16 to UTF-8
« Reply #14 on: August 23, 2017, 06:35:16 pm »
So, number 4. is the official Unicode "character".
Otherwise the most useful meanings are 1. "CodeUnit" for programmers and 5. "User-perceived character" for everybody else.
For programmers:
Using UTF8 or UTF16, a string S is indexed based on its CodeUnits. S[N] is the Nth CodeUnit.

With UTF32 a string S is indexed based on its CodePoints.
S[N] is the Nth CodePoint.
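
A quick way to see the difference (a sketch in FPC):

Code: Pascal
procedure ShowLengths;
var
  u16: UnicodeString;
  u8: UTF8String;
  u32: UCS4String;
begin
  // U+1F377, the wine glass mentioned above, needs a surrogate pair in UTF-16
  SetLength(u16, 2);
  u16[1] := WideChar($D83C);              // S[N] addresses codeunits,
  u16[2] := WideChar($DF77);              // so the pair occupies indexes 1 and 2
  u8  := UTF8Encode(u16);
  u32 := UnicodeStringToUCS4String(u16);
  WriteLn(Length(u8));                    // 4 codeunits (bytes)
  WriteLn(Length(u16));                   // 2 codeunits (words)
  WriteLn(Length(u32) - 1);               // 1 codepoint (UCS4String keeps a trailing #0)
end;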
« Last Edit: August 23, 2017, 06:39:56 pm by engkin »

 
