Topic: Entering large Unicode numbers (Read 15987 times)

Thaddy

  • Hero Member
  • *****
  • Posts: 8197
Re: Entering large Unicode numbers
« Reply #15 on: January 17, 2017, 01:35:44 pm »
Quote
thaddy,

So what would be the math for that?

Rick

The math is easy. UTF8Key should be capable of holding 4 bytes, not two. UnicodeToUTF8 itself handles the surrogate pair correctly.
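
To make the "4 bytes, not two" point concrete, here is a minimal sketch of how many bytes UTF-8 needs per code point. It is only an illustration, not code taken from LazUTF8 or the FPC RTL, and the name UTF8ByteCount is made up for this example.

```pascal
program Utf8Len;
{$mode objfpc}{$H+}

// Hypothetical helper: number of bytes UTF-8 needs for one Unicode
// code point. Returns 0 for values that are not encodable
// (surrogates $D800..$DFFF and anything above $10FFFF).
function UTF8ByteCount(CodePoint: Cardinal): Integer;
begin
  case CodePoint of
    $0..$7F:         Result := 1;
    $80..$7FF:       Result := 2;
    $800..$D7FF,
    $E000..$FFFF:    Result := 3;
    $10000..$10FFFF: Result := 4;
  else
    Result := 0;
  end;
end;

begin
  WriteLn(UTF8ByteCount($41));     // 1 ('A')
  WriteLn(UTF8ByteCount($20AC));   // 3 (euro sign)
  WriteLn(UTF8ByteCount($10900));  // 4 (a code point outside the BMP)
end.
```

Any buffer meant to hold one UTF-8 encoded character therefore needs room for at least four bytes.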
Read the manuals and if you are a professional get a proper education in computer science. Makes the forum a lot cleaner.

rick2691

  • Sr. Member
  • ****
  • Posts: 374
Re: Entering large Unicode numbers
« Reply #16 on: January 17, 2017, 01:44:11 pm »
I don't think that I would call it "correctly".

Additionally, I did another series and got different increments. This one decrements by 2's for a while, then decrements by 1, then it starts incrementing by 2's. When reloaded, it still displays the characters in the same improper order as they were before being saved.

\u-10238?\u-8960?\u-10238?\u-8958?\u-10238?\u-8956?\u-10238?\u-8954?\u-10238?\u-8953?\u-10238?\u-8955?\u-10238?\u-8957?\u-10238?\u-8959?

Rick
Windows 10, LAZ 1.6.4, FPC 3.0.2, SVN 54278, i386-win32-win32/win64, forms use windows unit

Thaddy

« Reply #17 on: January 17, 2017, 01:47:20 pm »
Aha. Found it. UnicodeToUTF8Inline in LazUTF8 is buggy and CAN'T handle that code point. UnicodeToUTF8 calls UnicodeToUTF8Inline...
It can't handle high surrogate pairs.
The UnicodeToUTF8 routines from FPC itself, in ustrings, are correct and CAN handle that codepoint.
These can handle high surrogate pairs.

Just examining the code in ustrings.inc and in LazUTF8 immediately makes clear where the bug is, btw:
Right way:
Code: Pascal
// ustrings.inc snippet:
    $800..$d7ff,$e000..$ffff:
      begin
        if j+2>=MaxDestBytes then
          break;
        Dest[j]:=char($e0 or (lw shr 12));
        Dest[j+1]:=char($80 or ((lw shr 6) and $3f));
        Dest[j+2]:=char($80 or (lw and $3f));
        inc(j,3);
      end;
    $d800..$dbff:
      {High Surrogates}
      begin
        if j+3>=MaxDestBytes then
          break;
        if (i+1<sourcechars) and
           (word(Source[i+1]) >= $dc00) and
           (word(Source[i+1]) <= $dfff) then
          begin
            { $d7c0 is ($d800 - ($10000 shr 10)) }
            lw:=(longword(lw-$d7c0) shl 10) + (ord(source[i+1]) xor $dc00);
            Dest[j]:=char($f0 or (lw shr 18));
            Dest[j+1]:=char($80 or ((lw shr 12) and $3f));
            Dest[j+2]:=char($80 or ((lw shr 6) and $3f));
            Dest[j+3]:=char($80 or (lw and $3f));
            inc(j,4);
            inc(i);
          end;
      end;
    end;
  inc(i);
end;

VS wrong way:
Code: Pascal
// lazutf8 snippet:
  $800..$ffff:
    begin
      Result:=3;
      Buf[0]:=char(byte($e0 or (CodePoint shr 12)));
      Buf[1]:=char(byte((CodePoint shr 6) and $3f) or $80);
      Buf[2]:=char(byte(CodePoint and $3f) or $80);
    end;
  $10000..$10ffff:
    begin
      Result:=4;
      Buf[0]:=char(byte($f0 or (CodePoint shr 18)));
      Buf[1]:=char(byte((CodePoint shr 12) and $3f) or $80);
      Buf[2]:=char(byte((CodePoint shr 6) and $3f) or $80);
      Buf[3]:=char(byte(CodePoint and $3f) or $80);
    end;
else
  Result:=0;

Spot the Loony ;)

Feel free to use it for your bug report...

« Last Edit: January 17, 2017, 02:03:10 pm by Thaddy »

rick2691

« Reply #18 on: January 17, 2017, 02:01:53 pm »
Good work. I also found this in RTF manual...

Quote
\uN

This keyword represents a single Unicode character that has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number.

This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly.

When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.

As with all RTF keywords, a keyword-terminating space may be present (before the ANSI characters) that is not counted in the characters to skip. While this is not likely to occur (or recommended), a \bin keyword, its argument, and the binary data that follows are considered one character for skipping purposes. If an RTF scope delimiter character (that is, an opening or closing brace) is encountered while scanning skippable data, the skippable data is considered to be ended before the delimiter. This makes it possible for a reader to perform some rudimentary error recovery. To include an RTF delimiter in skippable data, it must be represented using the appropriate control symbol (that is, escaped with a backslash,) as in plain text. Any RTF control word or symbol is considered a single character for the purposes of counting skippable characters.

An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with a count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.

RTF control words generally accept signed 16-bit numbers as arguments. For this reason, Unicode values greater than 32767 must be expressed as negative numbers.

\ucN

This keyword represents the number of bytes corresponding to a given \uN Unicode character. This keyword may be used at any time, and values are scoped like character properties. That is, a \ucN keyword applies only to text following the keyword, and within the same (or deeper) nested braces. On exiting the group, the previous \uc value is restored. The reader must keep a stack of counts seen and use the most recent one to skip the appropriate number of characters when it encounters a \uN keyword. When leaving an RTF group that specified a \uc value, the reader must revert to the previous value. A default of 1 should be assumed if no \uc keyword has been seen in the current or outer scopes.

A common practice is to emit no ANSI representation for Unicode characters within a Unicode destination context (that is, inside a \ud destination). Typically, the destination will contain a \uc0 control sequence. There is no need to reset the count on leaving the \ud destination, because the scoping rules will ensure the previous value is restored.

The statement "RTF control words generally accept signed 16-bit numbers as arguments. For this reason, Unicode values greater than 32767 must be expressed as negative numbers" explains the "\u-" codes. So the double set of \u-10238? followed by another \u-xxxxx? code is a pairing, and each negative number is the unsigned 16-bit value minus 65536 (for example, 55298 - 65536 = -10238).
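
As a worked example of that rule, here is a minimal sketch that turns the first pair from the RTF above (\u-10238 and \u-8960) back into UTF-16 code units and then into a single code point. The function names are made up for this illustration and are not part of RichMemo, LazUTF8 or the RTL.

```pascal
program RtfSurrogates;
{$mode objfpc}{$H+}

// Hypothetical helper: map a signed 16-bit RTF \uN argument back to
// the unsigned UTF-16 code unit it stands for.
function RtfArgToCodeUnit(N: SmallInt): Word;
begin
  if N < 0 then
    Result := Word(N + 65536)
  else
    Result := Word(N);
end;

// Hypothetical helper: combine a high and a low surrogate into one
// code point. Returns 0 if the values are not a valid surrogate pair.
function SurrogatesToCodePoint(HighUnit, LowUnit: Word): Cardinal;
begin
  if (HighUnit >= $D800) and (HighUnit <= $DBFF) and
     (LowUnit >= $DC00) and (LowUnit <= $DFFF) then
    Result := $10000 + (Cardinal(HighUnit - $D800) shl 10)
                     + Cardinal(LowUnit - $DC00)
  else
    Result := 0;
end;

var
  HighUnit, LowUnit: Word;
begin
  HighUnit := RtfArgToCodeUnit(-10238);
  LowUnit  := RtfArgToCodeUnit(-8960);
  WriteLn(HexStr(LongInt(HighUnit), 4), ' ', HexStr(LongInt(LowUnit), 4)); // D802 DD00
  WriteLn(HexStr(LongInt(SurrogatesToCodePoint(HighUnit, LowUnit)), 5));   // 10900
end.
```

So the repeated \u-10238 is the same high surrogate ($D802) each time, and only the low surrogate changes from character to character.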

Rick

rick2691

« Reply #19 on: January 17, 2017, 02:53:31 pm »
Bug reported.

Can't we just edit the inc file ourselves? How will I know when they have addressed the issue?

Rick

skalogryz

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2187
    • havefunsoft.com
Re: Entering large Unicode numbers
« Reply #20 on: January 17, 2017, 03:15:14 pm »
You can actually edit the file yourself and make sure that the fix works.

Once you have it working, you could generate a patch file based on your changes and attach it to the bug report. It usually helps to speed up the issue resolution.
Patron Cocoa Widgetset development https://www.patreon.com/skalogryz

rick2691

« Reply #21 on: January 17, 2017, 03:18:11 pm »
That much may already have been done. I included Thaddy's observations and his correction to the inc code.

Rick

skalogryz

« Reply #22 on: January 17, 2017, 03:39:07 pm »
Quote
That much may already have been done. I included Thaddy's observations and his correction to the inc code.

Not that easy :)
A certain amount of bureaucracy must be fulfilled. For example:
First you apply the code correction to your own copy of the code.
Then you need to create a patch file and attach it to the bug report.

...or you could simply add a link to this thread to the bug report. But in that case it might not speed up the process.

Thaddy

« Reply #23 on: January 17, 2017, 03:47:11 pm »
There are also multiple ways to patch this, e.g.:
- 1: adapt the code body with code similar to that in ustrings.
- 2: have LazUTF8.UnicodeToUTF8Inline defer to one of the ustrings UnicodeToUTF8 routines.
Re 2: this will be shot down for now, but will probably be done in the future anyway.
Re 1: that's doable, but make sure you pass the tests, or add tests for it.
At the minimum, provide your own example with your patch and show that the new code works ;)

There's a page on the wiki on how to properly create a patch, but with svn it is rather simple.
« Last Edit: January 17, 2017, 03:57:18 pm by Thaddy »

Thaddy

« Reply #24 on: January 17, 2017, 04:01:34 pm »
@rick2691 IMPORTANT!

You filed the bug report, but you did it the other way around...
The GOOD code is the ustrings code... from FPC itself
The BAD code is the LazUTF8 code... from Lazarus.

So this is not an FPC issue but a Lazarus issue: LazUTF8.UnicodeToUTF8 cannot handle code points that need 4 bytes, as is obvious from the sources (it cannot go higher than three).

So the issue is NOT an FPC issue and should be reported as a Lazarus issue.

I have placed a remark on the bug tracker with a request to move it to Lazarus.
[edit]
It seems that the LazUTF8 one works on unicode32 (UTF-32 code points). This causes problems because Unicode in FPC is unicode16 (UTF-16) with mode delphiunicode or modeswitch unicodestrings.
And there is no string type that can default string to string32 ;) By the way, unicode16 can still hold code points that take 4 bytes (as a surrogate pair), so you still cannot assume 2 bytes per char.
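
A small sketch of that last point, using FPC's own UTF8Encode on a UnicodeString that holds one surrogate pair. The helper name PairToUtf8Hex is made up for this illustration, and it assumes FPC 3.x, where UTF8Encode accepts a UnicodeString and returns a RawByteString.

```pascal
program SurrogateToUtf8;
{$mode objfpc}{$H+}

// Hypothetical helper: encode one UTF-16 surrogate pair with the RTL's
// UTF8Encode and return the resulting bytes as space-separated hex.
function PairToUtf8Hex(HighUnit, LowUnit: Word): string;
var
  U: UnicodeString;
  S: RawByteString;
  I: Integer;
begin
  SetLength(U, 2);            // two UTF-16 code units ...
  U[1] := WideChar(HighUnit);
  U[2] := WideChar(LowUnit);  // ... that together form one code point
  S := UTF8Encode(U);
  Result := '';
  for I := 1 to Length(S) do
  begin
    if I > 1 then
      Result := Result + ' ';
    Result := Result + HexStr(Ord(S[I]), 2);
  end;
end;

begin
  // $D802 $DD00 is the surrogate pair for U+10900.
  WriteLn(PairToUtf8Hex($D802, $DD00));  // F0 90 A4 80
end.
```

The string has Length 2 in UTF-16 but is a single character, and it becomes four bytes in UTF-8. A routine that takes a code point, as the LazUTF8 one appears to, would expect the pair to have been combined into $10900 before the call.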
« Last Edit: January 17, 2017, 04:30:20 pm by Thaddy »

rick2691

« Reply #25 on: January 17, 2017, 04:30:46 pm »
So you are saying that the ustrings method should be patched into the LazUTF8 unit?

Rick

rick2691

« Reply #26 on: January 17, 2017, 04:41:31 pm »
I suppose you have seen the post...


Mattias Gaertner   (manager)
2017-01-17 16:32

LazUTF8.UnicodeToUTF8Inline works here for 0..$fffff. Our test suite runs as well.

Did you only test in RichMemo or did you test the function directly?
Maybe the problem is in RichMemo?

skalogryz

« Reply #27 on: January 17, 2017, 05:22:15 pm »
hmm...
I can see that RichMemo (I'm not positive whether RichEdit itself does it) recognizes a surrogate pair character as two characters (see the screenshot).

However, I cannot reproduce the "jumping" direction issue. Whenever I insert a character, it always advances as an RTL character. (Windows 10)
« Last Edit: January 17, 2017, 05:37:49 pm by skalogryz »

Thaddy

« Reply #28 on: January 17, 2017, 05:25:41 pm »
Quote
So you are saying that the ustrings method should be patched into the LazUTF8 unit?

Rick

From the response on the bugtracker, the ustrings version will eventually be used.

rick2691

« Reply #29 on: January 17, 2017, 05:41:42 pm »
Quote
However, I cannot reproduce the "jumping" direction issue. Whenever I insert a character it always advances as RTL character.

skalogryz,

From the image I assume that you are on Win10. Is your application also 64-bit?

It may be a Win32 widget problem.

Rick