Entering large Unicode numbers

Thaddy

Hero Member
Posts: 14369
Sensorship about opinions does not belong here.

Re: Entering large Unicode numbers

« Reply #15 on: January 17, 2017, 01:35:44 pm »

Quote from: rick2691 on January 17, 2017, 01:28:20 pm

thaddy,

So what would be the math for that?

Rick

The math is easy. UTF8Key must be able to hold 4 bytes, not two. UnicodeToUTF8 itself handles the surrogate pair correctly.

Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

rick2691

Sr. Member
Posts: 444

« Reply #16 on: January 17, 2017, 01:44:11 pm »

I don't think that I would call it "correctly".

Additionally, I did another series and got different increments. This one decrements by 2's for a while, then decrements by 1, then it starts incrementing by 2's. When reloaded, it still displays the characters in the same improper order they were in before being saved.

\u-10238?\u-8960?\u-10238?\u-8958?\u-10238?\u-8956?\u-10238?\u-8954?\u-10238?\u-8953?\u-10238?\u-8955?\u-10238?\u-8957?\u-10238?\u-8959?

Rick

Windows 11, LAZ 2.0.10, FPC 3.2.0, SVN 63526, i386-win32-win32/win64, using windows unit

Thaddy

« Reply #17 on: January 17, 2017, 01:47:20 pm »

Aha. Found it. UnicodeToUTF8Inline in LazUTF8 is buggy and CAN'T handle that code point. UnicodeToUTF8 calls UnicodeToUTF8Inline...
It can't handle high surrogate pairs.
The UnicodeToUTF8 routines from FPC itself, in ustrings, are correct and CAN handle that codepoint.
These can handle high surrogate pairs.

Just examining the code in ustrings.inc and in LazUTF8 immediately makes clear where the bug is, btw:
Right way:

Code: Pascal

// ustrings.inc snippet:
              $800..$d7ff,$e000..$ffff:
                begin
                  if j+2>=MaxDestBytes then
                    break;
                  Dest[j]:=char($e0 or (lw shr 12));
                  Dest[j+1]:=char($80 or ((lw shr 6) and $3f));
                  Dest[j+2]:=char($80 or (lw and $3f));
                  inc(j,3);
                end;
              $d800..$dbff:
                {High Surrogates}
                begin
                  if j+3>=MaxDestBytes then
                    break;
                  if (i+1<sourcechars) and
                     (word(Source[i+1]) >= $dc00) and
                     (word(Source[i+1]) <= $dfff) then
                    begin
                      { $d7c0 is ($d800 - ($10000 shr 10)) }
                      lw:=(longword(lw-$d7c0) shl 10) + (ord(source[i+1]) xor $dc00);
                      Dest[j]:=char($f0 or (lw shr 18));
                      Dest[j+1]:=char($80 or ((lw shr 12) and $3f));
                      Dest[j+2]:=char($80 or ((lw shr 6) and $3f));
                      Dest[j+3]:=char($80 or (lw and $3f));
                      inc(j,4);
                      inc(i);
                    end;
                end;
              end;
            inc(i);
          end;

Versus the wrong way:

Code: Pascal

//lazutf8 snippet:
    $800..$ffff:
      begin
        Result:=3;
        Buf[0]:=char(byte($e0 or (CodePoint shr 12)));
        Buf[1]:=char(byte((CodePoint shr 6) and $3f) or $80);
        Buf[2]:=char(byte(CodePoint and $3f) or $80);
      end;
    $10000..$10ffff:
      begin
        Result:=4;
        Buf[0]:=char(byte($f0 or (CodePoint shr 18)));
        Buf[1]:=char(byte((CodePoint shr 12) and $3f) or $80);
        Buf[2]:=char(byte((CodePoint shr 6) and $3f) or $80);
        Buf[3]:=char(byte(CodePoint and $3f) or $80);
      end;
  else
    Result:=0;

Spot the Loony

Feel free to use it for your bug report...

« Last Edit: January 17, 2017, 02:03:10 pm by Thaddy »

rick2691

« Reply #18 on: January 17, 2017, 02:01:53 pm »

Good work. I also found this in RTF manual...

Code:

\uN 
 
This keyword represents a single Unicode character that has no equivalent ANSI representation
based on the current ANSI code page. N represents the Unicode character value expressed as a
decimal number.
This keyword is followed immediately by equivalent character(s) in ANSI representation. In this
way, old readers will ignore the \uN keyword and pick up the ANSI representation properly.
When this keyword is encountered, the reader should ignore the next N characters, where N
corresponds to the last \ucN value encountered.
As with all RTF keywords, a keyword-terminating space may be present (before the ANSI
characters) that is not counted in the characters to skip. While this is not likely to occur (or
recommended), a \bin keyword, its argument, and the binary data that follows are considered
one character for skipping purposes. If an RTF scope delimiter character (that is, an opening or
closing brace) is encountered while scanning skippable data, the skippable data is considered to
be ended before the delimiter. This makes it possible for a reader to perform some rudimentary
error recovery. To include an RTF delimiter in skippable data, it must be represented using the
appropriate control symbol (that is, escaped with a backslash,) as in plain text. Any RTF control
word or symbol is considered a single character for the purposes of counting skippable
characters.
An RTF writer, when it encounters a Unicode character with no corresponding ANSI character,
should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode
character translates into an ANSI character stream with a count of bytes differing from the
current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN
keyword to notify the reader of the change.
RTF control words generally accept signed 16-bit numbers as arguments. For this reason,
Unicode values greater than 32767 must be expressed as negative numbers.
 
\ucN 
 
This keyword represents the number of bytes corresponding to a given \uN Unicode character.
This keyword may be used at any time, and values are scoped like character properties. That is,
a \ucN keyword applies only to text following the keyword, and within the same (or deeper)
nested braces. On exiting the group, the previous \uc value is restored. The reader must keep a
stack of counts seen and use the most recent one to skip the appropriate number of characters
when it encounters a \uN keyword. When leaving an RTF group that specified a \uc value, the
reader must revert to the previous value. A default of 1 should be assumed if no \uc keyword
has been seen in the current or outer scopes.
A common practice is to emit no ANSI representation for Unicode characters within a Unicode
destination context (that is, inside a \ud destination). Typically, the destination will contain a
\uc0 control sequence. There is no need to reset the count on leaving the \ud destination,
because the scoping rules will ensure the previous value is restored.
 

The statement "RTF control words generally accept signed 16-bit numbers as arguments. For this reason, Unicode values greater than 32767 must be expressed as negative numbers" explains the "\u-" codes. Each \u-10238? followed by another \u-xxxxx? is a surrogate pair, and each negative number is the unsigned 16-bit value minus 65536 (so -10238 stands for 55298, which is $D802).
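
The signed 16-bit rule quoted above can be undone by adding 65536 to any negative \uN argument, which gives back the UTF-16 code unit. Here is a minimal sketch of that arithmetic in plain FPC (SysUtils only), using the first pair from my dump. The names RtfArgToCodeUnit and SurrogatesToCodePoint are made up for this example; they are not part of RichMemo, LazUTF8, or the RTL.

```pascal
program RtfSurrogates;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Map a signed 16-bit RTF \uN argument back to the unsigned UTF-16 code unit.
// (Hypothetical helper, named for this example only.)
function RtfArgToCodeUnit(N: LongInt): Word;
begin
  if N < 0 then
    Result := Word(N + 65536)
  else
    Result := Word(N);
end;

// Combine a high and a low surrogate into one Unicode code point.
// Returns 0 if the two units do not form a valid surrogate pair.
function SurrogatesToCodePoint(HighUnit, LowUnit: Word): Cardinal;
begin
  Result := 0;
  if (HighUnit >= $D800) and (HighUnit <= $DBFF) and
     (LowUnit >= $DC00) and (LowUnit <= $DFFF) then
    Result := $10000 + ((Cardinal(HighUnit) - $D800) shl 10)
                     + (Cardinal(LowUnit) - $DC00);
end;

var
  HighUnit, LowUnit: Word;
begin
  HighUnit := RtfArgToCodeUnit(-10238);   // first \u value of each pair
  LowUnit  := RtfArgToCodeUnit(-8960);    // second \u value of the first pair
  WriteLn('high surrogate = $', IntToHex(LongInt(HighUnit), 4));  // $D802
  WriteLn('low surrogate  = $', IntToHex(LongInt(LowUnit), 4));   // $DD00
  WriteLn('code point     = U+',
          IntToHex(LongInt(SurrogatesToCodePoint(HighUnit, LowUnit)), 5)); // U+10900
end.
```

By the same arithmetic the second values in the dump (-8958, -8956, ...) come out as $DD02, $DD04 and so on, so only the low surrogate changes from one character to the next.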

Rick

rick2691

« Reply #19 on: January 17, 2017, 02:53:31 pm »

Bug reported.

Can't we just edit the inc file ourselves? How will I know when they have addressed the issue?

Rick

skalogryz

Global Moderator
Hero Member
Posts: 2770

« Reply #20 on: January 17, 2017, 03:15:14 pm »

You can actually edit the file yourself and make sure that the fix works.

Once you have it working, you can generate a patch file from your changes and attach it to the bug report. That usually helps speed up the resolution of the issue.

rick2691

« Reply #21 on: January 17, 2017, 03:18:11 pm »

That much may already have been done. I included Thaddy's observations and his correction to the inc code.

Rick

skalogryz

« Reply #22 on: January 17, 2017, 03:39:07 pm »

Quote from: rick2691 on January 17, 2017, 03:18:11 pm

That much may already have been done. I included Thaddy's observations and his correction to the inc code.

It's not that easy.

A certain bureaucracy must be fulfilled, such as:
First apply the code correction to your own copy of the code.
Then create a patch file and attach it to the bug report.

...or you could simply add a link to this thread to the bug report, but in that case it might not speed up the process.

Thaddy

« Reply #23 on: January 17, 2017, 03:47:11 pm »

There are also multiple ways to patch this, e.g.:
- 1 Adapt the code body with code similar to that in ustrings. That's doable, but make sure you pass the tests or add tests for it.
- 2 Defer LazUTF8.UnicodeToUTF8Inline to one of the ustrings UnicodeToUTF8 routines. This will be shot down for now, but will probably be done in the future anyway.
At a minimum, provide your own example with your patch and show that the new code works.

There's a page on the wiki about how to properly create a patch, but with svn it is rather simple.

« Last Edit: January 17, 2017, 03:57:18 pm by Thaddy »

Thaddy

« Reply #24 on: January 17, 2017, 04:01:34 pm »

@rick2691 IMPORTANT!

You filed the bug report, but you did it the other way around...
The GOOD code is the ustrings code... from FPC itself
The BAD code is the LazUTF8 code... from Lazarus.

So this is not an fpc issue, but a Lazarus issue; LazUTF8.UnicodeToUTF8 can not handle 4 byte extended codepoints as is obvious from the sources. (It can not go higher than three)

So the issue is NOT an FPC issue and should be reported as a Lazarus issue.

I have placed a remark on the bug tracker with a request to move it to Lazarus.
[edit]
It seems that the LazUTF8 one takes a UTF-32 value (a full code point). This causes problems because Unicode strings in FPC are UTF-16 with mode delphiunicode or modeswitch unicodestrings.
And there is no string type that can make the default string a 32-bit string.

UTF-16 can still represent code points that need 4 bytes, by the way, so you still cannot assume 2 bytes per char.
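
To make that last point concrete, here is a rough sketch in plain FPC (SysUtils only). It takes the two UTF-16 code units $D802 and $DD00, which is what \u-10238 and \u-8960 from the RTF dump earlier in the thread stand for, combines them into one code point, and encodes that code point as UTF-8. PairToCodePoint and CodePointToUtf8 are names invented for this example; this is not the ustrings or LazUTF8 code, just the same bit arithmetic written out by hand.

```pascal
program SurrogateToUtf8;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Combine a UTF-16 high/low surrogate pair into one code point.
// Returns 0 if the two units do not form a valid pair.
function PairToCodePoint(HighUnit, LowUnit: Word): Cardinal;
begin
  Result := 0;
  if (HighUnit >= $D800) and (HighUnit <= $DBFF) and
     (LowUnit >= $DC00) and (LowUnit <= $DFFF) then
    Result := $10000 + ((Cardinal(HighUnit) - $D800) shl 10)
                     + (Cardinal(LowUnit) - $DC00);
end;

// Encode one code point (not a UTF-16 code unit) as UTF-8 bytes.
// Returns an empty array for lone surrogates and values above $10FFFF.
function CodePointToUtf8(CP: Cardinal): TBytes;
begin
  Result := nil;
  if ((CP >= $D800) and (CP <= $DFFF)) or (CP > $10FFFF) then
    Exit;
  if CP <= $7F then
  begin
    SetLength(Result, 1);
    Result[0] := Byte(CP);
  end
  else if CP <= $7FF then
  begin
    SetLength(Result, 2);
    Result[0] := Byte($C0 or (CP shr 6));
    Result[1] := Byte($80 or (CP and $3F));
  end
  else if CP <= $FFFF then
  begin
    SetLength(Result, 3);
    Result[0] := Byte($E0 or (CP shr 12));
    Result[1] := Byte($80 or ((CP shr 6) and $3F));
    Result[2] := Byte($80 or (CP and $3F));
  end
  else
  begin
    SetLength(Result, 4);
    Result[0] := Byte($F0 or (CP shr 18));
    Result[1] := Byte($80 or ((CP shr 12) and $3F));
    Result[2] := Byte($80 or ((CP shr 6) and $3F));
    Result[3] := Byte($80 or (CP and $3F));
  end;
end;

var
  S: UnicodeString;
  Utf8: TBytes;
  i: Integer;
begin
  // One character, U+10900, stored as two UTF-16 code units.
  SetLength(S, 2);
  S[1] := WideChar($D802);
  S[2] := WideChar($DD00);

  Utf8 := CodePointToUtf8(PairToCodePoint(Ord(S[1]), Ord(S[2])));

  WriteLn('UTF-16 code units: ', Length(S));     // 2 units, i.e. 4 bytes
  WriteLn('UTF-8 bytes:       ', Length(Utf8));  // 4
  for i := 0 to High(Utf8) do
    Write(IntToHex(LongInt(Utf8[i]), 2), ' ');   // F0 90 A4 80
  WriteLn;
end.
```

Feeding either surrogate on its own into a routine that expects a code point gives an invalid result, which is why the pair has to be combined before a code-point based encoder is called.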

« Last Edit: January 17, 2017, 04:30:20 pm by Thaddy »

rick2691

« Reply #25 on: January 17, 2017, 04:30:46 pm »

So you are saying that the ustrings method should be patched into the LazUTF8 unit?

Rick

rick2691

« Reply #26 on: January 17, 2017, 04:41:31 pm »

I suppose you have seen the post...

Mattias Gaertner (manager)
2017-01-17 16:32

LazUTF8.UnicodeToUTF8Inline works here for 0..$fffff. Our test suite runs as well.

Did you only test in RichMemo or did you test the function directly?
Maybe the problem is in RichMemo?

skalogryz

« Reply #27 on: January 17, 2017, 05:22:15 pm »

hmm...
I can see that RichMemo (I'm not sure whether RichEdit itself does this) treats a surrogate-pair character as two characters (see the screenshot).

However, I cannot reproduce the "jumping" direction issue. Whenever I insert a character, it always advances as an RTL character (Windows 10).

[Attachment: cursorpos.png]

« Last Edit: January 17, 2017, 05:37:49 pm by skalogryz »

Thaddy

« Reply #28 on: January 17, 2017, 05:25:41 pm »

Quote from: rick2691 on January 17, 2017, 04:30:46 pm

So you are saying that the ustrings method should be patched into the LazUTF8 unit?

Rick

Judging from the response on the bug tracker, the ustrings version will eventually be used.

rick2691

« Reply #29 on: January 17, 2017, 05:41:42 pm »

Quote

However, I cannot reproduce the "jumping" direction issue. Whenever I insert a character it always advances as RTL character.

skalogryz,

From the image I assume that you are on Win10. Is your application also 64-bit?

It may be a Win32 widget problem.

Rick
