Author Topic: Entering large Unicode numbers  (Read 15970 times)

rick2691

  • Sr. Member
  • ****
  • Posts: 374
Entering large Unicode numbers
« on: January 16, 2017, 06:12:04 pm »
I have found a Unicode index that appears to be larger than a Win32 cardinal number...

Case UTF8Key of
'b' : UTF8Key:= UnicodeToUTF8(67841);
end;

Is there another way to handle the code?

Rick



Windows 10, LAZ 1.6.4, FPC 3.0.2, SVN 54278, i386-win32-win32/win64, forms use windows unit

wp

  • Hero Member
  • *****
  • Posts: 5831
Re: Entering large Unicode numbers
« Reply #1 on: January 16, 2017, 06:20:13 pm »
Max cardinal is 4,294,967,295, so 67841 fits easily. You seem to assume that a Unicode character in UTF-16 always consists of 2 bytes, but it can take 4 bytes as well (http://wiki.lazarus.freepascal.org/LCL_Unicode_Support#Unicode_essentials).
Lazarus trunk / fpc 3.0.4 / all 32-bit on Win-10

Thaddy

  • Hero Member
  • *****
  • Posts: 8174
Re: Entering large Unicode numbers
« Reply #2 on: January 16, 2017, 06:36:40 pm »
Yes, 4294967295, you were quicker.
Here we have a code point that needs 4 bytes in UTF-8.
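
A quick way to check the byte count is sketched below. It assumes a console program built with the LazUtils package available (for the LazUTF8 unit); the program name is made up for illustration.

```pascal
program Utf8LengthDemo;  // hypothetical name, for illustration only
{$mode objfpc}{$H+}

uses
  SysUtils, LazUTF8;  // LazUTF8 ships with Lazarus' LazUtils package

var
  S: string;
  I: Integer;
begin
  // 67841 = $10901, PHOENICIAN LETTER BET
  S := UnicodeToUTF8(67841);

  // Length counts bytes, UTF8Length counts code points
  WriteLn('Bytes:       ', Length(S));      // 4
  WriteLn('Code points: ', UTF8Length(S));  // 1

  // Show the individual UTF-8 bytes in hex
  for I := 1 to Length(S) do
    Write(IntToHex(Ord(S[I]), 2), ' ');     // F0 90 A4 81
  WriteLn;
end.
```

So the Cardinal parameter is not the limit here; the value simply encodes to four bytes instead of the one to three that characters below 65536 need.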
Read the manuals and if you are a professional get a proper education in computer science. Makes the forum a lot cleaner.

rick2691

Re: Entering large Unicode numbers
« Reply #3 on: January 16, 2017, 07:13:36 pm »
Thaddy & wp

I don't follow your point.

67841 is Decimal Unicode...

Decimal    Hex        Name
67841     10901     PHOENICIAN LETTER BET

I assume there is a problem because decimal unicode typically has only 4 digits.

Rick

skalogryz

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2180
    • havefunsoft.com
Re: Entering large Unicode numbers
« Reply #4 on: January 16, 2017, 08:35:32 pm »
Quote from rick2691: "Is there another way to handle the code?"
Is it not working for you?
Try suppressing the key by setting it to #0 and just adding the converted text,
the same way you do for the RTF-specific stuff.

Note: I think you mean that the code is bigger than a WideChar (2 bytes) rather than a Cardinal (4 bytes).
Such a character needs a surrogate pair, i.e. more than a single WideChar, to be handled within WinAPI.
And that's what's causing the issue.
Patron Cocoa Widgetset development https://www.patreon.com/skalogryz

skalogryz

Re: Entering large Unicode numbers
« Reply #5 on: January 16, 2017, 08:41:05 pm »
Actually, it might be a bug in the (Win32) LCL.

According to WinAPI, WM_UNICHAR uses UTF-32. Thus passing a character that needs a surrogate pair in UTF-16 should be possible.

I'd recommend creating a bug report so the issue is not forgotten.
« Last Edit: January 16, 2017, 08:49:01 pm by skalogryz »

rick2691

Re: Entering large Unicode numbers
« Reply #6 on: January 16, 2017, 08:52:50 pm »
skalogryz,

the unit has... function UnicodeToUTF8(CodePoint: cardinal): string; 
...so it is expecting a cardinal.

Per your suggestion I tried...

  #39 : begin
          UTF8Key := #0;
          UTF8Key := UnicodeToUTF8(67840);  // alf (apostrophe key)
        end;
  'b' : begin
          UTF8Key := #0;
          UTF8Key := UnicodeToUTF8(67841);  // bta
        end;
  'g' : begin
          UTF8Key := #0;
          UTF8Key := UnicodeToUTF8(67842);  // gma
        end;

It did not set the UTF8Key at all.

I also tried...

  #39 : PutRTFstr('\u67840?');   // alf (apostrophe key)
  'b' : PutRTFstr('\u67841?');   // bta
  'g' : PutRTFstr('\u67842?');   // gma
  'd' : PutRTFstr('\u67843?');   // dlt
  'h' : PutRTFstr('\u67844?');   // heh
  'w' : PutRTFstr('\u67845?');   // wau

This types the correct character, but it does not advance right-to-left.
I suspect that the font does not have an RTL algorithm.
It is an archaic language. They may be treating it as a symbol font.

I also put it in an RTL paragraph. It did not change anything.

Rick

rick2691

Re: Entering large Unicode numbers
« Reply #7 on: January 16, 2017, 08:54:48 pm »
I have never filed a bug report. How do you do that?

Rick

skalogryz

Re: Entering large Unicode numbers
« Reply #8 on: January 16, 2017, 09:18:51 pm »
Quote from rick2691: "I have never filed a bug report. How do you do that?"
instructions

skalogryz

Re: Entering large Unicode numbers
« Reply #9 on: January 16, 2017, 09:25:16 pm »
Quote from rick2691: "the unit has... function UnicodeToUTF8(CodePoint: cardinal): string; ...so it is expecting a cardinal. Per your suggestion I tried..."
Try this:
Code: Pascal
procedure TForm1.RichMemo1UTF8KeyPress(Sender: TObject; var UTF8Key: TUTF8Char);
begin
  if UTF8Key = 'b' then begin
    UTF8Key := #0;                               // suppress the original key
    RichMemo1.SelText := UnicodeToUTF8($10901);  // insert the Phoenician letter instead
    RichMemo1.SelLength := 0;
    // todo: adjust SelStart as you find fit for the character
  end;
end;
« Last Edit: January 16, 2017, 09:26:51 pm by skalogryz »

rick2691

Re: Entering large Unicode numbers
« Reply #10 on: January 16, 2017, 10:17:55 pm »
I tried the above (still omitting the SelStart change). I typed the characters, and they tried to advance RTL, but they would only advance on every other character. It behaved the same whether or not it was in an RTL paragraph. Very odd behavior.

Code: Pascal
case UTF8Key of  // PutRTFstr('\u-10238?\u-'+intToStr(67840-58882)+'?');

  #39 : begin
          UTF8Key := #0;
          PageMemo.SelText := UnicodeToUTF8($10900);
          PageMemo.SelLength := 0;
        end;  // UTF8Key := UnicodeToUTF8(67840);  // alf (apostrophe key)
  'b' : begin
          UTF8Key := #0;
          PageMemo.SelText := UnicodeToUTF8($10901);
          PageMemo.SelLength := 0;
        end;  // UTF8Key := UnicodeToUTF8(67841);  // bta
  'g' : begin
          UTF8Key := #0;
          PageMemo.SelText := UnicodeToUTF8($10902);
          PageMemo.SelLength := 0;
        end;  // UTF8Key := UnicodeToUTF8(67842);  // gma
  'd' : begin
          UTF8Key := #0;
          PageMemo.SelText := UnicodeToUTF8($10903);
          PageMemo.SelLength := 0;
        end;  // UTF8Key := UnicodeToUTF8(67843);  // dlt
  'h' : begin
          UTF8Key := #0;
          PageMemo.SelText := UnicodeToUTF8($10904);
          PageMemo.SelLength := 0;
        end;  // UTF8Key := UnicodeToUTF8(67844);  // heh
  'w' : begin
          UTF8Key := #0;
          PageMemo.SelText := UnicodeToUTF8($10905);
          PageMemo.SelLength := 0;
        end;  // UTF8Key := UnicodeToUTF8(67845);  // wau
  'z' : begin
          UTF8Key := #0;
          PageMemo.SelText := UnicodeToUTF8($10906);
          PageMemo.SelLength := 0;
        end;  // UTF8Key := UnicodeToUTF8(67846);  // zyn
  'x' : begin
          UTF8Key := #0;
          PageMemo.SelText := UnicodeToUTF8($10907);
          PageMemo.SelLength := 0;
        end;  // UTF8Key := UnicodeToUTF8(67847);  // xet

Rick

skalogryz

Re: Entering large Unicode numbers
« Reply #11 on: January 16, 2017, 10:43:40 pm »
I'm wondering if there's a trick with Positioning. I.e. if RichMemo sees a single character inserted as two characters. You might see that by outputting SelStart after the entry.

rick2691

Re: Entering large Unicode numbers
« Reply #12 on: January 17, 2017, 01:22:00 pm »
I inserted a ShowMessage before and after each key hit. It shows the same position before and after, and the position advanced for the next key.

What shows as the cursor position on screen, however, is the cursor to the right, then to the left, then right again, then left again, etc. It does not matter in what order the keys are hit. So it isn't the key typed, nor is it SelStart. What happens in code is not what you see on the screen.

Additionally, what is saved is in the correct typing order, but it is not the natural Unicode values for the keys. Each key is preceded by a \u-10238?, followed by a key code: for $10900 it is \u-8953?, then for $10901 it is \u-8955?. They have the "\u-" prefix (which is something that I have never seen; I have also never seen a \u-10238? as a leader for each key), and they jump in increments of 2.

\par
\f0\fs20\u-10238?\u-8953?\u-10238?\u-8955?\u-10238?\u-8957?\u-10238?\u-8959?\u-10238?\u-8960?\u-10238?\u-8958?\u-10238?\u-8956?\u-10238?\u-8954?\par
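
The negative numbers look like UTF-16 code units written as signed 16-bit values, which is how RTF's \u keyword writes values above 32767. Below is a minimal sketch of decoding one pair under that assumption. The helper names are made up for illustration and are not part of LCL or RichMemo.

```pascal
program DecodeRtfUnicode;  // hypothetical name, for illustration only
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: map a signed RTF \uN value back to 0..65535.
function RtfToCodeUnit(N: Integer): Word;
begin
  if N < 0 then
    Result := Word(N + 65536)
  else
    Result := Word(N);
end;

// Hypothetical helper: combine a UTF-16 lead/trail surrogate pair
// into a single Unicode code point.
function SurrogatesToCodePoint(Lead, Trail: Word): Cardinal;
begin
  Result := $10000 + ((Cardinal(Lead) - $D800) shl 10)
                   + (Cardinal(Trail) - $DC00);
end;

var
  LeadUnit, TrailUnit: Word;
begin
  LeadUnit  := RtfToCodeUnit(-10238);
  TrailUnit := RtfToCodeUnit(-8959);

  // The two 16-bit code units
  WriteLn(IntToHex(Integer(LeadUnit), 4), ' ',
          IntToHex(Integer(TrailUnit), 4));   // D802 DD01

  // The code point they encode together
  WriteLn('U+', IntToHex(Int64(SurrogatesToCodePoint(LeadUnit, TrailUnit)), 5));  // U+10901
end.
```

By the same arithmetic, \u-10238? is the lead surrogate shared by every letter in this block, \u-8960? would map back to $10900, and \u-8953? to $10907, so the order of the saved pairs may be worth double-checking.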

All of this is with the NotoSansPhoenician-Regular font.

Rick
« Last Edit: January 17, 2017, 01:27:06 pm by rick2691 »

Thaddy

Re: Entering large Unicode numbers
« Reply #13 on: January 17, 2017, 01:26:36 pm »
Quote from rick2691: "67841 is Decimal Unicode..."
The maximum for a 2-byte code unit is 65535. The value 67841 cannot fit in two bytes; it needs four. In UTF-16, that means a surrogate pair, for starters.
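
As a worked example of the arithmetic, following the standard UTF-16 surrogate rules (subtract 65536, then split the remaining 20 bits into two 10-bit halves):

$$
\begin{aligned}
v &= 67841 - 65536 = 2305 \\
\text{high surrogate} &= 55296 + \lfloor 2305 / 1024 \rfloor = 55296 + 2 = 55298 \quad (\texttt{\$D802}) \\
\text{low surrogate} &= 56320 + (2305 \bmod 1024) = 56320 + 257 = 56577 \quad (\texttt{\$DD01})
\end{aligned}
$$

Read as signed 16-bit numbers, these two values are 55298 - 65536 = -10238 and 56577 - 65536 = -8959.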
« Last Edit: January 17, 2017, 01:31:36 pm by Thaddy »

rick2691

Re: Entering large Unicode numbers
« Reply #14 on: January 17, 2017, 01:28:20 pm »
thaddy,

So what would be the math for that?

Rick