Title: Entering large Unicode numbers
Post by: rick2691 on January 16, 2017, 06:12:04 pm
I have found a Unicode index that appears to be larger than a Win32 cardinal number...

Case UTF8Key of
'b' : UTF8Key:= UnicodeToUTF8(67841);
end;

Is there another way to handle the code?

Rick



Title: Re: Entering large Unicode numbers
Post by: wp on January 16, 2017, 06:20:13 pm
Max cardinal is 4,294,967,295. You seem to assume that a unicode character (UTF-16) consists of 2 bytes only, but it can contain 4 bytes as well (http://wiki.lazarus.freepascal.org/LCL_Unicode_Support#Unicode_essentials).
Title: Re: Entering large Unicode numbers
Post by: Thaddy on January 16, 2017, 06:36:40 pm
Yes, 4294967295, you were quicker.
Here we have a 4 byte UTF8 codepoint.
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 16, 2017, 07:13:36 pm
Thaddy & wp

I don't follow your point.

67841 is Decimal Unicode...

Decimal    Hex        Name
67841     10901     PHOENICIAN LETTER BET

I assume there is a problem because decimal unicode typically has only 4 digits.

Rick
Title: Re: Entering large Unicode numbers
Post by: skalogryz on January 16, 2017, 08:35:32 pm
Quote
Is there another way to handle the code?

Is it not working for you?
Try to suppress the key by setting it to #0 and just add the converted text,
the same way you do for RTF-specific stuff.

Note: I think you're saying that the code is bigger than a WideChar (2 bytes) rather than a Cardinal (4 bytes).
A surrogate character needs more than a single WideChar to be handled within WinAPI.
And that's what's causing the issue.
Title: Re: Entering large Unicode numbers
Post by: skalogryz on January 16, 2017, 08:41:05 pm
Actually it might be a bug in (Win32)LCL.

According to WinAPI, WM_UNICHAR (https://msdn.microsoft.com/en-us/library/windows/desktop/ms646288(v=vs.85).aspx) is using UTF-32. Thus assigning a unicode surrogate pair is possible.

I'd recommend to create a bug report so the issue is not forgotten.
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 16, 2017, 08:52:50 pm
skalogryz,

the unit has... function UnicodeToUTF8(CodePoint: cardinal): string; 
...so it is expecting a cardinal.

Per your suggestion I tried...

          #39 : begin
                  UTF8Key:= #0;
                  UTF8Key:= UnicodeToUTF8(67840);  // alf  (apostrophe key)
                end;
          'b' : begin
                  UTF8Key:= #0;
                  UTF8Key:= UnicodeToUTF8(67841);  // bta
                end;
          'g' : begin
                  UTF8Key:= #0;
                  UTF8Key:= UnicodeToUTF8(67842);  // gma
                end;

It did not set the UTF8Key at all.

I also tried...

          #39 : PutRTFstr('\u67840?');   // alf  (apostrophe key)
          'b' : PutRTFstr('\u67841?');   // bta
          'g' : PutRTFstr('\u67842?');   // gma
          'd' : PutRTFstr('\u67843?');   // dlt
          'h' : PutRTFstr('\u67844?');   // heh
          'w' : PutRTFstr('\u67845?');   // wau

This types the correct character, but it does not advance right-to-left.
I suspect that the font does not have an RTL algorithm.
It is an archaic language. They may be treating it as a symbol font.

I also put it in an RTL paragraph. It did not change anything.

Rick
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 16, 2017, 08:54:48 pm
I have never filed a bug report. How do you do that?

Rick
Title: Re: Entering large Unicode numbers
Post by: skalogryz on January 16, 2017, 09:18:51 pm
Quote
I have never filed a bug report. How do you do that?

instructions (http://wiki.freepascal.org/How_do_I_create_a_bug_report)
Title: Re: Entering large Unicode numbers
Post by: skalogryz on January 16, 2017, 09:25:16 pm
Quote
the unit has... function UnicodeToUTF8(CodePoint: cardinal): string;
...so it is expecting a cardinal.

Per your suggestion I tried...

try this:
Code: Pascal
procedure TForm1.RichMemo1UTF8KeyPress(Sender: TObject; var UTF8Key: TUTF8Char);
begin
  if UTF8Key = 'b' then begin
    UTF8Key := #0;
    RichMemo1.SelText:=UnicodeToUTF8($10901);
    RichMemo1.SelLength:=0;
    //todo: adjust SelStart as you find fit for the character
  end;
end;
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 16, 2017, 10:17:55 pm
I tried the above (still omitting the selstart change). I typed the characters, and they tried to advance by RTL, but they would only advance with every other character. It did the same whether it was in an RTL paragraph or not. A very odd behavior.

Code: Pascal
Case UTF8Key of           //PutRTFstr('\u-10238?\u-'+intToStr(67840-58882)+'?');

          #39 : begin
                  UTF8Key:= #0;
                  PageMemo.SelText:=UnicodeToUTF8($10900);
                  PageMemo.SelLength:=0;
                end;  //UTF8Key:= UnicodeToUTF8(67840);  // alf  (apostrophe key)
          'b' : begin
                  UTF8Key:= #0;
                  PageMemo.SelText:=UnicodeToUTF8($10901);
                  PageMemo.SelLength:=0;
                end;  //UTF8Key:= UnicodeToUTF8(67841);  // bta
          'g' : begin
                  UTF8Key:= #0;
                  PageMemo.SelText:=UnicodeToUTF8($10902);
                  PageMemo.SelLength:=0;
                end;  //UTF8Key:= UnicodeToUTF8(67842);  // gma
          'd' : begin
                  UTF8Key:= #0;
                  PageMemo.SelText:=UnicodeToUTF8($10903);
                  PageMemo.SelLength:=0;
                end;  //UTF8Key:= UnicodeToUTF8(67843);  // dlt
          'h' : begin
                  UTF8Key:= #0;
                  PageMemo.SelText:=UnicodeToUTF8($10904);
                  PageMemo.SelLength:=0;
                end;  //UTF8Key:= UnicodeToUTF8(67844);  // heh
          'w' : begin
                  UTF8Key:= #0;
                  PageMemo.SelText:=UnicodeToUTF8($10905);
                  PageMemo.SelLength:=0;
                end;  //UTF8Key:= UnicodeToUTF8(67845);  // wau
          'z' : begin
                  UTF8Key:= #0;
                  PageMemo.SelText:=UnicodeToUTF8($10906);
                  PageMemo.SelLength:=0;
                end;  //UTF8Key:= UnicodeToUTF8(67846);  // zyn
          'x' : begin
                  UTF8Key:= #0;
                  PageMemo.SelText:=UnicodeToUTF8($10907);
                  PageMemo.SelLength:=0;
                end;  //UTF8Key:= UnicodeToUTF8(67847);  // xet

Rick
Title: Re: Entering large Unicode numbers
Post by: skalogryz on January 16, 2017, 10:43:40 pm
I'm wondering if there's a trick with Positioning. I.e. if RichMemo sees a single character inserted as two characters. You might see that by outputting SelStart after the entry.
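
The question above, whether RichMemo sees one inserted character as two, can be checked against a simple count. The following is a minimal sketch, assuming the control counts caret positions in UTF-16 code units (suspected in this thread, not confirmed) and that the input is valid UTF-8. The name Utf16UnitCount is hypothetical and is not part of RichMemo or LazUTF8.

```pascal
program CountUnits;
{$mode objfpc}{$H+}

// Hypothetical helper: count the UTF-16 code units that a valid UTF-8
// string would occupy. Not part of RichMemo or LazUTF8.
function Utf16UnitCount(const S: AnsiString): Integer;
var
  I: Integer;
  B: Byte;
begin
  Result := 0;
  for I := 1 to Length(S) do
  begin
    B := Ord(S[I]);
    if (B and $C0) = $80 then
      Continue;              // continuation byte, counted with its lead byte
    if B >= $F0 then
      Inc(Result, 2)         // 4-byte sequence: needs a surrogate pair
    else
      Inc(Result);           // 1- to 3-byte sequence: one code unit
  end;
end;

begin
  WriteLn(Utf16UnitCount('b'));                 // plain ASCII letter
  // #$F0#$90#$A4#$81 is the UTF-8 form of U+10901
  WriteLn(Utf16UnitCount(#$F0#$90#$A4#$81));
end.
```

If the assumption holds, SelStart would need to move by 2 rather than 1 after inserting a Phoenician letter.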
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 17, 2017, 01:22:00 pm
I inserted a ShowMessage before and after each key-hit. It shows the same position before and after, and an advanced position for the next key.

What shows as the cursor position on screen, however, is the cursor to the right, then to the left, then right again, then left again, and so on. It does not matter in what order the keys are hit. So it isn't the key typed, nor is it SelStart. What happens in code is not what you see on the screen.

Additionally, what is saved is in the correct typing order, but it is not the natural Unicode values for the keys. Each key is preceded by a \u-10238?, and then a key code... for $10900 it is \u-8953?, then for $10901 it is \u-8955? ...they have the "\u-" prefix (which is something that I have never seen, and likewise I have never seen a \u-10238? as a leader for each key) and they are jumping in increments of 2.

\par
\f0\fs20\u-10238?\u-8953?\u-10238?\u-8955?\u-10238?\u-8957?\u-10238?\u-8959?\u-10238?\u-8960?\u-10238?\u-8958?\u-10238?\u-8956?\u-10238?\u-8954?\par

All of this is with the NotoSansPhoenician-Regular font.

Rick
Title: Re: Entering large Unicode numbers
Post by: Thaddy on January 17, 2017, 01:26:36 pm
Quote
67841 is Decimal Unicode...

The max for a 2 byte codepoint = 65535. The value 67841 cannot fit in a two byte codepoint. It needs a 4 byte codepoint. For starters, it also needs a surrogate pair in UTF-16.
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 17, 2017, 01:28:20 pm
thaddy,

So what would be the math for that?

Rick
Title: Re: Entering large Unicode numbers
Post by: Thaddy on January 17, 2017, 01:35:44 pm
Quote
thaddy,

So what would be the math for that?

Rick

The math is easy. UTF8Key should be capable of holding 4 bytes, not two. UnicodeToUTF8 itself handles the surrogate pair correctly.
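
For reference, the arithmetic asked about can be sketched as follows. This is a minimal stand-alone example of the standard UTF-16 surrogate calculation, not code from LazUTF8 or FPC. The procedure name ToSurrogates is hypothetical.

```pascal
program SurrogateMath;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: split a code point above $FFFF into its UTF-16
// surrogate pair. The caller must pass a value in $10000..$10FFFF.
procedure ToSurrogates(CodePoint: Cardinal; out HighUnit, LowUnit: Word);
var
  Offset: Cardinal;
begin
  Offset := CodePoint - $10000;                 // leaves a 20-bit value
  HighUnit := Word($D800 + (Offset shr 10));    // top 10 bits
  LowUnit := Word($DC00 + (Offset and $3FF));   // bottom 10 bits
end;

var
  H, L: Word;
begin
  ToSurrogates($10901, H, L);                   // PHOENICIAN LETTER BET
  WriteLn('U+10901 -> ', IntToHex(H, 4), ' ', IntToHex(L, 4));
end.
```

For 67841 ($10901) this gives the pair $D802 $DD01. Read as signed 16-bit numbers, those are -10238 and -8959, which are among the values that appear in the saved RTF.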
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 17, 2017, 01:44:11 pm
I don't think that I would call it "correctly".

Additionally, I did another series and got different increments. This one decrements by 2's for a while, then decrements by 1, then it starts incrementing by 2's. When reloaded it still displays the characters in the same improper order as they were before being saved.

\u-10238?\u-8960?\u-10238?\u-8958?\u-10238?\u-8956?\u-10238?\u-8954?\u-10238?\u-8953?\u-10238?\u-8955?\u-10238?\u-8957?\u-10238?\u-8959?

Rick
Title: Re: Entering large Unicode numbers
Post by: Thaddy on January 17, 2017, 01:47:20 pm
Aha. Found it. UnicodeToUTF8Inline in LazUTF8 is buggy and CAN'T handle that code point. UnicodeToUTF8 calls UnicodeToUTF8Inline...
It can't handle high surrogate pairs.
The UnicodeToUTF8 routines from FPC itself, in ustrings, are correct and CAN handle that codepoint.
These can handle high surrogate pairs.

Just examining the code in ustrings.inc and in LazUTF8 immediately makes clear where the bug is, btw:
Right way:
Code: Pascal
// ustrings.inc snippet:
      $800..$d7ff,$e000..$ffff:
        begin
          if j+2>=MaxDestBytes then
            break;
          Dest[j]:=char($e0 or (lw shr 12));
          Dest[j+1]:=char($80 or ((lw shr 6) and $3f));
          Dest[j+2]:=char($80 or (lw and $3f));
          inc(j,3);
        end;
      $d800..$dbff:
        {High Surrogates}
        begin
          if j+3>=MaxDestBytes then
            break;
          if (i+1<sourcechars) and
             (word(Source[i+1]) >= $dc00) and
             (word(Source[i+1]) <= $dfff) then
            begin
              { $d7c0 is ($d800 - ($10000 shr 10)) }
              lw:=(longword(lw-$d7c0) shl 10) + (ord(source[i+1]) xor $dc00);
              Dest[j]:=char($f0 or (lw shr 18));
              Dest[j+1]:=char($80 or ((lw shr 12) and $3f));
              Dest[j+2]:=char($80 or ((lw shr 6) and $3f));
              Dest[j+3]:=char($80 or (lw and $3f));
              inc(j,4);
              inc(i);
            end;
        end;
    end;
    inc(i);
  end;

VS wrong way:
Code: Pascal
//lazutf8 snippet:
    $800..$ffff:
      begin
        Result:=3;
        Buf[0]:=char(byte($e0 or (CodePoint shr 12)));
        Buf[1]:=char(byte((CodePoint shr 6) and $3f) or $80);
        Buf[2]:=char(byte(CodePoint and $3f) or $80);
      end;
    $10000..$10ffff:
      begin
        Result:=4;
        Buf[0]:=char(byte($f0 or (CodePoint shr 18)));
        Buf[1]:=char(byte((CodePoint shr 12) and $3f) or $80);
        Buf[2]:=char(byte((CodePoint shr 6) and $3f) or $80);
        Buf[3]:=char(byte(CodePoint and $3f) or $80);
      end;
  else
    Result:=0;

Spot the Loony ;)

Feel free to use it for your bug report...

Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 17, 2017, 02:01:53 pm
Good work. I also found this in the RTF manual...

Quote
\uN

This keyword represents a single Unicode character that has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number.

This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly.

When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.

As with all RTF keywords, a keyword-terminating space may be present (before the ANSI characters) that is not counted in the characters to skip. While this is not likely to occur (or recommended), a \bin keyword, its argument, and the binary data that follows are considered one character for skipping purposes. If an RTF scope delimiter character (that is, an opening or closing brace) is encountered while scanning skippable data, the skippable data is considered to be ended before the delimiter. This makes it possible for a reader to perform some rudimentary error recovery. To include an RTF delimiter in skippable data, it must be represented using the appropriate control symbol (that is, escaped with a backslash,) as in plain text. Any RTF control word or symbol is considered a single character for the purposes of counting skippable characters.

An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with a count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.

RTF control words generally accept signed 16-bit numbers as arguments. For this reason, Unicode values greater than 32767 must be expressed as negative numbers.

\ucN

This keyword represents the number of bytes corresponding to a given \uN Unicode character. This keyword may be used at any time, and values are scoped like character properties. That is, a \ucN keyword applies only to text following the keyword, and within the same (or deeper) nested braces. On exiting the group, the previous \uc value is restored. The reader must keep a stack of counts seen and use the most recent one to skip the appropriate number of characters when it encounters a \uN keyword. When leaving an RTF group that specified a \uc value, the reader must revert to the previous value. A default of 1 should be assumed if no \uc keyword has been seen in the current or outer scopes.

A common practice is to emit no ANSI representation for Unicode characters within a Unicode destination context (that is, inside a \ud destination). Typically, the destination will contain a \uc0 control sequence. There is no need to reset the count on leaving the \ud destination, because the scoping rules will ensure the previous value is restored.


The statement "RTF control words generally accept signed 16-bit numbers as arguments. For this reason, Unicode values greater than 32767 must be expressed as negative numbers" explains the "\u-" codes. The \u-10238? followed by another \u-xxxxx code is therefore a pairing (a UTF-16 surrogate pair), and each negative number is the 16-bit value minus 65536.

Rick
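
To make the signed 16-bit rule concrete, here is a minimal sketch that turns two \uN values from the saved RTF back into a code point. It assumes the two values form a valid high/low surrogate pair. The function names RtfToUnit and PairToCodePoint are hypothetical and belong to no RTF library.

```pascal
program RtfSurrogates;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: RTF writes \uN as a signed 16-bit number,
// so a negative N stands for N + 65536.
function RtfToUnit(N: LongInt): LongInt;
begin
  if N < 0 then
    Result := N + 65536
  else
    Result := N;
end;

// Hypothetical helper: combine a high and a low surrogate into one
// code point. Assumes the inputs really are a valid pair.
function PairToCodePoint(HighUnit, LowUnit: LongInt): LongInt;
begin
  Result := $10000 + ((HighUnit - $D800) shl 10) + (LowUnit - $DC00);
end;

var
  H, L: LongInt;
begin
  H := RtfToUnit(-10238);   // first value of each pair in the saved file
  L := RtfToUnit(-8959);    // one of the second values in the saved file
  WriteLn(IntToHex(H, 4), ' ', IntToHex(L, 4), ' -> U+',
          IntToHex(PairToCodePoint(H, L), 5));
end.
```

With these inputs the pair is $D802 $DD01, which combines to U+10901. The constant \u-10238? leader is therefore the shared high surrogate for the whole Phoenician block.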
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 17, 2017, 02:53:31 pm
Bug reported.

Can't we just edit the inc file ourselves? How will I know when they have addressed the issue?

Rick
Title: Re: Entering large Unicode numbers
Post by: skalogryz on January 17, 2017, 03:15:14 pm
You can actually edit the file yourself and make sure that fix would work.

Once you have it working you could generate a patch file based on your changes and attach it to the bug report. It usually helps to speed up the issue resolution.
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 17, 2017, 03:18:11 pm
That much may already have been done. I included Thaddy's observations and his correction to the inc code.

Rick
Title: Re: Entering large Unicode numbers
Post by: skalogryz on January 17, 2017, 03:39:07 pm
Quote
That much may already been done. I included Thaddy's observations and his correction to the inc code.

Not that easy :)
A certain bureaucracy must be fulfilled.
For example:
You want to apply the code correction to your version of the code.
Then you'd need to create a patch (http://wiki.freepascal.org/Creating_A_Patch) file that needs to be attached to the bug report.

...or you could simply add a link to this thread to the bug report. But in that case it might not speed up the process.
Title: Re: Entering large Unicode numbers
Post by: Thaddy on January 17, 2017, 03:47:11 pm
There are also multiple ways to patch this, e.g.:
- 1 adapt the code body with similar code to ustrings
- 2 defer LazUTF8.UnicodeToUTF8Inline  to one of the Ustrings.UnicodeToUTF8 ones.
*2 will for now be shot down but probably done in the future anyway.
*1 That's doable, but make sure you pass the tests or add tests for it. 
At the minimum, provide your own example with your patch and show that the new code works ;)

There's a subject on the wiki on how to properly create a patch, but with svn it is rather simple
Title: Re: Entering large Unicode numbers
Post by: Thaddy on January 17, 2017, 04:01:34 pm
@rick2691 IMPORTANT!

You filed the bug report, but you did it the other way around...
The GOOD code is the ustrings code... from FPC itself
The BAD code is the LazUTF8 code... from Lazarus.

So this is not an fpc issue, but a Lazarus issue; LazUTF8.UnicodeToUTF8 can not handle 4 byte extended codepoints as is obvious from the sources. (It can not go higher than three)

So the issue is NOT an FPC issue and should be reported as a Lazarus issue.

I have placed a remark on the bug tracker with a request to move it to Lazarus.
[edit]
It seems that the LazUTF8 one is unicode32. This causes problems because Unicode in FPC is unicode16 with mode delphiunicode or modeswitch unicode.
And there is no string type that can default string to string32 ;) Unicode16 can still handle 4 byte codepoints by the way, so you still can not assume 2 bytes per char.
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 17, 2017, 04:30:46 pm
So you are saying that the ustrings method should be patched into the LazUTF8 unit?

Rick
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 17, 2017, 04:41:31 pm
I suppose you have seen the post...


Mattias Gaertner   (manager)
2017-01-17 16:32

LazUTF8.UnicodeToUTF8Inline works here for 0..$fffff. Our test suite runs as well.

Did you only test in RichMemo or did you test the function directly?
Maybe the problem is in RichMemo?
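
One way to separate the encoder from RichMemo is to look at the raw bytes with no control involved. The following is a minimal sketch: EncodeSupplementary is a hypothetical stand-alone function that mirrors the $10000..$10ffff branch of the lazutf8 snippet quoted earlier in the thread. It is not the LazUTF8 routine itself. To test the real function, call UnicodeToUTF8 from LazUTF8 in its place.

```pascal
program Utf8Direct;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical stand-alone encoder for the 4-byte range only.
// Returns an empty string for code points outside $10000..$10FFFF.
function EncodeSupplementary(CodePoint: Cardinal): AnsiString;
begin
  Result := '';
  if (CodePoint < $10000) or (CodePoint > $10FFFF) then
    Exit;
  SetLength(Result, 4);
  Result[1] := Chr(Byte($F0 or (CodePoint shr 18)));
  Result[2] := Chr(Byte($80 or ((CodePoint shr 12) and $3F)));
  Result[3] := Chr(Byte($80 or ((CodePoint shr 6) and $3F)));
  Result[4] := Chr(Byte($80 or (CodePoint and $3F)));
end;

var
  S: AnsiString;
  I: Integer;
begin
  S := EncodeSupplementary($10901);
  // Print each byte in hex; U+10901 should come out as F0 90 A4 81
  for I := 1 to Length(S) do
    Write(IntToHex(Ord(S[I]), 2), ' ');
  WriteLn;
end.
```

If the bytes are right here but the character is wrong in the control, the problem lies after the encoding step.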
Title: Re: Entering large Unicode numbers
Post by: skalogryz on January 17, 2017, 05:22:15 pm
hmm...
I can see that RichMemo (I'm not positive whether RichEdit itself does it) recognizes a surrogate pair character as two characters (see the screenshot).

However, I cannot reproduce the "jumping" direction issue. Whenever I insert a character it always advances as RTL character. (Windows 10)
Title: Re: Entering large Unicode numbers
Post by: Thaddy on January 17, 2017, 05:25:41 pm
Quote
So you are saying that the ustrings method should be patched into the LazUTF8 unit?

Rick

From the response on the bugtracker, the ustrings version will eventually be used.
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 17, 2017, 05:41:42 pm
Quote
However, I cannot reproduce the "jumping" direction issue. Whenever I insert a character it always advances as RTL character.

skalogryz,

By the image I am assuming that you are in Win10. Is your application also 64bit?

It may be that it is a Win32 widget problem.

Rick
Title: Re: Entering large Unicode numbers
Post by: skalogryz on January 17, 2017, 05:46:21 pm
Quote
By the image I am assuming that you are in Win10. Is your application also 64bit?

It may be that it is Win32 widget problem.

Yes, it's Win10 but it's a 32-bit application.
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 17, 2017, 06:14:15 pm
I have attached a series of 6 screen shots. It is in an RTL paragraph. I am hitting the same key each time. It is difficult to see the motion with all of them being the same character, but with all 6 you might see it. The caret shifts around, going right and left, and the actual key is appearing beside the caret (not at the far left end of the series).

Rick

Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 17, 2017, 06:36:19 pm
This is the same demonstration, but it is with a LTR paragraph, and I am using new characters. Notice where the caret is (as to right or left) to the new character.
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 17, 2017, 07:13:13 pm
I installed RichMemo in Win10 and it did similar to yours. The image shows the top line as an LTR paragraph, and the lower line is an RTL paragraph. Both the characters and caret moved in an LTR method within both paragraphs. So there is no RTL function in Win10.

Given that there is erratic behavior in WinXP, I don't know if it has RTL function either.

Rick
Title: Re: Entering large Unicode numbers
Post by: skalogryz on January 17, 2017, 07:19:25 pm
I'd think if you try to adjust SelStart manually, you would be able to achieve the correct result after the insertion.

I'd think it is more important to verify that navigation through the entered characters is working as expected.
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 17, 2017, 08:02:16 pm
I don't expect that I can do that with WinXP. It is not indexing the caret by a mathematical method. I can try, but I haven't because (even if I succeed) I also expect that it will not know how to wrap the text. It would probably treat it as LTR... but I can see if that is true.

Rick

Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 17, 2017, 08:30:19 pm
OK, I did that. In doing so I also found that I had some code that was already trying to do that. I had forgotten. It is what was causing the erratic behavior. I apologize. I have been the cause of a lot of trouble.

But I was right about wrapping. It wraps as an LTR paragraph. I don't think that this font has an RTL attribute embedded within it. It is a symbol font, as I once had thought. You can't tell with unicode.

Again, I am embarrassed. Please accept my apology.

Rick
Title: Re: Entering large Unicode numbers
Post by: skalogryz on January 17, 2017, 09:21:15 pm
That's fine.

I think you still found an issue: overwriting the UTF8Key value with a surrogate pair character doesn't work at all.
I tried to fix that on my end, but I found that even sending a UTF-32 value via WM_UNICHAR doesn't work (the inserted character is a "tofu" character). However, setting the same character via SelText works just fine.
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 18, 2017, 06:15:12 pm
I have Phoenician working by using the hexadecimal codes. The cursor point advances with the characters going right to left (no tofu), and it operates that way with both LTR and RTL paragraphs.

The only thing that it does not do is to wrap properly (with both LTR or RTL). It always wraps as if it is English.

I don't know if it is a font problem, or another issue with the unicode index exceeding the range of our compiler. It is probably the latter.

Rick
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 19, 2017, 01:50:26 pm
I have hit the last straw with the Phoenician language. After editing my User Manual for including the Phoenician font, subsequent saves and reloads have created erratic tofu (sometimes keeping the character, and other times rejecting it). Consequently, I am suspending the operations for Phoenician.

Although we don't have any extant Phoenician documents, it is because they wrote on papyrus and skins. They lived on the coast lands of the Mediterranean Sea, and the humidity had caused all of their lifestyle records to decay. All that remains are short phrases, an array of letters, or single words that were engraved on metal or a stone.

Nevertheless, they were the first people to devise a method for phonetic writing. Hebrew, Syriac, Samaritan, Persian, Canaani, Coptic, Greek, and subsequently even English and Russian were all contrived by adopting the Phoenician method for writing. Moreover, the Hebrew Scriptures were all written and maintained by using the Phoenician script from the time of Moses, and up to the Hebrew exile into Babylon. So I wanted to include it in my application, and also because there are extant Dead Sea Scrolls that were written with it.

I regret that I cannot include it in my application at this time, but I am retaining the code that we have devised (through this Forum exchange), for the chance that it might be reapplied at some point.

It is unfortunate that the Phoenician Unicode has been indexed at values that are higher than the Cardinal Range that we have to work with.

Rick
Title: Re: Entering large Unicode numbers
Post by: skalogryz on January 19, 2017, 02:29:27 pm
Quote
I have hit the last straw with the Phoenician language. After editing my User Manual for including the Phoenician font, subsequent saves and reloads have created erratic tofu (sometimes keeping the character, and other times rejecting it). Consequently, I am suspending the operations for Phoenician.

I presume you save/load on XP... there might be a trick that could help.
Did you try to load the saved RTF in some other editor? (to see if tofu is there or not)
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 19, 2017, 03:31:11 pm
I had not, but I just did. PolyEdit had the same problem, and OpenOffice Writer had different but similar problems.

Rick
Title: Re: Entering large Unicode numbers
Post by: skalogryz on January 19, 2017, 03:39:07 pm
could you please provide the following:
* the saved file
* the expected behavior (how the file should look like when it opens back)
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 19, 2017, 09:22:42 pm
You must go to https://www.google.com/get/noto/ ...then download and install the following font files:

NotoSans-Bold.ttf
NotoSans-BoldItalic.ttf
NotoSans-Italic.ttf
NotoSans-Regular.ttf
NotoSansHebrew-Bold.ttf
NotoSansHebrew-Regular.ttf
NotoSansPhoenician-Regular.ttf
NotoSansSyriacEstrangela-Regular.ttf

They are hard to find because an alphabetic order ignores the dash character.

Attached, herewith, are 2 RTF files... "CmdBlue Key-Map Tofu by Font Binding.rtf" and "CmdBlue Key-Map Tofu Binding Removed.rtf".

"by font binding" is the file with switched fonts.
It is for your comparison with an ASCII editor.

"binding removed" is where I edited the RTF file to reverse the font binding.
It is for your loading with an RTF editor.

At the start of the document is a table of consonants and descriptions.

On the far left column is a "y". Its row, at the 5th and 6th column is the tofu.

Further down, at the far left column, is a "t".

At the 4th column of that row there is tofu before the Phoenician character.

Do not save anything unless you save it with another name.

If you look at the "by font binding" file with an ASCII editor you will find that it has switched the font to \f10, which is the Hebrew font. It should be \f0, which is the English font.

The same, with the "binding removed" file in the RTF editor, if you click the tofu it will say that it is "Noto Sans Hebrew". It should be "Noto Sans".

It is because RichEdit does not know how to process the Phoenician font... and it is very bad at what it tries to do. It is showing as tofu because there are no English characters in the "Noto Sans Hebrew" font, and it ignores that it is designated as "Noto Sans". It thinks it knows better.

Of course, Font Binding had been eliminated by the Noto Sans family of fonts. This is only happening because the Unicode index for Phoenician has made it choke.

Rick
Title: Re: Entering large Unicode numbers
Post by: Bart on January 22, 2017, 01:49:38 am
Quote
Aha. Found it. UnicodeToUTF8Inline in LazUTF8 is buggy and CAN'T handle that code point. UnicodeToUTF8 calls UnicodeToUTF8Inline...
It can't handle high surrogate pairs.

Thaddy: please see my notes and sample application in Issue #31243 (http://bugs.freepascal.org/view.php?id=31243).
AFAICS they all pass OK.

@rick2691: please respond to the bugtracker.

Bart
Title: Re: Entering large Unicode numbers
Post by: rick2691 on January 24, 2017, 04:32:42 pm
skalogryz,

Bart closed the bug report.

Quote
Anyhow, this has nothing to do with LazUtf8.UnicodeToUtf8() function.
Resolving as "no change required".

I don't think that I will ever go through that process again. Too techy for me.

Rick

Title: Re: Entering large Unicode numbers
Post by: Bart on January 24, 2017, 06:28:59 pm
Quote
Bart closed the bug report.

Quote
Anyhow, this has nothing to do with LazUtf8.UnicodeToUtf8() function.
Resolving as "no change required".

I don't think that I will ever go through that process again. Too techy for me.

Please do not get discouraged by this.

It is quite all right to file bug reports.
If you are not really sure whether something is a bug or not, then it is better to ask on the forum or mailing list first.

Your bugreport claimed that LazUtf8.UnicodeToUTF8 function was wrong.
This turned out to be not the case.
Therefore I marked the issue as "no change required".
(You, as the original reporter, should close the issue.)

You may very well have an issue w.r.t. RichMemo, but this component is not part of the Lazarus distribution AFAIK.
I think it is part of Lazarus-CCR.

If indeed RichMemo does not handle this correctly (ask for confirmation on this from other users), then file a new bug report in the Lazarus-CCR section of the bugtracker.
You should always attach a sample project (minimal case scenario, sources only, zipped), so that developers can easily reproduce your problem.

Please read http://wiki.lazarus.freepascal.org/Tips_on_writing_bug_reports.

Bart