
Author Topic: Detecting a Unicode language family  (Read 14446 times)

rick2691

  • Sr. Member
  • ****
  • Posts: 444
Detecting a Unicode language family
« on: January 26, 2017, 04:18:39 pm »
I have been detecting which language is active by checking whether a default font is activated when you click at a character or use one of the positional keys (home, end, arrow etc).

But now I am using the same font for both English and Greek. So I am having to check the character index by the method below...

if (ord(QryChr)=206) or
  (ord(QryChr)=207) or                       // ord(char)is decimal index of character...
  (ord(QryChr)=225)    // accented       // 206, 207 or 225 for Noto Sans font
  then SetGreek                                 // 214 or 215 for Noto Sans Hebrew
  else SetEnglish;                               // 220 or 221 for Noto Sans Syriac 

As per the side notes. Greek reports any of 3 indexes: 206, 207 & 225. Hebrew would do 214 or 215, and Syriac does 220 or 221. I realize that these are partial indexes (only the first elements of pairing elements), but for now it is enough to get the job done because they are different from each other.

Nevertheless, I would like to retrieve the full Unicode identity of the character, i.e. its Unicode index value (such as decimal 1488 or 1490). If I can get this data I won't be limited to my default fonts, or be subject to quirky values, such as how Greek reports 3 different ordinal values instead of only 2 (and there might be another that I haven't encountered).

I have tried to find information about this on the Internet, and I have not been able to find anything that is useful. Likewise, I don't know how to get the additional character elements.

Any help will be appreciated.

Rick

Windows 11, LAZ 2.0.10, FPC 3.2.0, SVN 63526, i386-win32-win32/win64, using windows unit

skalogryz

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2770
    • havefunsoft.com
Re: Detecting a Unicode language family
« Reply #1 on: January 26, 2017, 04:22:14 pm »
How do you get QryChr ?

rick2691

Re: Detecting a Unicode language family
« Reply #2 on: January 26, 2017, 06:24:57 pm »
Sorry. I didn't think it would matter.

Code: Pascal
procedure TCmdForm.ClickPageMemo(Sender: TObject);
var CharCnt: longint;
    StartPos,StartLength,QryPos: longint;
    QryChr: char;
    QryStr: string;
    ShiftOn: boolean;
begin     (* UnicodeToUtf8 Utf8ToUnicode UTF8Encode UTF8Decode AnsiToUtf8 Utf8ToAnsi *)
  ReportPosition;
  if (CurrentLin=1) and (CurrentPos=1)
     then PageMemo.GetTextAttributes(PageMemo.SelStart, SelFontFormat)   // get in place
     else PageMemo.GetTextAttributes(PageMemo.SelStart-1, SelFontFormat);  // get by prior
  //SHOWMESSAGE('Lin='+inttostr(CurrentLin)+' Pos='+inttostr(CurrentPos));

  ShiftOn:= IsShiftKeyPressed;
  if not ShiftOn then
     begin
     //activate keyboard language
     if (SelFontFormat.name=DefEng) and (DefEng=DefGrk) then
         begin
         StartLength:= PageMemo.SelLength;
         StartPos:= PageMemo.SelStart;
         if StartPos=0 then QryPos:= 0
                       else QryPos:= StartPos-1;
         SendMessage(PageMemo.Handle, EM_HIDESELECTION, 1, 0); // hide selection
         PageMemo.SelStart:= QryPos;
         PageMemo.SelLength:= 1;
         QryStr:= PageMemo.SelText;
         QryChr:= QryStr[1];
         while (((QryChr=' ') or (QryChr= #9)) and (QryPos>0)) do
               begin
               QryPos:= QryPos-1;
               PageMemo.SelStart:= QryPos;
               PageMemo.SelLength:= 1;
               QryStr:= PageMemo.SelText;
               QryChr:= QryStr[1];
               end;
         PageMemo.SelStart:= StartPos;
         PageMemo.SelLength:= StartLength;
         SendMessage(PageMemo.Handle, EM_HIDESELECTION, 0, 0); // show selection
         if (ord(QryChr)=206) or
            (ord(QryChr)=207) or                   // ord(char)is decimal index of character...
            (ord(QryChr)=225)    // accented       // 206, 207 or 225 for Noto Sans font
            then SetGreek                          // 214 or 215 for Noto Sans Hebrew
            else SetEnglish;                       // 220 or 221 for Noto Sans Syriac
         end;
     if (SelFontFormat.name=DefGrk) and not (DefEng=DefGrk) then SetGreek;
     if (SelFontFormat.name=DefHeb) then SetHebrew;
     if (SelFontFormat.name=DefSyr) then SetSyriac;

(* // for testing
PageMemo.SelStart:= PageMemo.SelStart-1;
PageMemo.SelLength:= 1;
QryStr:= PageMemo.SelText;
QryChr:= QryStr[1];
PageMemo.SelStart:= PageMemo.SelStart+1;
PageMemo.SelLength:= 0;
showmessage(inttostr(ord(QryChr)));
*)

     end; // else showmessage('ShiftOn');

  PrepareToolbar;
  PageMemo.Repaint; // clears previous click and highlight shadow
end;

I just updated the code listing by giving you the entire procedure.

Rick
« Last Edit: January 26, 2017, 06:30:04 pm by rick2691 »

skalogryz

Re: Detecting a Unicode language family
« Reply #3 on: January 26, 2017, 07:39:13 pm »
Yes, that matters.
I'd certainly recommend using Unicode/WideStrings over UTF-8 strings, simply because it's easier.

See the attached example.

You might want to copy/paste some Syriac, Greek, Hebrew and English characters into it.

Code: Pascal
procedure TForm1.PageMemoClick(Sender: TObject);
var
  QryText : WideString;
  WC      : WideChar;
  lang    : string;
begin
  PageMemo.SelLength:=1;
  QryText :=UTF8Decode(PageMemo.SelText);
  if QryText<>'' then begin
    WC:=QryText[1];
    case wc of
      #$0590..#$05FF:
        lang:='hebrew';
      #$0700..#$074F:
        lang:='syriac';
      #$1F00..#$1FFF,  // this is Greek Extended (polytonic, accented forms)
      #$0370..#$03FF:  // this is Greek and Coptic
         lang:='greek';
    else
      lang:='';
    end;

    if lang<>'' then
      Caption:=IntToHex(Word(WC),4)+' '+lang
    else
      Caption:=IntToHex(Word(WC),4);
  end else
    Caption:='no text?';
end;
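
For reference, the values 206, 207 and 225 from the first post look like UTF-8 lead bytes rather than font-specific indexes: 206 ($CE) and 207 ($CF) start the two-byte sequences for U+0380..U+03FF, and 225 ($E1) starts three-byte sequences that include U+1F00..U+1FFF. Likewise 214/215 ($D6/$D7) cover the Hebrew block and 220/221 ($DC/$DD) the Syriac range. The following is only a minimal console sketch of how the full code point can be rebuilt from those bytes, assuming the text is UTF-8 (as it is in LCL controls). FirstCodePoint is a hypothetical helper written for illustration, not an RTL or LCL function; in real code UTF8Decode, as used above, does the same job.

```pascal
program Utf8CodePointDemo;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of the RTL/LCL): decodes the first UTF-8
// sequence in S and returns its code point, or -1 if S is empty or the
// sequence is truncated/malformed. It is a sketch: it does not reject
// overlong encodings or encoded surrogates.
function FirstCodePoint(const S: string): LongInt;
var
  B0, Need, I: Integer;
begin
  Result := -1;
  if S = '' then Exit;
  B0 := Ord(S[1]);
  if B0 < $80 then
  begin
    Result := B0;                 // plain ASCII, one byte
    Exit;
  end;
  if (B0 and $E0) = $C0 then      // 110xxxxx: one continuation byte follows
  begin
    Result := B0 and $1F;
    Need := 1;
  end
  else if (B0 and $F0) = $E0 then // 1110xxxx: two continuation bytes follow
  begin
    Result := B0 and $0F;
    Need := 2;
  end
  else if (B0 and $F8) = $F0 then // 11110xxx: three continuation bytes follow
  begin
    Result := B0 and $07;
    Need := 3;
  end
  else
    Exit;                         // stray continuation byte or invalid lead
  if Length(S) < Need + 1 then
  begin
    Result := -1;                 // sequence is cut short
    Exit;
  end;
  for I := 2 to Need + 1 do
  begin
    if (Ord(S[I]) and $C0) <> $80 then
    begin
      Result := -1;               // not a 10xxxxxx continuation byte
      Exit;
    end;
    Result := (Result shl 6) or (Ord(S[I]) and $3F);
  end;
end;

begin
  // #$CE#$B1 is the UTF-8 encoding of Greek small alpha (U+03B1)
  WriteLn(FirstCodePoint(#$CE#$B1));
  // #$D7#$90 is the UTF-8 encoding of Hebrew alef (U+05D0)
  WriteLn(FirstCodePoint(#$D7#$90));
end.
```

This also explains why Greek showed three different first bytes: the basic letters fall under two lead bytes, and the accented (polytonic) forms live in a separate block that needs three-byte sequences.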

rick2691

Re: Detecting a Unicode language family
« Reply #4 on: January 26, 2017, 09:26:35 pm »
skalogryz, thanks for the help, and nicely done.

Here is how I applied your method...
Code: Pascal
procedure TCmdForm.ClickPageMemo(Sender: TObject);
var StartPos,StartLength,QryPos: longint;
    QryCode : WideString;
    WC      : WideChar;
    lang    : string;
    QryChr: char;
    QryStr: string;
    ShiftOn : boolean;
begin
  ReportPosition;
  if (CurrentLin=1) and (CurrentPos=1)
     then PageMemo.GetTextAttributes(PageMemo.SelStart, SelFontFormat)   // get in place
     else PageMemo.GetTextAttributes(PageMemo.SelStart-1, SelFontFormat);  // get by prior
  ShiftOn:= IsShiftKeyPressed;
  if not ShiftOn then
     begin
     //activate keyboard language
     StartLength:= PageMemo.SelLength;
     StartPos:= PageMemo.SelStart;
     if StartPos=0 then QryPos:= 0
                         else QryPos:= StartPos-1;

     SendMessage(PageMemo.Handle, EM_HIDESELECTION, 1, 0); // hide selection
     PageMemo.SelStart:= QryPos;
     PageMemo.SelLength:=1;
     QryStr:= PageMemo.SelText;
     QryChr:= QryStr[1];
     while (((QryChr=' ') or (QryChr=#9)) and (QryPos>0)) do
           begin
           QryPos:= QryPos-1;
           PageMemo.SelStart:= QryPos;
           PageMemo.SelLength:= 1;
           QryStr:= PageMemo.SelText;
           QryChr:= QryStr[1];
           end;
     QryCode:= UTF8Decode(QryStr);  //(PageMemo.SelText);
     PageMemo.SelStart:= StartPos;
     PageMemo.SelLength:= StartLength;
     SendMessage(PageMemo.Handle, EM_HIDESELECTION, 0, 0); // show selection

     if QryCode<>'' then
        begin
        WC:=QryCode[1];
        case wc of
             #$0590..#$05FF: SetHebrew;  // lang:='hebrew';
             #$0700..#$074F: SetSyriac;  // lang:='syriac';
             #$1F00..#$1FFF, // this is Greek Extended (polytonic)
             #$0370..#$03FF: // this is Greek and Coptic
                             SetGreek; // this is either
             #$0020..#$007F: SetEnglish;
             else SetEnglish; // not in list  // CAPTION:= IntToHex(Word(WC),4);
             end;
        end; // CAPTION:= IntToHex(Word(WC),4);
  end;
end;

My implementation inherits attributes from the previous character. Doing so seems logical for me, but it makes a problem. I have to activate EM_HIDESELECTION to stop the flashing, and I have not found a way to hide the caret. The caret dances around as it searches for a valid character to use as a basis for determining the active language.

Actually, I only have to do this for Greek, because it uses the Noto Sans font (which is also the Latin/English font), so I have to skip past spaces and tabs. In Hebrew or Syriac, spaces and tabs are automatically reported as being part of the native language. Not so with Greek: there they are treated as English.

Do you have a way for hiding the caret?

Rick


skalogryz

Re: Detecting a Unicode language family
« Reply #5 on: January 26, 2017, 10:16:24 pm »
Quote from: rick2691 on January 26, 2017, 09:26:35 pm
My implementation inherits attributes from the previous character. Doing so seems logical for me, but it makes a problem. I have to activate EM_HIDESELECTION to stop the flashing, and I have not found a way to hide the caret. The caret dances around as it searches for a valid character to use as a basis for determining the active language.
...
Do you have a way for hiding the caret?

I can think of two ways to do that:

1:
Try using GetStyleRange in conjunction with GetText.
GetStyleRange should find the style range for you.
GetText should extract the text for you without tampering with the current selection. (You might need to update to the latest revision for that, since there was a bug that prevented any text from being extracted.)

2: Use Lines.BeginUpdate / Lines.EndUpdate.
Whenever Lines.BeginUpdate is called, all visual updates are stopped and will not happen (thus the caret would not flicker).

Code:
Lines.BeginUpdate;
try
  ...search for the word/style...
  ... other code...
finally
  Lines.EndUpdate;
end;

It is highly recommended to use try .. finally to start and finish the update operation.
If any exception occurs during the processing, you want to make sure that EndUpdate is still called.
Otherwise your component might look frozen after the exception is processed.
(An exception might be presented to the user as an error dialog, and it might not cause the application to crash.)
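
To see the effect of the try .. finally guard without a form, here is a minimal console sketch. It is an assumption-laden stand-in: it uses a plain TStringList (which has the same BeginUpdate/EndUpdate pair through TStrings) in place of the memo's Lines, and TChangeCounter and NotificationsAfterFailedUpdate are hypothetical names made up for the demo.

```pascal
program UpdateGuardDemo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

type
  // Hypothetical listener that counts OnChange notifications
  TChangeCounter = class
  public
    Count: Integer;
    procedure ListChanged(Sender: TObject);
  end;

procedure TChangeCounter.ListChanged(Sender: TObject);
begin
  Inc(Count);
end;

// Hypothetical demo routine: raises an exception in the middle of an update
// and returns how many change notifications were delivered overall.
function NotificationsAfterFailedUpdate: Integer;
var
  Lines: TStringList;
  Counter: TChangeCounter;
begin
  Lines := TStringList.Create;
  Counter := TChangeCounter.Create;
  try
    Lines.OnChange := @Counter.ListChanged;
    try
      Lines.BeginUpdate;
      try
        Lines.Add('alpha');
        Lines.Add('beta');
        raise Exception.Create('simulated failure mid-update');
      finally
        Lines.EndUpdate;   // still runs although an exception was raised
      end;
    except
      on E: Exception do
        WriteLn('Handled: ', E.Message);
    end;
    // Because EndUpdate ran, the list is no longer locked and this Add
    // notifies listeners again; a skipped EndUpdate would keep it silent.
    Lines.Add('gamma');
    Result := Counter.Count;
  finally
    Lines.Free;            // free the list first so it cannot call the listener
    Counter.Free;
  end;
end;

begin
  WriteLn('Change notifications: ', NotificationsAfterFailedUpdate);
end.
```

With a visual control the same imbalance shows up as a control that never repaints, which is the "frozen" look described above.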


Oh yes... you also need to test the code on Phoenician text too. Since each of its characters occupies more than a single WideChar, you'll need to make your case statement more complex.
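
A rough sketch of what that more complex check might look like: the Phoenician block (U+10900..U+1091F) lies outside the 16-bit range, so in a WideString each letter arrives as a high/low surrogate pair that has to be combined before it can be matched. CodePointAt and ScriptOf are hypothetical helpers made up for this example, and the ranges are simply the Unicode block boundaries, so treat it as a starting point rather than a finished classifier.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: returns the code point that starts at index I of a
// UTF-16 string, combining a high/low surrogate pair when one is present.
// I is assumed to be a valid index (1..Length(W)).
function CodePointAt(const W: UnicodeString; I: Integer): LongWord;
var
  HighUnit, LowUnit: LongWord;
begin
  HighUnit := Ord(W[I]);
  Result := HighUnit;
  if (HighUnit >= $D800) and (HighUnit <= $DBFF) and (I < Length(W)) then
  begin
    LowUnit := Ord(W[I + 1]);
    if (LowUnit >= $DC00) and (LowUnit <= $DFFF) then
      Result := $10000 + ((HighUnit - $D800) shl 10) + (LowUnit - $DC00);
  end;
end;

// Hypothetical classifier working on full code points instead of WideChars
function ScriptOf(CP: LongWord): string;
begin
  case CP of
    $0370..$03FF,               // Greek and Coptic
    $1F00..$1FFF:               // Greek Extended
      Result := 'greek';
    $0590..$05FF:
      Result := 'hebrew';
    $0700..$074F:
      Result := 'syriac';
    $10900..$1091F:             // needs a surrogate pair in UTF-16
      Result := 'phoenician';
  else
    Result := '';
  end;
end;

var
  W: UnicodeString;
  CP: LongWord;
begin
  // U+10900 (the first Phoenician letter) as the surrogate pair D802 DD00
  SetLength(W, 2);
  W[1] := WideChar($D802);
  W[2] := WideChar($DD00);
  CP := CodePointAt(W, 1);
  WriteLn(CP, ' ', ScriptOf(CP));
end.
```

The case statement itself stays simple; the extra work is only in building the full code point before entering it.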
« Last Edit: January 26, 2017, 10:18:44 pm by skalogryz »

rick2691

Re: Detecting a Unicode language family
« Reply #6 on: January 26, 2017, 11:21:52 pm »
Thanks. I expected that you would know a way.

I will try the 2nd option first. It looks easier, and I assume it doesn't need the upgrade. But for the 1st option, what is the revision number for the file?

I am not doing Phoenician at this point, because it triggers Font Bonding. It appears that the RichEdit driver does not like its high Unicode range.

Rick

skalogryz

Re: Detecting a Unicode language family
« Reply #7 on: January 27, 2017, 03:00:16 am »
r5708

rick2691

Re: Detecting a Unicode language family
« Reply #8 on: January 28, 2017, 02:39:34 pm »
I received this error upon compiling with r5708...

win32richmemo.pas(130,20) Error: There is no method in an ancestor class to be overridden: "class TWin32WSCustomRichMemo.GetZoomFactor(const TWinControl,var Double):Boolean;"

Rick

rick2691

Re: Detecting a Unicode language family
« Reply #9 on: February 18, 2017, 08:32:32 pm »
skalogryz,

It is possible that the compile error with r5708 in my previous post is related to the following: my system crashed after that post, and I have had to rebuild my computer. Now I am faced with updating all of your revisions.

Is there a master composite of all the files that I can import as a package? Otherwise I have to apply the updates one by one in the order they were created. No fun, and fraught with chances for mistakes.

Rick

Thaddy

  • Hero Member
  • *****
  • Posts: 14201
  • Probably until I exterminate Putin.
Re: Detecting a Unicode language family
« Reply #10 on: February 18, 2017, 09:07:21 pm »
Quote from: rick2691 on February 18, 2017, 08:32:32 pm
... and I have had to rebuild my computer. Now I am faced with updating all of your revisions.

Unless you are running an 8086 with an early 8087 co-processor and you know about the firestarter virus and how to use it, it is highly unlikely that faulty software would cause you to rebuild your computer.
Again, as usual, provide us the code.... I am willing to try it, I have a safe room for that, camera to follow the explosion, but I don't think I can reproduce it... >:D :'( :-X

Note the picture is an actual hardware failure on a more modern beast. Probably D.T. forgetting that Europe has proper power supply, not a measly 110. POWER!!!!
Silly....
« Last Edit: February 18, 2017, 09:28:55 pm by Thaddy »
Specialize a type, not a var.

Cyrax

  • Hero Member
  • *****
  • Posts: 836
Re: Detecting a Unicode language family
« Reply #11 on: February 18, 2017, 10:47:19 pm »
Quote from: rick2691 on February 18, 2017, 08:32:32 pm
skalogryz,

It is possible that the previous post for compiling by r5708 is related to the following. My system has crashed after that post, and I have had to rebuild my computer. Now I am faced with updating all of your revisions.

Is there a master composite of all the files that I can import as a package? Otherwise I have to update and update by their historical creation. No fun, and fraught with chances for mistakes.

Rick

You can update to the latest revision in a single step. There is no need to update revision by revision unless you are doing some bug hunting.

rick2691

Re: Detecting a Unicode language family
« Reply #12 on: February 19, 2017, 12:42:11 pm »
@Thaddy, my statement about r5708 and my computer crash was misleading. I was intending to suggest that the problem with r5708 was on account of a system problem in my computer.

@Cyrax, I had looked for that option but did not find one. Can you tell me where the Single-Step update is located?

Rick

Cyrax

Re: Detecting a Unicode language family
« Reply #13 on: February 19, 2017, 02:21:20 pm »
Are you using TortoiseSVN? If you are, then all you need to do is right-click the directory where your sources are and select the Update menu item.

See attached pictures for more info.

rick2691

Re: Detecting a Unicode language family
« Reply #14 on: February 19, 2017, 03:42:09 pm »
No, I do not have Tortoise, nor an SVN.

 
