
Topic: How to determine Unicode character type (letter, punctuation, symbol, etc.)

Manlio

  • Full Member
  • ***
  • Posts: 162
  • Pascal dev
I need to parse Unicode strings with text in different languages, and I need to know, for every character (code point) that I parse, whether it is a letter, a digit, a symbol, or punctuation.

I looked through FPC's Unicode-related units but didn't find anything.

Can anyone kindly point me in the right direction?

Thank you!
manlio mazzon gmail

AlexTP

  • Hero Member
  • *****
  • Posts: 2384
    • UVviewsoft
This returns the Unicode general category of a WideChar.
The result is one of the UGC_xxx constants.

Code: Pascal
uses
  Classes, SysUtils,
  fpwidestring,
  StrUtils,
  unicodedata;

// True for '_' and for characters whose category is a letter, mark, or number.
function IsUnicodeWordChar(AChar: WideChar): boolean;
var
  NType: byte;
begin
  if AChar='_' then
    Exit(true);

  // Code units from the low-surrogate range upward are rejected outright.
  if Ord(AChar) >= LOW_SURROGATE_BEGIN then
    Exit(False);

  // Letter, mark, and number categories sort at or below UGC_OtherNumber.
  NType := GetProps(Ord(AChar))^.Category;
  Result := (NType <= UGC_OtherNumber);
end;

// Returns the general category (one of UGC_xxx) of a 16-bit code point.
function GetCateg(c: word): byte;
begin
  Result:= GetProps(c)^.Category;
end;

paweld

  • Hero Member
  • *****
  • Posts: 970
Code: Pascal
uses
  LazUTF8, Character;

procedure TForm1.Button1Click(Sender: TObject);
var
  s, r: String;
  i: Integer;
begin
  s := 'Test 12,3 gęŚlą jaŹń 〇⌀→Ⓣ■•';
  // Caveat (assuming the Character functions index UTF-16 code units):
  // UTF8Length/UTF8Copy count code points, so the indexes only line up
  // while every character is in the BMP, as in this test string.
  for i := 1 to UTF8Length(s) do
  begin
    r := '';
    if IsLetter(s, i) then
    begin
      // Letters that are not lowercase (including caseless ones) land in the else branch.
      if IsLower(s, i) then
        r := ' > Lower letter'
      else
        r := ' > Upper letter'
    end
    else if IsNumber(s, i) then
      r := ' > Number'
    else if IsPunctuation(s, i) then
      r := ' > Punctuation'
    else if IsSeparator(s, i) then
      r := ' > Separator'
    else if IsSymbol(s, i) then
      r := ' > Symbol'
    else
      r := ' > ???';
    // One output line per character: the character followed by its class.
    Memo1.Lines.Add(UTF8Copy(s, i, 1) + r);
  end;
end;
Best regards / Pozdrawiam
paweld

Manlio

  • Full Member
  • ***
  • Posts: 162
  • Pascal dev
Thank you both for the great working code!

For everyone else interested, the "Character" unit is where the magic happens.
manlio mazzon gmail
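
For anyone who wants to try the Character unit without a form, a console version could look roughly like this. It is a sketch, not the code from the posts above: Describe is a hypothetical helper, the sample string is invented, and the loop assumes well-formed UTF-16 (every high surrogate is followed by a low surrogate). It calls the TCharacter class functions with a UnicodeString and a UTF-16 index, and steps over surrogate pairs so that a character outside the BMP is classified once.

```pascal
program CharacterUnitDemo;
{$mode objfpc}{$H+}

uses
  Character;

// Hypothetical helper: names the class of the code point that starts at U[Index].
function Describe(const U: UnicodeString; Index: Integer): string;
begin
  if TCharacter.IsLetter(U, Index) then
    Result := 'letter'
  else if TCharacter.IsNumber(U, Index) then
    Result := 'number'
  else if TCharacter.IsPunctuation(U, Index) then
    Result := 'punctuation'
  else if TCharacter.IsSeparator(U, Index) then
    Result := 'separator'
  else if TCharacter.IsSymbol(U, Index) then
    Result := 'symbol'
  else
    Result := 'other';
end;

var
  U: UnicodeString;
  I: Integer;
begin
  // Built from code units so the example does not depend on source-file encoding:
  // ASCII samples, then U+0119 (e with ogonek), then U+1F600 as a surrogate pair.
  U := 'A7,+ ';
  U := U + UnicodeChar($0119) + UnicodeChar($D83D) + UnicodeChar($DE00);

  I := 1;
  while I <= Length(U) do
  begin
    WriteLn('index ', I, ': ', Describe(U, I));
    // Assumption: input is well-formed UTF-16, so a high surrogate
    // is always followed by its low surrogate.
    if TCharacter.IsHighSurrogate(U[I]) and (I < Length(U)) then
      Inc(I, 2)
    else
      Inc(I);
  end;
end.
```

Printing the index instead of the character itself keeps the output independent of the console code page.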

 
