
Topic: Copying unicode characters from place to another.

el3ctrolyte

Copying unicode characters from place to another.
« on: March 31, 2021, 06:52:04 pm »
I am using this file: https://raw.githubusercontent.com/LukeSmithxyz/voidrice/master/.local/share/larbs/emoji . What I want to do is read the "first" character (the emoji) of each line in the file into a variable, but I can't seem to get this right. Let's say I have the following code:
Code: Pascal
uses
  ..., lazutf8, ...

...

var
  emojis: TStringList;
  emoji: string;
begin
  emojis := TStringList.Create;
  emojis.LoadFromFile('emoji-list.txt');
  emoji := emojis.Strings[0][1];
  ShowMessage(emoji);
end;

When I run this code, ShowMessage shows nothing. If I print the entire line instead of just the first byte, the emoji is there with the rest of the line. I know that Unicode strings use multiple bytes per character, but how can I reliably copy the first character from each line, that is, just the emoji? I am on Linux Mint 20 and I have the right fonts.

el3ctrolyte

Re: Copying unicode characters from place to another.
« Reply #1 on: March 31, 2021, 07:01:48 pm »
This seems to work:
Code: Pascal
var
  emojis: TStringList;
  emoji: UTF8String;
begin
  emojis := TStringList.Create;
  emojis.LoadFromFile('emoji-list.txt');
  emoji := UTF8Copy(emojis.Strings[95], 1, 1);
  ShowMessage(emoji);
end;

But on line 96 of that file there is the skull and crossbones emoji, and it doesn't look right. If I change this line:
Code: Pascal
emoji := UTF8Copy(emojis.Strings[95], 1, 1);

to this line:

Code: Pascal
emoji := UTF8Copy(emojis.Strings[95], 1, 2);

then it looks right. Notice that I am now copying two characters.

What is the best way to copy the emojis?

balazsszekely

Re: Copying unicode characters from place to another.
« Reply #2 on: March 31, 2021, 07:02:51 pm »
@el3ctrolyte

Try this:
Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
var
  emojis: TStringList;
  emoji: string;
  I: Integer;
  P: Integer;
begin
  emojis := TStringList.Create;
  try
    // Load the file explicitly as UTF-8
    emojis.LoadFromFile('emoji-list.txt', TEncoding.UTF8);
    for I := 0 to emojis.Count - 1 do
    begin
      // Everything before the first space is the emoji, however many bytes it takes
      P := Pos(' ', emojis.Strings[I]);
      if P > 0 then
      begin
        emoji := Copy(emojis.Strings[I], 1, P - 1);
        ShowMessage(emoji);
      end;
    end;
  finally
    emojis.Free; // release the list when done
  end;
end;

wp

Re: Copying unicode characters from place to another.
« Reply #3 on: March 31, 2021, 07:14:00 pm »
Or use Juha's unicode iterator in unit LazUnicode:

Code: Pascal
uses
  LazUnicode;

procedure TForm1.Button1Click(Sender: TObject);
var
  emojiList: TStringList;
  ch: String;
  s: String;
  extracted: String;
begin
  emojiList := TStringList.Create;
  try
    emojiList.LoadFromFile('emoji-list.txt');
    s := emojiList.Text;
    extracted := '';
    for ch in s do      // ch is a string representing a utf8 codepoint ("character")
    begin
      if Length(ch) = 1 then  // 1-byte strings are not emojis for sure.
        continue;
      // concatenate all emojis found to a long string to be displayed in a memo
      if extracted = '' then
        extracted := ch
      else
        extracted := extracted + ' ' + ch;
    end;
    // pass the emoji-string to the memo
    Memo1.Lines.Text := extracted;
  finally
    emojiList.Free;
  end;
end;

Martin_fr

Re: Copying unicode characters from place to another.
« Reply #4 on: March 31, 2021, 09:15:50 pm »
Well, first of all, in UTF-8, emoji and many other chars take more than one byte.

However, "SomeString[ i ]" returns one byte (not one char, not one codepoint => one BYTE).
Except, of course, for WideString/UnicodeString, which are 16-bit strings and return one WORD (2 bytes).
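
To illustrate the byte indexing described above, here is a minimal sketch in plain Free Pascal (no LazUTF8), assuming the string holds well-formed UTF-8. `Utf8SeqLen` and `FirstCodepoint` are hypothetical helper names, not library functions, and the sample line is made up to resemble a line of the emoji file.

```pascal
program ByteVsCodepoint;
{$mode objfpc}{$H+}

// Byte length of the UTF-8 sequence that starts with lead byte B.
// Hypothetical helper; assumes well-formed UTF-8 input.
function Utf8SeqLen(B: Byte): Integer;
begin
  if B < $80 then Result := 1
  else if (B and $E0) = $C0 then Result := 2
  else if (B and $F0) = $E0 then Result := 3
  else if (B and $F8) = $F0 then Result := 4
  else Result := 1; // stray continuation byte: step over it
end;

// Return the first codepoint of S as a UTF-8 string (hypothetical helper).
function FirstCodepoint(const S: string): string;
begin
  if S = '' then Exit('');
  Result := Copy(S, 1, Utf8SeqLen(Ord(S[1])));
end;

const
  // U+1F600 GRINNING FACE as raw UTF-8 bytes, followed by ASCII text
  SampleLine = #$F0#$9F#$98#$80' grinning face';
var
  S: string;
begin
  S := SampleLine;
  WriteLn(Ord(S[1]));                  // 240: S[1] is only the lead byte $F0
  WriteLn(Length(FirstCodepoint(S)));  // 4: all four bytes of the emoji
end.
```

In a Lazarus project, `UTF8Copy(S, 1, 1)` from LazUTF8 does the same job as `FirstCodepoint` in this sketch.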

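However, if you use UTF8Copy, it will give you codepoints (not chars).
Similarly with UTF-16: each 16-bit word will (usually) be a codepoint (yet again, not a char). The exception is codepoints above U+FFFF, which includes most emoji; those take two 16-bit words (a surrogate pair).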
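
As a small illustration of the "usually" in the UTF-16 case, the sketch below converts one emoji to a UnicodeString and prints its 16-bit words. It assumes plain Free Pascal and uses `UTF8Decode` from the System unit; the emoji is given as raw bytes so the example does not depend on the source file's encoding.

```pascal
program Utf16Units;
{$mode objfpc}{$H+}
var
  S: RawByteString;
  U: UnicodeString;
begin
  // U+1F600 GRINNING FACE as raw UTF-8 bytes
  S := #$F0#$9F#$98#$80;
  U := UTF8Decode(S);
  WriteLn(Length(U));              // 2: one codepoint, two 16-bit words
  WriteLn(HexStr(Ord(U[1]), 4));   // D83D: high surrogate
  WriteLn(HexStr(Ord(U[2]), 4));   // DE00: low surrogate
end.
```

So `U[1]` on its own is half of a surrogate pair here, not a codepoint.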

All UTF-n are "Unicode Transformation Formats" => a way to e.g. put Unicode into memory...

Unicode (in any encoding, even UTF-32) is split into codepoints. Some codepoints are "combining": they modify other codepoints and, together with them, form a different char.

So when you get a codepoint, you need to check whether it is followed by one or more combining codepoints. That is how you get your "char".
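
The check described above can be sketched roughly as follows, in plain Free Pascal and assuming well-formed UTF-8. All helper names (`Utf8SeqLen`, `DecodeAt`, `IsExtender`, `FirstChar`) are hypothetical, and `IsExtender` covers only a few ranges (combining diacritics, variation selectors, skin-tone modifiers). It is not the full Unicode grapheme segmentation algorithm and does not handle zero-width-joiner sequences. The sample assumes the skull line starts with U+2620 followed by U+FE0F, which would explain why two codepoints were needed.

```pascal
program FirstCluster;
{$mode objfpc}{$H+}

// Byte length of the UTF-8 sequence starting with lead byte B (assumes valid UTF-8).
function Utf8SeqLen(B: Byte): Integer;
begin
  if B < $80 then Result := 1
  else if (B and $E0) = $C0 then Result := 2
  else if (B and $F0) = $E0 then Result := 3
  else if (B and $F8) = $F0 then Result := 4
  else Result := 1;
end;

// Decode the codepoint starting at byte index I of S; Len receives its byte length.
function DecodeAt(const S: string; I: Integer; out Len: Integer): Cardinal;
var
  K: Integer;
begin
  Len := Utf8SeqLen(Ord(S[I]));
  if I + Len - 1 > Length(S) then
  begin
    Len := 1;               // truncated sequence: treat as a single byte
    Exit(Ord(S[I]));
  end;
  case Len of
    1: Result := Ord(S[I]);
    2: Result := Ord(S[I]) and $1F;
    3: Result := Ord(S[I]) and $0F;
  else
    Result := Ord(S[I]) and $07;
  end;
  for K := 1 to Len - 1 do
    Result := (Result shl 6) or (Ord(S[I + K]) and $3F);
end;

// Simplified test: does CP attach to the codepoint before it?
function IsExtender(CP: Cardinal): Boolean;
begin
  Result := ((CP >= $0300) and (CP <= $036F))     // combining diacritical marks
         or ((CP >= $FE00) and (CP <= $FE0F))     // variation selectors
         or ((CP >= $1F3FB) and (CP <= $1F3FF));  // emoji skin-tone modifiers
end;

// Return the first codepoint of S plus any extenders that directly follow it.
function FirstChar(const S: string): string;
var
  I, Len: Integer;
begin
  if S = '' then Exit('');
  DecodeAt(S, 1, Len);
  I := 1 + Len;
  while (I <= Length(S)) and IsExtender(DecodeAt(S, I, Len)) do
    Inc(I, Len);
  Result := Copy(S, 1, I - 1);
end;

const
  // U+2620 SKULL AND CROSSBONES + U+FE0F VARIATION SELECTOR-16, then a description
  SkullLine = #$E2#$98#$A0#$EF#$B8#$8F' skull and crossbones';
begin
  WriteLn(Length(FirstChar(SkullLine)));  // 6: both codepoints, 3 bytes each
end.
```

For a file where every line is "emoji, space, description", cutting at the first space avoids this bookkeeping entirely, including for joiner sequences that this sketch would split.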

Note that on top of that, a font can decide that the exact same char is displayed differently depending on what other chars are next to it (e.g. script fonts (Arabic) or ligatures).

 
