Copying unicode characters from place to another.

el3ctrolyte

Guest

« on: March 31, 2021, 06:52:04 pm »

I am using this file: https://raw.githubusercontent.com/LukeSmithxyz/voidrice/master/.local/share/larbs/emoji . What I want to do is read the "first" character (the emoji) of a line from the file into a variable. I can't seem to get this right. Let's say I have the following code:

Code: Pascal

uses
  ..., lazutf8, ...
 
...
 
var
  emojis:tstringlist;
  emoji:string;
begin
  emojis:=TStringList.Create;
  emojis.LoadFromFile('emoji-list.txt');
  emoji:=emojis.Strings[0][1];
  showmessage(emoji);
end;

When I run this code, ShowMessage shows nothing. If I print out the entire line instead of just the first byte, then the emoji is there with the rest of the line. I know that Unicode strings use multiple bytes per character, but how can I reliably copy the first character from each line, i.e. just the emoji? I am on Linux Mint 20 and I have the right fonts.

el3ctrolyte

Guest

Re: Copying unicode characters from place to another.

« Reply #1 on: March 31, 2021, 07:01:48 pm »

This seems to work:

Code: Pascal

var
  emojis:tstringlist;
  emoji:utf8string;
begin
  emojis:=TStringList.Create;
  emojis.LoadFromFile('emoji-list.txt');
  emoji:=UTF8Copy(emojis.Strings[95],1,1);
  showmessage(emoji);
end;

But on line 96 of that file there is the skull and crossbones emoji, and it doesn't look right. If I change this line:

Code: Pascal

emoji:=UTF8Copy(emojis.Strings[95],1,1);

to this line:

Code: Pascal

emoji:=UTF8Copy(emojis.Strings[95],1,2);

then it looks right. Notice that I am now copying two characters.

What is the best way to copy the emojis?

balazsszekely

Guest

Re: Copying unicode characters from place to another.

« Reply #2 on: March 31, 2021, 07:02:51 pm »

@el3ctrolyte

Try this:

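Code: Pascal

procedure TForm1.Button1Click(Sender: TObject);
var
  emojis: TStringList;
  emoji: string;
  I: Integer;
  P: Integer;
begin
  emojis := TStringList.Create;
  try
    emojis.LoadFromFile('emoji-list.txt', TEncoding.UTF8);
    for I := 0 to emojis.Count - 1 do
    begin
      // Assumes each line has the form '<emoji> <description>'.
      P := Pos(' ', emojis.Strings[I]);
      if P > 0 then
      begin
        // Copy every byte before the first space, however many codepoints that is.
        emoji := Copy(emojis.Strings[I], 1, P - 1);
        ShowMessage(emoji);
      end;
    end;
  finally
    emojis.Free;  // release the list even if loading fails
  end;
end;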

wp

Hero Member
Posts: 11923

Re: Copying unicode characters from place to another.

« Reply #3 on: March 31, 2021, 07:14:00 pm »

Or use Juha's unicode iterator in unit LazUnicode:

Code: Pascal

uses
  LazUnicode;

procedure TForm1.Button1Click(Sender: TObject);
var
  emojiList: TStringList;
  ch: String;
  s: String;
  extracted: String;
begin
  emojiList := TStringList.Create;
  try
    emojiList.LoadFromFile('emoji-list.txt');
    s := emojiList.Text;
    extracted := '';
    for ch in s do      // ch is a string representing a utf8 codepoint ("character")
    begin
      if Length(ch) = 1 then  // 1-byte strings are not emojis for sure.
        continue;
      // concatenate all emojis found to a long string to be displayed in a memo
      if extracted = '' then
        extracted := ch
      else
        extracted := extracted + ' ' + ch;
    end;
    // pass the emoji-string to the memo
    Memo1.Lines.Text := extracted;
  finally
    emojiList.Free;
  end;
end;

Martin_fr

Administrator
Hero Member
Posts: 9913
Debugger - SynEdit - and more

Re: Copying unicode characters from place to another.

« Reply #4 on: March 31, 2021, 09:15:50 pm »

Well, first of all, in UTF-8, emoji and many other chars take more than one byte.

However, "SomeString[ i ]" returns one byte (not one char, not one codepoint => one BYTE).
The exception, of course, is widestring/unicodestring: these are 16-bit strings, so indexing returns one WORD (2 bytes).

If you use Utf8Copy instead, it gives you codepoints (not chars).
It is similar with UTF-16: each 16-bit word is (usually) a codepoint (yet again, not a char); codepoints above U+FFFF take two words (a surrogate pair).
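
To make the byte-versus-codepoint difference concrete, here is a minimal sketch that uses only the FPC RTL. The helpers Utf8SeqLen and FirstCodepointOf are hypothetical names made up for this illustration (in a Lazarus project, UTF8Copy(s, 1, 1) from LazUTF8 does the same job), and the sketch assumes well-formed UTF-8 input.

```pascal
program FirstCodepoint;
{$mode objfpc}{$H+}

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with the given lead byte. Malformed lead bytes are treated as one byte.
function Utf8SeqLen(Lead: Byte): Integer;
begin
  if Lead < $80 then Result := 1                  // 0xxxxxxx: ASCII
  else if (Lead and $E0) = $C0 then Result := 2   // 110xxxxx
  else if (Lead and $F0) = $E0 then Result := 3   // 1110xxxx
  else if (Lead and $F8) = $F0 then Result := 4   // 11110xxx
  else Result := 1;                               // stray continuation byte
end;

// Hypothetical helper: the first codepoint of S, still encoded as UTF-8.
function FirstCodepointOf(const S: String): String;
begin
  if S = '' then
    Result := ''
  else
    Result := Copy(S, 1, Utf8SeqLen(Ord(S[1])));
end;

var
  Line, First: String;
begin
  // U+1F600 (grinning face) written as its four UTF-8 bytes, then ASCII text.
  Line := #$F0#$9F#$98#$80' grinning face';
  WriteLn(Ord(Line[1]));      // 240: indexing yields only the lead byte $F0
  First := FirstCodepointOf(Line);
  WriteLn(Length(First));     // 4: Length counts bytes, and this codepoint has four
end.
```

A single byte such as $F0 is not valid UTF-8 on its own, which would explain why ShowMessage displays nothing for emojis.Strings[0][1].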

All UTF-n are "Unicode Transformation Formats" => a way to e.g. put Unicode into memory...

Unicode (in any encoding, even UTF-32) is split into codepoints. Some codepoints are "combining": they modify other codepoints and, together with them, form a different char.

So when you get a codepoint, you need to check whether it is followed by one or more combining codepoints. That is how you get your "char".

Note that on top of that, a font can decide to display the exact same char differently depending on which other chars are next to it (e.g. script fonts (Arabic) or ligatures).
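
The check for trailing combining codepoints can be sketched roughly as follows, again with the FPC RTL only. All helper names (Utf8SeqLen, CodepointAt, IsExtender, FirstCharacterOf) are hypothetical, and the list of "extending" codepoints is deliberately incomplete: it is an assumption chosen for illustration, not the full Unicode segmentation rules, and it ignores zero-width-joiner sequences. The test string assumes that the skull and crossbones entry is U+2620 followed by the variation selector U+FE0F, which would explain why copying two codepoints made it look right.

```pascal
program FirstCharacter;
{$mode objfpc}{$H+}

// Hypothetical helper: byte length of the UTF-8 sequence starting at Lead.
function Utf8SeqLen(Lead: Byte): Integer;
begin
  if (Lead and $E0) = $C0 then Result := 2
  else if (Lead and $F0) = $E0 then Result := 3
  else if (Lead and $F8) = $F0 then Result := 4
  else Result := 1;  // ASCII, or a malformed byte treated as one unit
end;

// Hypothetical helper: decode the codepoint that starts at byte index P.
// Assumes well-formed UTF-8; malformed input gives a wrong value, not an error.
function CodepointAt(const S: String; P: Integer): Cardinal;
var
  Len, I: Integer;
begin
  Len := Utf8SeqLen(Ord(S[P]));
  case Len of
    1: Result := Ord(S[P]);
    2: Result := Ord(S[P]) and $1F;
    3: Result := Ord(S[P]) and $0F;
  else
    Result := Ord(S[P]) and $07;
  end;
  // Each continuation byte contributes its low six bits.
  for I := 1 to Len - 1 do
    if P + I <= Length(S) then
      Result := (Result shl 6) or (Ord(S[P + I]) and $3F);
end;

// Deliberately incomplete: a few ranges that attach to the previous codepoint.
function IsExtender(CP: Cardinal): Boolean;
begin
  Result := ((CP >= $0300) and (CP <= $036F))      // combining diacritical marks
         or ((CP >= $FE00) and (CP <= $FE0F))      // variation selectors
         or ((CP >= $1F3FB) and (CP <= $1F3FF));   // emoji skin-tone modifiers
end;

// First codepoint of S plus any extenders that directly follow it.
function FirstCharacterOf(const S: String): String;
var
  P: Integer;
begin
  Result := '';
  if S = '' then Exit;
  P := 1 + Utf8SeqLen(Ord(S[1]));                  // skip the base codepoint
  while (P <= Length(S)) and IsExtender(CodepointAt(S, P)) do
    Inc(P, Utf8SeqLen(Ord(S[P])));
  Result := Copy(S, 1, P - 1);
end;

var
  Line: String;
begin
  // U+2620 (E2 98 A0) + U+FE0F (EF B8 8F), then the ASCII description.
  Line := #$E2#$98#$A0#$EF#$B8#$8F' skull and crossbones';
  WriteLn(Length(FirstCharacterOf(Line)));         // 6 bytes: two codepoints
end.
```

For a file like this one, where every emoji is followed by a space, copying up to the first space avoids the classification problem entirely.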

