Recent

Author Topic: Iterating a UTF8 string from the end  (Read 2987 times)

fedkad

  • Full Member
  • ***
  • Posts: 176
Iterating a UTF8 string from the end
« on: February 16, 2018, 04:44:57 pm »
An efficient way to iterate codepoints in a UTF8 string (in forward direction) is this:

Code: Pascal  [Select][+][-]
  1. procedure IterateUTF8(S: String);
  2. var
  3.   CurP, EndP: PChar;
  4.   Len: Integer;
  5.   ACodePoint: String;
  6. begin
  7.   CurP := PChar(S);        // if S='' then PChar(S) returns a pointer to #0
  8.   EndP := CurP + length(S);
  9.   while CurP < EndP do
  10.   begin
  11.     Len := UTF8CodepointSize(CurP);
  12.     SetLength(ACodePoint, Len);
  13.     Move(CurP^, ACodePoint[1], Len);
  14.     // A single codepoint is copied from the string. Do your thing with it.
  15.     ShowMessageFmt('CodePoint=%s, Len=%d', [ACodePoint, Len]);
  16.     // ...
  17.     inc(CurP, Len);
  18.   end;
  19. end;

This is taken from http://wiki.freepascal.org/UTF8_strings_and_characters#Iterating_over_string_analysing_individual_codepoints


Is there an efficient way to iterate codepoints from the end of UTF8 strings, i.e. in reverse order? I need an efficient way to match the ends of two UTF8 strings like this:

Code: Pascal  [Select][+][-]
  1. var
  2.   j1, j2 : SizeInt;
  3.   s1, s2 : String;
  4. [...]
  5.   j1 := utf8length(s1);
  6.   j2 := utf8length(s2);
  7.   while (j1>=1) and (j2>=1) do
  8.     if utf8copy(s1,j1,1)<>utf8copy(s2,j2,1)
  9.       then break
  10.       else begin dec(j1); dec(j2) end;
  11. [...]
  12.  

As you will notice the above algorithm is very inefficient. Any suggestions?
Lazarus 2.2.6 / FPC 3.2.2 on x86_64-linux-gtk2 (Ubuntu/GNOME) and x86_64-win64-win32/win64 (Windows 11)

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: Iterating a UTF8 string from the end
« Reply #1 on: February 16, 2018, 06:02:26 pm »
I have not timed anything here, but the following may be more efficient than using utf8Copy, since it does not use any utf8-specific functions.
The function returns the matching end part of the two strings passed as parameters, or an empty string if nothing at the ends of the strings matches.

Code: Pascal  [Select][+][-]
  1. function TForm1.MatchEnds(const aS1, aS2: String): String;
  2. var
  3.   longer, shorter: Integer;
  4.   long, short: String;
  5.   match: Boolean = True;
  6. begin
  7.   if Length(aS1) > Length(aS2) then begin
  8.     longer:=Length(aS1);
  9.     long:=aS1;
  10.     shorter:=Length(aS2);
  11.     short:=aS2;
  12.   end
  13.   else begin
  14.     longer:=Length(aS2);
  15.     long:=aS2;
  16.     shorter:=Length(aS1);
  17.     short:=aS1;
  18.   end;
  19.  
  20.   while match and (shorter > 1) do begin
  21.     match:=short[shorter] = long[longer];
  22.     Dec(shorter);
  23.     Dec(longer);
  24.   end;
  25.   Exit(Copy(short, shorter, Length(short)));
  26. end;

fedkad

  • Full Member
  • ***
  • Posts: 176
Re: Iterating a UTF8 string from the end
« Reply #2 on: February 16, 2018, 07:03:21 pm »
Sorry howardpc, but your code is wrong in many ways. Please, test your code before posting.

Also, please test your code with the following two strings before posting:

abcαβγδ
defϱβγδ

Thank you.
Lazarus 2.2.6 / FPC 3.2.2 on x86_64-linux-gtk2 (Ubuntu/GNOME) and x86_64-win64-win32/win64 (Windows 11)

wp

  • Hero Member
  • *****
  • Posts: 11916
Re: Iterating a UTF8 string from the end
« Reply #3 on: February 16, 2018, 07:21:23 pm »
Please, test your code before posting.
No, you should not expect this. People helping here do do this for fun, not for money. Nobody is obliged to give correct answers, we're only human, and sometimes the person who answers a question just does not have the time or is willing to write a test program. But everybody's doing the best he can. Take what you can get (you don't have to pay!), don't complain about a false anwer. Sometimes even incorrect code can give you the idea how to solve an issue.

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: Iterating a UTF8 string from the end
« Reply #4 on: February 16, 2018, 09:12:04 pm »
Here's a better attempt (not perfect, I'm sure).
It does at least give correct results on the two strings fedkad supplied.
I don't guarantee there are other string pairs on which it fails to give a correct answer.

Code: Pascal  [Select][+][-]
  1. function TForm1.MatchEnds(const aS1, aS2: String): String;
  2. var
  3.   pc1: PChar;
  4.   pc2: PChar;
  5.   minLen: Integer;
  6.   match: Boolean;
  7. begin
  8.   minLen := Length(aS1);
  9.   if Length(aS2) < minLen then
  10.     minLen := Length(aS2);
  11.   pc1 := PChar(aS1);
  12.   pc2 := PChar(as2);
  13.   Inc(pc1, Length(aS1));
  14.   Inc(pc2, Length(pc2));
  15.   repeat
  16.     match := pc1^ = pc2^;
  17.     Dec(minLen);
  18.     Dec(pc1);
  19.     Dec(pc2);
  20.   until not Match or (minLen <= 0);
  21.   case minLen = 0 of
  22.     True: Exit(aS1);
  23.     False: begin
  24.              Inc(pc1);
  25.              Exit(StrPas(pc1 + UTF8CharacterLengthFast(pc1)));
  26.            end;
  27.   end;
  28. end;

RayoGlauco

  • Full Member
  • ***
  • Posts: 179
  • Beers: 1567
Re: Iterating a UTF8 string from the end
« Reply #5 on: February 16, 2018, 11:59:46 pm »
I think you can just compare ASCII chars until you find a difference:

Code: Pascal  [Select][+][-]
  1. var
  2.   j1, j2 : SizeInt;
  3.   s1, s2 : String;
  4. [...]
  5.   j1 := length(s1);
  6.   j2 := length(s2);
  7.   while (j1>=1) and (j2>=1) do
  8.     if s1[j1]<>s2[j2]
  9.       then break
  10.       else begin dec(j1); dec(j2) end;
  11. ...

Finally, when you find the first difference, you can know where that codepoint starts, due to the structure of UTF8 (see http://wiki.freepascal.org/UTF8_strings_and_characters : You can always find the start of a multi-byte codepoint even if you jumped to a random byte position).

If the different byte is <128, that's a simple ASCII char; else you must find the first byte >191 (its binary encoding is 11xxxxxx), because all the multi-byte codepoints start with a byte > 191 and have no other byte > 191.
« Last Edit: February 17, 2018, 01:54:08 am by RayoGlauco »
To err is human, but to really mess things up, you need a computer.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4468
  • I like bugs.
Re: Iterating a UTF8 string from the end
« Reply #6 on: February 17, 2018, 09:47:11 am »
If the different byte is <128, that's a simple ASCII char; else you must find the first byte >191 (its binary encoding is 11xxxxxx), because all the multi-byte codepoints start with a byte > 191 and have no other byte > 191.
Yes. User "tomitomy" created clever code for it. ... Found it:
 http://forum.lazarus.freepascal.org/index.php/topic,38872.msg265435.html#msg265435
Should we add a library function based on that code? I am not sure it is general purpose enough.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

wp

  • Hero Member
  • *****
  • Posts: 11916
Re: Iterating a UTF8 string from the end
« Reply #7 on: February 17, 2018, 01:04:23 pm »
This one seems to work (see also attached test program), no guarantee however:
Code: Pascal  [Select][+][-]
  1. function UTF8MatchRev(const s1, s2: String): String;
  2. var
  3.   p1, p10: PChar;
  4.   p2, p20: PChar;
  5.   b: Byte;
  6. begin
  7.   Result := '';
  8.   if s1 = '' then exit;
  9.   if s2 = '' then exit;
  10.   p10 := PChar(@s1[1]);
  11.   p20 := PChar(@s2[1]);
  12.   p1 := PChar(@s1[Length(s1)]);
  13.   p2 := PChar(@s2[length(s2)]);
  14.   // Find first different byte, coming from string end
  15.   while true do begin
  16.     if p1^ = p2^ then begin
  17.       if p1 = p10 then break;
  18.       if p2 = p20 then break;
  19.       dec(p1);
  20.       dec(p2);
  21.     end else begin
  22.       inc(p1);
  23.       break;
  24.     end;
  25.   end;
  26.   // Check for invalid UTF8 codepoint
  27.   b := ord(p1^) shr 6;
  28.   while (p1 < p10 + Length(s1)) and (ord(p1^) shr 6 = 2) do
  29.     inc(p1);
  30.   Result := StrPas(p1);
  31. end;

Coming from the end it scans the string for the first different byte. Then it must check whether the result string begins with an invalid utf8 codepoint because the ending codepoint byte(s) may match, but the first bytes may be different and thus will be cut off. It recognizes the "follower" bytes (those after the 1st codepoint byte) by their signature %10xxxxxx - see http://wiki.freepascal.org/UTF8_strings_and_characters).

fedkad

  • Full Member
  • ***
  • Posts: 176
Re: Iterating a UTF8 string from the end
« Reply #8 on: February 17, 2018, 03:07:33 pm »
Thank you all for your comments.

My task was to make a simple (single) diff of two very similar, but long UTF-8 strings and print out the minimum size of (single) string that need to be deleted from, plus a (single) string that need to be added to the first string, so that it becomes the same as the second string. For ANSI strings (where one character is one byte) this is simple:
Code: Pascal  [Select][+][-]
  1. procedure TForm1.btnDiffANSIClick(Sender: TObject);
  2. var
  3.   str1, str2, s1, s2, e1, e2 : PChar;
  4.   del_str, ins_str : String;
  5.   pos : SizeInt;
  6. begin
  7.   str1 := PChar(memo1.Text);
  8.   str2 := PChar(memo2.Text);
  9.   s1 := str1;
  10.   s2 := str2;
  11.   e1 := s1+length(s1)-1;
  12.   e2 := s2+length(s2)-1;
  13.   while (s1<=e1) and (s2<=e2) do
  14.   begin
  15.     if s1^<>s2^
  16.       then break;
  17.     inc(s1); inc(s2)
  18.   end;
  19.   while (s1<=e1) and (s2<=e2) do
  20.   begin
  21.     if e1^<>e2^
  22.       then break;
  23.     dec(e1); dec(e2)
  24.   end;
  25.   del_str := leftstr(s1,e1-s1+1);
  26.   ins_str := leftstr(s2,e2-s2+1);
  27.   pos := s1-str1;
  28.   memo_out.text :=
  29.   'at position: '+pos.toString+lineending+
  30.   'delete: <'+del_str+'>:'+length(del_str).ToString+lineending+
  31.   'insert: <'+ins_str+'>:'+length(ins_str).ToString;
  32. end;  // TForm1.btnDiffANSIClick

But for UTF-8 strings, it is not so simple. With your help, I think a came to a rather simple modification of the above to achieve this:

Code: Pascal  [Select][+][-]
  1. procedure TForm1.btnDiffUTF8Click(Sender: TObject);
  2. var
  3.   str1, str2, s1, s2, e1, e2 : PChar;
  4.   del_str, ins_str : String;
  5.   pos : SizeInt;
  6.   cplen1, cplen2 : Integer;
  7. begin
  8.   str1 := PChar(memo1.Text);
  9.   str2 := PChar(memo2.Text);
  10.   s1 := str1;
  11.   s2 := str2;
  12.   e1 := s1+length(s1)-1;
  13.   e2 := s2+length(s2)-1;
  14.   while (s1<=e1) and (s2<=e2) do
  15.   begin
  16.     if s1^<>s2^
  17.       then break;
  18.     inc(s1); inc(s2)
  19.   end;
  20.   Utf8TryFindCodepointStart(str1,s1,cplen1);
  21.   Utf8TryFindCodepointStart(str2,s2,cplen2);
  22.   while (s1<=e1) and (s2<=e2) do
  23.   begin
  24.     if e1^<>e2^
  25.       then break;
  26.     dec(e1); dec(e2)
  27.   end;
  28.   Utf8TryFindCodepointStart(str1,e1,cplen1);
  29.   Utf8TryFindCodepointStart(str2,e2,cplen2);
  30.   del_str := leftstr(s1,e1-s1+cplen1);
  31.   ins_str := leftstr(s2,e2-s2+cplen2);
  32.   pos := UTF8Length(str1,s1-str1);
  33.   memo_out.text :=
  34.   'at position: '+pos.toString+lineending+
  35.   'delete: <'+del_str+'>:'+utf8length(del_str).ToString+lineending+
  36.   'insert: <'+ins_str+'>:'+utf8length(ins_str).ToString;
  37. end;  // TForm1.btnDiffUTF8Click

By using the lazUTF8 utility Utf8TryFindCodepointStart, I was able not to mess up with special UTF-8 handling. The algorithm given above has far more better performance compared to the one given in my initial post (which relies too much on Utf8Copy).
« Last Edit: February 17, 2018, 03:16:35 pm by fedkad »
Lazarus 2.2.6 / FPC 3.2.2 on x86_64-linux-gtk2 (Ubuntu/GNOME) and x86_64-win64-win32/win64 (Windows 11)

 

TinyPortal © 2005-2018