Iterating a UTF8 string from the end

fedkad

Full Member
Posts: 176

« on: February 16, 2018, 04:44:57 pm »

An efficient way to iterate codepoints in a UTF8 string (in forward direction) is this:

procedure IterateUTF8(S: String);
var
  CurP, EndP: PChar;
  Len: Integer;
  ACodePoint: String;
begin
  CurP := PChar(S);        // if S='' then PChar(S) returns a pointer to #0
  EndP := CurP + length(S);
  while CurP < EndP do
  begin
    Len := UTF8CodepointSize(CurP);
    SetLength(ACodePoint, Len);
    Move(CurP^, ACodePoint[1], Len);
    // A single codepoint is copied from the string. Do your thing with it.
    ShowMessageFmt('CodePoint=%s, Len=%d', [ACodePoint, Len]);
    // ...
    inc(CurP, Len);
  end;
end;

This is taken from http://wiki.freepascal.org/UTF8_strings_and_characters#Iterating_over_string_analysing_individual_codepoints

Is there an efficient way to iterate codepoints from the end of UTF8 strings, i.e. in reverse order? I need an efficient way to match the ends of two UTF8 strings like this:

Code: Pascal [Select][+]

var
  j1, j2 : SizeInt;
  s1, s2 : String;
[...]
  j1 := utf8length(s1);
  j2 := utf8length(s2);
  while (j1>=1) and (j2>=1) do
    if utf8copy(s1,j1,1)<>utf8copy(s2,j2,1)
      then break
      else begin dec(j1); dec(j2) end;
[...]
 

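As you will notice, the above algorithm is very inefficient: every utf8copy call has to scan the string from its start to find the requested codepoint, so the loop is quadratic in the string length. Any suggestions?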

Lazarus 2.2.6 / FPC 3.2.2 on x86_64-linux-gtk2 (Ubuntu/GNOME) and x86_64-win64-win32/win64 (Windows 11)

howardpc

Hero Member
Posts: 4144

Re: Iterating a UTF8 string from the end

« Reply #1 on: February 16, 2018, 06:02:26 pm »

I have not timed anything here, but the following may be more efficient than using utf8Copy, since it does not use any utf8-specific functions.
The function returns the matching end part of the two strings passed as parameters, or an empty string if nothing at the ends of the strings matches.

Code: Pascal [Select][+]

function TForm1.MatchEnds(const aS1, aS2: String): String;
var
  longer, shorter: Integer;
  long, short: String;
  match: Boolean = True;
begin
  if Length(aS1) > Length(aS2) then begin
    longer:=Length(aS1);
    long:=aS1;
    shorter:=Length(aS2);
    short:=aS2;
  end
  else begin
    longer:=Length(aS2);
    long:=aS2;
    shorter:=Length(aS1);
    short:=aS1;
  end;
 
  while match and (shorter > 1) do begin
    match:=short[shorter] = long[longer];
    Dec(shorter);
    Dec(longer);
  end;
  Exit(Copy(short, shorter, Length(short)));
end;

fedkad

Full Member
Posts: 176

Re: Iterating a UTF8 string from the end

« Reply #2 on: February 16, 2018, 07:03:21 pm »

Sorry howardpc, but your code is wrong in many ways. Please, test your code before posting.

Also, please test your code with the following two strings before posting:
abcαβγδ defϱβγδ
Thank you.
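
What makes this pair awkward for byte-wise matching is that α (U+03B1) is CE B1 in UTF-8 while ϱ (U+03F1) is CF B1: the two codepoints differ only in their first byte. The following is a minimal sketch, assuming plain FPC without LazUTF8, that dumps the bytes and counts the byte-wise common suffix. HexBytes and CommonSuffixBytes are hypothetical helper names, and the literals are written as byte escapes so the result does not depend on the encoding of the source file.

```pascal
program ByteDump;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Hypothetical helper: hex dump of the raw bytes of S.
function HexBytes(const S: String): String;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
    Result := Result + IntToHex(Ord(S[i]), 2) + ' ';
end;

// Hypothetical helper: number of identical bytes at the ends of A and B.
function CommonSuffixBytes(const A, B: String): Integer;
begin
  Result := 0;
  while (Result < Length(A)) and (Result < Length(B)) and
        (A[Length(A) - Result] = B[Length(B) - Result]) do
    Inc(Result);
end;

const
  S1 = 'abc'#$CE#$B1#$CE#$B2#$CE#$B3#$CE#$B4;  // abcαβγδ
  S2 = 'def'#$CF#$B1#$CE#$B2#$CE#$B3#$CE#$B4;  // defϱβγδ

begin
  WriteLn(HexBytes(S1));
  WriteLn(HexBytes(S2));
  WriteLn('common suffix bytes: ', CommonSuffixBytes(S1, S2));
end.
```

The count is 7, an odd number, although every matching codepoint here is two bytes long. The byte-wise match stops in the middle of α/ϱ, so a result cut at that point would begin with a stray B1 byte.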

wp

Hero Member
Posts: 11916

Re: Iterating a UTF8 string from the end

« Reply #3 on: February 16, 2018, 07:21:23 pm »

Quote from: fedkad on February 16, 2018, 07:03:21 pm

Please, test your code before posting.

No, you should not expect this. People helping here do this for fun, not for money. Nobody is obliged to give correct answers; we're only human, and sometimes the person answering a question just does not have the time, or is not willing, to write a test program. But everybody is doing the best they can. Take what you can get (you don't have to pay!) and don't complain about a wrong answer. Sometimes even incorrect code can give you an idea of how to solve an issue.

howardpc

Hero Member
Posts: 4144

Re: Iterating a UTF8 string from the end

« Reply #4 on: February 16, 2018, 09:12:04 pm »

Here's a better attempt (not perfect, I'm sure).
It does at least give correct results on the two strings fedkad supplied.
I can't guarantee that there are no other string pairs on which it fails to give a correct answer.

Code: Pascal [Select][+]

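function TForm1.MatchEnds(const aS1, aS2: String): String;
var
  pc1, pc2, start1: PChar;
  minLen: Integer;
begin
  Result := '';
  minLen := Length(aS1);
  if Length(aS2) < minLen then
    minLen := Length(aS2);
  if minLen = 0 then
    Exit;
  start1 := PChar(aS1);
  // Point at the last byte of each string (not at the terminating #0)
  pc1 := start1 + Length(aS1) - 1;
  pc2 := PChar(aS2) + Length(aS2) - 1;
  // Walk backwards while the bytes match
  while (minLen > 0) and (pc1^ = pc2^) do
  begin
    Dec(minLen);
    Dec(pc1);
    Dec(pc2);
  end;
  // pc1 is now one byte before the byte-wise match
  Inc(pc1);
  // Skip continuation bytes (%10xxxxxx) so the result starts on a codepoint boundary
  while (pc1 < start1 + Length(aS1)) and (Ord(pc1^) and $C0 = $80) do
    Inc(pc1);
  Result := StrPas(pc1);
end;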

RayoGlauco

Full Member
Posts: 179
Beers: 1567

Re: Iterating a UTF8 string from the end

« Reply #5 on: February 16, 2018, 11:59:46 pm »

I think you can just compare the raw bytes until you find a difference:

Code: Pascal [Select][+]

var
  j1, j2 : SizeInt;
  s1, s2 : String;
[...]
  j1 := length(s1);
  j2 := length(s2);
  while (j1>=1) and (j2>=1) do
    if s1[j1]<>s2[j2]
      then break
      else begin dec(j1); dec(j2) end;
...

Finally, once you find the first difference, you can work out where that codepoint starts thanks to the structure of UTF8 (see http://wiki.freepascal.org/UTF8_strings_and_characters: you can always find the start of a multi-byte codepoint, even if you jumped to a random byte position).

If the different byte is <128, that's a simple ASCII char; else you must find the first byte >191 (its binary encoding is 11xxxxxx), because all the multi-byte codepoints start with a byte > 191 and have no other byte > 191.
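
The rule above can be sketched as a backwards iteration over codepoints. This is only an illustration, assuming valid UTF-8 input and plain FPC without LazUTF8; PrevCodepointStart and ReverseCodepoints are hypothetical names, not library functions. The test `and $C0 = $80` selects exactly the bytes 128..191, i.e. the ones that are neither ASCII nor > 191.

```pascal
program ReverseIterate;
{$mode objfpc}{$H+}

// Hypothetical helper: P points just past the end of a codepoint.
// Returns the first byte of that codepoint, never going below StartP.
function PrevCodepointStart(StartP, P: PChar): PChar;
begin
  Result := P;
  if Result > StartP then
    Dec(Result);
  // Bytes 128..191 (%10xxxxxx) are continuation bytes: keep stepping back.
  while (Result > StartP) and (Ord(Result^) and $C0 = $80) do
    Dec(Result);
end;

// Visits the codepoints of S from the end and returns them in reverse order.
function ReverseCodepoints(const S: String): String;
var
  StartP, CurP, CpStart: PChar;
  ACodePoint: String;
begin
  Result := '';
  StartP := PChar(S);
  CurP := StartP + Length(S);          // one past the last byte
  while CurP > StartP do
  begin
    CpStart := PrevCodepointStart(StartP, CurP);
    SetString(ACodePoint, CpStart, CurP - CpStart);
    // A single codepoint is copied from the string. Do your thing with it.
    Result := Result + ACodePoint;
    CurP := CpStart;
  end;
end;

begin
  // 'abc' followed by α (CE B1) and β (CE B2), written as byte escapes.
  // What the console shows depends on the terminal encoding.
  WriteLn(ReverseCodepoints('abc'#$CE#$B1#$CE#$B2));
end.
```

Each step costs at most four byte reads, so the whole pass is linear in the byte length of the string.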

« Last Edit: February 17, 2018, 01:54:08 am by RayoGlauco »

To err is human, but to really mess things up, you need a computer.

JuhaManninen

Global Moderator
Hero Member
Posts: 4468
I like bugs.

Re: Iterating a UTF8 string from the end

« Reply #6 on: February 17, 2018, 09:47:11 am »

Quote from: RayoGlauco on February 16, 2018, 11:59:46 pm

If the different byte is <128, that's a simple ASCII char; else you must find the first byte >191 (its binary encoding is 11xxxxxx), because all the multi-byte codepoints start with a byte > 191 and have no other byte > 191.

Yes. User "tomitomy" created clever code for it. ... Found it:
http://forum.lazarus.freepascal.org/index.php/topic,38872.msg265435.html#msg265435
Should we add a library function based on that code? I am not sure it is general purpose enough.

Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

wp

Hero Member
Posts: 11916

Re: Iterating a UTF8 string from the end

« Reply #7 on: February 17, 2018, 01:04:23 pm »

This one seems to work (see also attached test program), no guarantee however:

Code: Pascal [Select][+]

function UTF8MatchRev(const s1, s2: String): String;
var
  p1, p10: PChar;
  p2, p20: PChar;
begin
  Result := '';
  if s1 = '' then exit;
  if s2 = '' then exit;
  p10 := PChar(@s1[1]);
  p20 := PChar(@s2[1]);
  p1 := PChar(@s1[Length(s1)]);
  p2 := PChar(@s2[length(s2)]);
  // Find first different byte, coming from string end
  while true do begin
    if p1^ = p2^ then begin
      if p1 = p10 then break;
      if p2 = p20 then break;
      dec(p1);
      dec(p2);
    end else begin
      inc(p1);
      break;
    end;
  end;
  // Skip follower bytes (%10xxxxxx) so the result starts on a codepoint boundary
  while (p1 < p10 + Length(s1)) and (ord(p1^) shr 6 = 2) do
    inc(p1);
  Result := StrPas(p1);
end;

Coming from the end, it scans the strings for the first differing byte. It must then check whether the result begins in the middle of a UTF-8 codepoint: the trailing byte(s) of a codepoint may match while its first byte differs, in which case the leftover trailing bytes have to be cut off. It recognizes these "follower" bytes (those after the first byte of a codepoint) by their signature %10xxxxxx (see http://wiki.freepascal.org/UTF8_strings_and_characters).

MatchingStringsReverse.zip (2.49 kB - downloaded 88 times.)

fedkad

Full Member
Posts: 176

Re: Iterating a UTF8 string from the end

« Reply #8 on: February 17, 2018, 03:07:33 pm »

Thank you all for your comments.

My task was to make a simple (single) diff of two very similar but long UTF-8 strings: print the shortest (single) string that needs to be deleted from the first string, plus the (single) string that needs to be inserted in its place, so that the first string becomes the same as the second. For ANSI strings (where one character is one byte) this is simple:

Code: Pascal [Select][+]

procedure TForm1.btnDiffANSIClick(Sender: TObject);
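var
  text1, text2 : String;
  str1, str2, s1, s2, e1, e2 : PChar;
  del_str, ins_str : String;
  pos : SizeInt;
begin
  // Keep the texts in local variables so the PChars below stay valid
  text1 := memo1.Text;
  text2 := memo2.Text;
  str1 := PChar(text1);
  str2 := PChar(text2);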
  s1 := str1;
  s2 := str2;
  e1 := s1+length(s1)-1;
  e2 := s2+length(s2)-1;
  while (s1<=e1) and (s2<=e2) do
  begin
    if s1^<>s2^
      then break;
    inc(s1); inc(s2)
  end;
  while (s1<=e1) and (s2<=e2) do
  begin
    if e1^<>e2^
      then break;
    dec(e1); dec(e2)
  end;
  del_str := leftstr(s1,e1-s1+1);
  ins_str := leftstr(s2,e2-s2+1);
  pos := s1-str1;
  memo_out.text :=
  'at position: '+pos.toString+lineending+
  'delete: <'+del_str+'>:'+length(del_str).ToString+lineending+
  'insert: <'+ins_str+'>:'+length(ins_str).ToString;
end;  // TForm1.btnDiffANSIClick

But for UTF-8 strings it is not so simple. With your help, I think I came up with a rather simple modification of the above to achieve this:

Code: Pascal [Select][+]

procedure TForm1.btnDiffUTF8Click(Sender: TObject);
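var
  text1, text2 : String;
  str1, str2, s1, s2, e1, e2 : PChar;
  del_str, ins_str : String;
  pos : SizeInt;
  cplen1, cplen2 : Integer;
begin
  // Keep the texts in local variables so the PChars below stay valid
  text1 := memo1.Text;
  text2 := memo2.Text;
  str1 := PChar(text1);
  str2 := PChar(text2);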
  s1 := str1;
  s2 := str2;
  e1 := s1+length(s1)-1;
  e2 := s2+length(s2)-1;
  while (s1<=e1) and (s2<=e2) do
  begin
    if s1^<>s2^
      then break;
    inc(s1); inc(s2)
  end;
  Utf8TryFindCodepointStart(str1,s1,cplen1);
  Utf8TryFindCodepointStart(str2,s2,cplen2);
  while (s1<=e1) and (s2<=e2) do
  begin
    if e1^<>e2^
      then break;
    dec(e1); dec(e2)
  end;
  Utf8TryFindCodepointStart(str1,e1,cplen1);
  Utf8TryFindCodepointStart(str2,e2,cplen2);
  del_str := leftstr(s1,e1-s1+cplen1);
  ins_str := leftstr(s2,e2-s2+cplen2);
  pos := UTF8Length(str1,s1-str1);
  memo_out.text :=
  'at position: '+pos.toString+lineending+
  'delete: <'+del_str+'>:'+utf8length(del_str).ToString+lineending+
  'insert: <'+ins_str+'>:'+utf8length(ins_str).ToString;
end;  // TForm1.btnDiffUTF8Click

By using the LazUTF8 utility Utf8TryFindCodepointStart, I was able to avoid writing any special UTF-8 handling myself. The algorithm given above performs far better than the one in my initial post (which relies too much on Utf8Copy).
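
For trying the idea outside a form, here is a GUI-free variant as a rough sketch. It is not the code above: it assumes valid UTF-8 input, uses end pointers that sit one past the differing middle, and replaces Utf8TryFindCodepointStart with two hypothetical helpers (SnapBack, SnapForward) so that it compiles with plain FPC. SingleDiff is likewise a made-up name, and it reports the prefix length in bytes rather than in codepoints.

```pascal
program SingleDiffUTF8;
{$mode objfpc}{$H+}

// Hypothetical helper: move P back to the first byte of its codepoint.
procedure SnapBack(StartP: PChar; var P: PChar);
begin
  while (P > StartP) and (Ord(P^) and $C0 = $80) do
    Dec(P);
end;

// Hypothetical helper: move P forward to the next codepoint start.
procedure SnapForward(EndP: PChar; var P: PChar);
begin
  while (P < EndP) and (Ord(P^) and $C0 = $80) do
    Inc(P);
end;

procedure SingleDiff(const A, B: String; out PrefixBytes: SizeInt;
  out DelStr, InsStr: String);
var
  a0, b0, aEnd, bEnd, s1, s2, e1, e2: PChar;
begin
  a0 := PChar(A);
  b0 := PChar(B);
  aEnd := a0 + Length(A);
  bEnd := b0 + Length(B);
  s1 := a0;
  s2 := b0;
  // Common prefix, byte by byte
  while (s1 < aEnd) and (s2 < bEnd) and (s1^ = s2^) do
  begin
    Inc(s1);
    Inc(s2);
  end;
  SnapBack(a0, s1);
  SnapBack(b0, s2);
  // Common suffix, byte by byte, without running into the prefix
  e1 := aEnd;
  e2 := bEnd;
  while (e1 > s1) and (e2 > s2) and ((e1 - 1)^ = (e2 - 1)^) do
  begin
    Dec(e1);
    Dec(e2);
  end;
  // The suffix must not start with continuation bytes
  SnapForward(aEnd, e1);
  SnapForward(bEnd, e2);
  PrefixBytes := s1 - a0;
  SetString(DelStr, s1, e1 - s1);
  SetString(InsStr, s2, e2 - s2);
end;

var
  P: SizeInt;
  D, I: String;
begin
  // abcαβ versus defϱβ, written as byte escapes
  SingleDiff('abc'#$CE#$B1#$CE#$B2, 'def'#$CF#$B1#$CE#$B2, P, D, I);
  WriteLn('prefix bytes: ', P, ', delete ', Length(D),
          ' bytes, insert ', Length(I), ' bytes');
end.
```

Because both strings share the same bytes on either side of the difference, the two snap steps move by the same amount in each string, which keeps the deleted and inserted parts aligned on codepoint boundaries.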

« Last Edit: February 17, 2018, 03:16:35 pm by fedkad »
