Thank you all for your comments.
My task was to make a simple (single) diff of two very similar, but long UTF-8 strings and print out the minimum-size (single) string that needs to be deleted from, plus the (single) string that needs to be added to, the first string so that it becomes the same as the second string. For ANSI strings (where one character is one byte) this is simple:
procedure TForm1.btnDiffANSIClick(Sender: TObject);
var
  txt1, txt2 : String;
  str1, str2, s1, s2, e1, e2 : PChar;
  del_str, ins_str : String;
  pos : SizeInt;
begin
  // Keep the memo texts in local variables so the PChar pointers stay valid.
  txt1 := memo1.Text;
  txt2 := memo2.Text;
  str1 := PChar(txt1);
  str2 := PChar(txt2);
  s1 := str1;
  s2 := str2;
  e1 := s1+length(txt1)-1;
  e2 := s2+length(txt2)-1;
  // Skip the common prefix.
  while (s1<=e1) and (s2<=e2) do
  begin
    if s1^<>s2^
      then break;
    inc(s1); inc(s2)
  end;
  // Skip the common suffix without running back into the prefix.
  while (s1<=e1) and (s2<=e2) do
  begin
    if e1^<>e2^
      then break;
    dec(e1); dec(e2)
  end;
  // What remains between s and e is the part that differs.
  del_str := leftstr(s1,e1-s1+1);
  ins_str := leftstr(s2,e2-s2+1);
  pos := s1-str1;
  memo_out.text :=
    'at position: '+pos.toString+lineending+
    'delete: <'+del_str+'>:'+length(del_str).ToString+lineending+
    'insert: <'+ins_str+'>:'+length(ins_str).ToString;
end; // TForm1.btnDiffANSIClick
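The one-byte-per-character assumption is what the byte-wise scan depends on. As a small illustration of what happens when that assumption does not hold, here is a minimal console sketch. The helper names CommonPrefixBytes and IsContinuationByte are made up for this example (they are not LazUTF8 routines), and the test strings are written as raw bytes on the assumption that no {$codepage} directive is active.

```pascal
program BytewiseDiffDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: number of leading bytes that A and B share.
function CommonPrefixBytes(const A, B: String): SizeInt;
var
  n: SizeInt;
begin
  n := Length(A);
  if Length(B) < n then
    n := Length(B);
  Result := 0;
  while (Result < n) and (A[Result + 1] = B[Result + 1]) do
    Inc(Result);
end;

// Hypothetical helper: True if b is a UTF-8 continuation byte (10xxxxxx).
function IsContinuationByte(b: Byte): Boolean;
begin
  Result := (b and $C0) = $80;
end;

var
  t1, t2: String;
  p: SizeInt;
begin
  t1 := 'a'#$C3#$A9;  // 'a' followed by U+00E9, as raw UTF-8 bytes
  t2 := 'a'#$C3#$A8;  // 'a' followed by U+00E8, as raw UTF-8 bytes
  p := CommonPrefixBytes(t1, t2);
  WriteLn('common prefix bytes: ', p);
  WriteLn('first differing byte is a continuation byte: ',
    IsContinuationByte(Ord(t1[p + 1])));
end.
```

Run as written, this reports a common prefix of 2 bytes and TRUE for the continuation-byte check: the two strings share the lead byte $C3, so the byte-wise scan stops in the middle of a character and the "delete" string would be a lone trailing byte.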
For UTF-8 strings, however, it is not so simple. With your help, I think I arrived at a rather simple modification of the ANSI procedure that achieves this:
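procedure TForm1.btnDiffUTF8Click(Sender: TObject);
var
  txt1, txt2 : String;
  str1, str2, s1, s2, e1, e2 : PChar;
  del_str, ins_str : String;
  pos : SizeInt;
  cplen1, cplen2 : Integer;
begin
  // Keep the memo texts in local variables so the PChar pointers stay valid.
  txt1 := memo1.Text;
  txt2 := memo2.Text;
  str1 := PChar(txt1);
  str2 := PChar(txt2);
  s1 := str1;
  s2 := str2;
  e1 := s1+length(txt1)-1;
  e2 := s2+length(txt2)-1;
  // Skip the common prefix, byte by byte.
  while (s1<=e1) and (s2<=e2) do
  begin
    if s1^<>s2^
      then break;
    inc(s1); inc(s2)
  end;
  // The scan may have stopped inside a multi-byte codepoint:
  // move s1/s2 back to the start of that codepoint.
  // The Boolean result is ignored, which assumes both texts are valid UTF-8.
  Utf8TryFindCodepointStart(str1,s1,cplen1);
  Utf8TryFindCodepointStart(str2,s2,cplen2);
  // Skip the common suffix without running back into the prefix.
  while (s1<=e1) and (s2<=e2) do
  begin
    if e1^<>e2^
      then break;
    dec(e1); dec(e2)
  end;
  // Move e1/e2 to the start of the last differing codepoint;
  // cplen1/cplen2 then hold its length in bytes.
  Utf8TryFindCodepointStart(str1,e1,cplen1);
  Utf8TryFindCodepointStart(str2,e2,cplen2);
  // Include the whole last codepoint in the differing part.
  del_str := leftstr(s1,e1-s1+cplen1);
  ins_str := leftstr(s2,e2-s2+cplen2);
  // Report the position in codepoints rather than bytes.
  pos := UTF8Length(str1,s1-str1);
  memo_out.text :=
    'at position: '+pos.toString+lineending+
    'delete: <'+del_str+'>:'+utf8length(del_str).ToString+lineending+
    'insert: <'+ins_str+'>:'+utf8length(ins_str).ToString;
end; // TForm1.btnDiffUTF8Click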
By using the LazUTF8 utility Utf8TryFindCodepointStart, I was able to avoid hand-written UTF-8 handling. The algorithm given above performs far better than the one in my initial post, which relied too heavily on Utf8Copy.
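For anyone who wants to try the idea outside a form, here is a minimal console sketch of the same prefix/suffix scan. It is not the procedure above: the names SingleDiffUtf8 and IsCont are hypothetical, and the codepoint boundaries are found by testing for continuation bytes by hand instead of calling Utf8TryFindCodepointStart. It assumes both inputs are valid UTF-8 and that the source is compiled without a {$codepage} directive, so the byte literals stay raw bytes.

```pascal
program Utf8SingleDiffSketch;
{$mode objfpc}{$H+}

// True if c is a UTF-8 continuation byte (bit pattern 10xxxxxx).
function IsCont(c: Char): Boolean;
begin
  Result := (Ord(c) and $C0) = $80;
end;

// Hypothetical routine, not from LazUTF8: single diff of A and B.
// PrefixBytes = length in bytes of the shared prefix;
// DelStr/InsStr = the parts of A and B between shared prefix and suffix.
procedure SingleDiffUtf8(const A, B: String; out PrefixBytes: SizeInt;
  out DelStr, InsStr: String);
var
  p, q, lenA, lenB: SizeInt;
begin
  lenA := Length(A);
  lenB := Length(B);
  // Longest common prefix, byte by byte.
  p := 0;
  while (p < lenA) and (p < lenB) and (A[p + 1] = B[p + 1]) do
    Inc(p);
  // Step back if the prefix ends inside a multi-byte codepoint.
  while (p > 0) and (((p < lenA) and IsCont(A[p + 1])) or
                     ((p < lenB) and IsCont(B[p + 1]))) do
    Dec(p);
  // Longest common suffix that does not overlap the prefix.
  q := 0;
  while (q < lenA - p) and (q < lenB - p) and (A[lenA - q] = B[lenB - q]) do
    Inc(q);
  // Shrink the suffix until it begins on a codepoint start.
  while (q > 0) and IsCont(A[lenA - q + 1]) do
    Dec(q);
  PrefixBytes := p;
  DelStr := Copy(A, p + 1, lenA - p - q);
  InsStr := Copy(B, p + 1, lenB - p - q);
end;

var
  pfx: SizeInt;
  delPart, insPart: String;
begin
  // 'x', U+20AC, 'y' versus 'x', U+00AC, 'y', written as raw UTF-8 bytes.
  SingleDiffUtf8('x'#$E2#$82#$AC'y', 'x'#$C2#$AC'y', pfx, delPart, insPart);
  WriteLn('prefix bytes: ', pfx);
  WriteLn('delete bytes: ', Length(delPart));
  WriteLn('insert bytes: ', Length(insPart));
end.
```

In this example the two characters in the middle end with the same byte ($AC), so a purely byte-wise suffix scan would cut both of them in half. With the boundary correction the sketch reports a 1-byte prefix, 3 bytes to delete and 2 bytes to insert, i.e. one whole character on each side.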