Author Topic: [SOLVED]How to get the position of different characters in UTF8 strings  (Read 11455 times)

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Hi everybody, I need help. I want to find the position of the first differing character in two UTF-8 strings. I use the following code, but I don't get the correct result. Can someone help me?


Code: Pascal
procedure TForm1.FormCreate(Sender: TObject);
var
  i: integer;
  Str1 : string = '一二三四五六七八九十';
  Str2 : string = '一二三四五六七十九十';
begin
  for i := 1 to Str1.Length - 1 do begin
    // writeln(i, '  ', Ord(Str1[i]), '  ', Ord(Str2[i]));
    if Str1[i] <> Str2[i] then break;
  end;
  writeln(Str1.Substring(1, i - 1)); // I want to get '一二三四五六七', but I don't
end;
« Last Edit: November 10, 2017, 04:04:42 am by tomitomy »

bylaardt

  • Sr. Member
  • ****
  • Posts: 310
Re: How to get the position of different characters in UTF8 strings
« Reply #1 on: November 08, 2017, 06:13:10 pm »
Try this:
Code: Pascal
// Needs LazUTF8 (UTF8Length, UTF8Copy) and SysUtils (Format) in the uses clause.
procedure TForm1.FormCreate(Sender: TObject);
var
  i: integer;
  Str1 : string = '一二三四五六七八九十';
  Str2 : string = '一二三四五六七十九十';
begin
  for i := 1 to UTF8Length(Str1) do
    if UTF8Copy(Str1, i, 1) <> UTF8Copy(Str2, i, 1) then
      // writeln does not expand format strings, so go through Format
      writeln(Format('%s and %s differ at position %d',
        [UTF8Copy(Str1, i, 1),
         UTF8Copy(Str2, i, 1),
         i]));
end;

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: How to get the position of different characters in UTF8 strings
« Reply #2 on: November 08, 2017, 06:14:20 pm »
These are "Code Points". Use CodePointCopy from LazUnicode.

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: How to get the position of different characters in UTF8 strings
« Reply #3 on: November 09, 2017, 01:42:41 am »
Thank you bylaardt, your code works, but I have a doubt: I don't know whether UTF8Length and UTF8Copy will hurt performance, because this function will be called frequently and the text may be very long.

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: How to get the position of different characters in UTF8 strings
« Reply #4 on: November 09, 2017, 01:46:12 am »
Quote
These are "Code Points". Use CodePointCopy from LazUnicode.

Thank you engkin, but I don't want to copy strings; I just want to get the position of the differing character.

jamie

  • Hero Member
  • *****
  • Posts: 7602
Re: How to get the position of different characters in UTF8 strings
« Reply #5 on: November 09, 2017, 02:10:26 am »
Can you use UnicodeString?

Those are WideStrings that are reference counted.

The Pos function will work on UnicodeStrings.

What I have done to ease things is to assign the UTF-8 string to a UnicodeString (unicodestring := Utf8String).

Then I use all the common string index functions.

You can always convert it back to UTF-8 afterwards.

You need to test whether you lose any characters with your code page.

The only true wisdom is knowing you know nothing

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: How to get the position of different characters in UTF8 strings
« Reply #6 on: November 09, 2017, 04:05:21 am »
Quote
Thank you engkin, but I don't want to copy strings; I just want to get the position of the differing character.

Build an array of all positions. Something like this, maybe?
Code: Pascal
// Needs LazUTF8 in the uses clause (UTF8CharacterLengthFast).
var
  Str1 : string = '一二三四五六七八九十';

  cpCount: integer = 0; { Code Point Counter }
  cpPos: array of pchar; { Code Point Positions }

  p,e: PChar; { Position, End Position }
  sz: Integer;
begin
  { Worst case is one code point per byte; the array is trimmed at the end }
  SetLength(cpPos, Length(Str1));
  p := @Str1[1];
  e := p+Length(Str1);
  repeat
    cpPos[cpCount] := p;
    sz := UTF8CharacterLengthFast(p);

    inc(p, sz);
    inc(cpCount);
  until p>=e;
  SetLength(cpPos, cpCount);
end;

cpPos holds these positions. It is zero-based (index 0 is the first code point).

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: How to get the position of different characters in UTF8 strings
« Reply #7 on: November 09, 2017, 07:33:07 am »
Thank you engkin, your code collects the positions of all the UTF-8 characters, but that's not what I need. I want to compare two strings and find the position where they first differ.

Thank you jamie, I tried "unicodestring", and it works well. I also read the UTF-8 documentation and found the following way to check whether a byte is the lead byte of a UTF-8 character.

Code: Pascal
procedure TForm1.FormCreate(Sender: TObject);
var
  i: integer;
  Str: string = '一二三四五六七八九十1234567890';
  b: byte;
begin
  for i := 1 to Str.Length do begin
    // Top two bits: 00/01 = ASCII, 11 = lead byte, 10 = trailing byte
    b := Ord(Str[i]) shr 6;
    if (b = 3) or (b shr 1 = 0) then
      writeln(Ord(Str[i]), '  Leader byte')
    else
      writeln(Ord(Str[i]), '     Trailing byte');
  end;
end;
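
The same test can be written as a single mask. This is only a sketch of the idea: IsUTF8LeadByte is a made-up helper name, not something from LazUTF8. A byte starts a codepoint unless its top two bits are 10, which is the same condition as the shr 6 check above.

```pascal
program LeadByteCheck;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of LazUTF8): True for any byte that starts
// a codepoint, i.e. anything except a 10xxxxxx continuation byte.
function IsUTF8LeadByte(b: Byte): Boolean;
begin
  Result := (b and $C0) <> $80;
end;

begin
  WriteLn(IsUTF8LeadByte($41)); // 'A', a single-byte codepoint
  WriteLn(IsUTF8LeadByte($E4)); // first byte of a three-byte sequence
  WriteLn(IsUTF8LeadByte($B8)); // continuation byte
end.
```

Both ASCII bytes and multi-byte lead bytes count as "lead" here, which is what the search code needs when it steps back to the start of a character.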

Then I wrote three functions that compare UTF-8 strings in different ways and tested their speed:

Code: Pascal
// Converts both strings to UTF-16 and compares code unit by code unit.
function UTF8DiffThroughUnicode(Str1, Str2: string): Integer;
var
  iEnd: integer;
  UniStr1: unicodestring;
  UniStr2: unicodestring;
begin
  UniStr1 := Str1;
  UniStr2 := Str2;

  if Length(UniStr1) < Length(UniStr2) then
    iEnd := Length(UniStr1)
  else
    iEnd := Length(UniStr2);

  Result := 1;
  while (Result <= iEnd) and (UniStr1[Result] = UniStr2[Result]) do
    Inc(Result);

  if Result > iEnd then Result := -1;
end;

// Compares byte by byte, then converts the byte position to a character position.
function UTF8DiffThroughByte(Str1, Str2: string): integer;
var
  iEnd: integer;
  b: byte;
begin
  if Length(Str1) < Length(Str2) then
    iEnd := Length(Str1)
  else
    iEnd := Length(Str2);

  Result := 1;
  while (Result <= iEnd) and (Str1[Result] = Str2[Result]) do
    Inc(Result);

  if Result > iEnd then Result := -1;

  if Result > 0 then begin
    // Step back over trailing bytes to the lead byte of the character
    b := Ord(Str1[Result]) shr 6;
    while (b <> 3) and (b shr 1 <> 0) do begin
      Dec(Result);
      b := Ord(Str1[Result]) shr 6;
    end;
    // Count the characters before that byte; the differing one is the next
    Result := UTF8LengthFast(PChar(Str1), Result - 1) + 1;
  end;
end;

// Compares character by character with UTF8Copy.
function UTF8DiffThroughUTF8(Str1, Str2: string): integer;
var
  iEnd: integer;
begin
  // Compare codepoint counts, not byte lengths, to find the shorter string
  iEnd := UTF8Length(Str1);
  if UTF8Length(Str2) < iEnd then
    iEnd := UTF8Length(Str2);

  Result := 1;
  while (Result <= iEnd) and (UTF8Copy(Str1, Result, 1) = UTF8Copy(Str2, Result, 1)) do
    Inc(Result);

  if Result > iEnd then Result := -1;
end;

procedure TForm1.FormCreate(Sender: TObject);
var
  i: integer;
  Stream: TStringStream;
  Str1, Str2: string;
  Time: TDateTime; // floating-point type, so the elapsed ticks print in scientific notation
begin
  Stream := TStringStream.Create('');
  for i := 1 to 50000 do
    Stream.WriteString('一二三四五六七八九十1234567890');
  Str1 := Stream.DataString + '尾';
  Str2 := Stream.DataString + '巴';
  Stream.Free;

  Time := GetTickCount64;
  writeln(UTF8DiffThroughUnicode(Str1, Str2));
  writeln(GetTickCount64 - Time);

  Time := GetTickCount64;
  writeln(UTF8DiffThroughByte(Str1, Str2));
  writeln(GetTickCount64 - Time);

  // very slow
  //Time := GetTickCount64;
  //writeln(UTF8DiffThroughUTF8(Str1, Str2));
  //writeln(GetTickCount64 - Time);
end;

The test results are as follows (the UTF8DiffThroughUTF8 function is very slow, so I aborted it):

Code:
1000001
 2.1000000000000000E+001
1000001
 7.0000000000000000E+000

Based on the test results, I decided to use UTF8DiffThroughByte. Thank you all for your help! :) I will use "unicodestring := Utf8String" elsewhere.
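
For readers without LazUTF8 at hand, the three steps of the byte approach (compare bytes, step back to the lead byte, count lead bytes) can be sketched in plain Free Pascal. FirstDiffCodepoint is a hypothetical name, the input is assumed to be valid UTF-8, and the test data is written as raw bytes so the result does not depend on the source file encoding.

```pascal
program FirstDiffSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: 1-based codepoint index of the first difference,
// or -1 if the strings are equal over the length of the shorter one.
function FirstDiffCodepoint(const A, B: RawByteString): Integer;
var
  i, n, k: Integer;
begin
  n := Length(A);
  if Length(B) < n then n := Length(B);

  // 1. Plain byte comparison.
  i := 1;
  while (i <= n) and (A[i] = B[i]) do Inc(i);
  if i > n then Exit(-1);

  // 2. Step back from a continuation byte (10xxxxxx) to the lead byte.
  while (i > 1) and ((Ord(A[i]) and $C0) = $80) do Dec(i);

  // 3. Count the lead bytes up to and including position i.
  Result := 0;
  for k := 1 to i do
    if (Ord(A[k]) and $C0) <> $80 then Inc(Result);
end;

begin
  // 'a' + U+00E9 versus 'a' + U+00E8: the bytes first differ at index 3,
  // which is the second byte of the second codepoint.
  WriteLn(FirstDiffCodepoint('a'#$C3#$A9, 'a'#$C3#$A8));
end.
```

Step 2 matters because the first differing byte is often a trailing byte; without it the byte offset would point into the middle of a character.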

« Last Edit: November 09, 2017, 07:38:39 am by tomitomy »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4695
  • I like bugs.
Re: How to get the position of different characters in UTF8 strings
« Reply #8 on: November 09, 2017, 12:56:41 pm »
Quote
Can you use UnicodeString?
Those are WideStrings that are reference counted.
The Pos function will work on UnicodeStrings.
What I have done to ease things is to define a unicodestring := Utf8String;
Then I use all the common string index functions.
Please stop spreading false information!
WideString and UnicodeString use UTF-16 encoding, which is variable width just like UTF-8.
Fixed indexing works only for UCS-2, which is not really used any more. Even Windows has supported full Unicode since the year 2000.
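
A small sketch of the point, assuming Free Pascal's UTF8Decode from the System unit: one codepoint outside the BMP takes two UTF-16 code units, so indexing a UnicodeString by position does not address whole characters in general. The sample codepoint (U+1F600) is written as raw bytes so the example does not depend on the source file encoding.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

var
  Utf8: RawByteString;
  Wide: UnicodeString;
begin
  // U+1F600 encoded as UTF-8 is the four bytes F0 9F 98 80.
  SetLength(Utf8, 4);
  Utf8[1] := #$F0;
  Utf8[2] := #$9F;
  Utf8[3] := #$98;
  Utf8[4] := #$80;

  Wide := UTF8Decode(Utf8);

  // One codepoint, but two UTF-16 code units (a surrogate pair),
  // so Wide[1] on its own is not a complete character.
  WriteLn('UTF-8 bytes:       ', Length(Utf8));
  WriteLn('UTF-16 code units: ', Length(Wide));
end.
```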

Why is it that programmers usually are careful about their program logic and try to prevent bugs, but with Unicode serious bugs are OK?
Somebody please explain.

Quote
You can always convert it back to UTF-8 afterwards.
You need to test whether you lose any characters with your code page.
What do you mean by code page? I understood the data was Unicode all the time and did not use Windows system codepage.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4695
  • I like bugs.
Re: How to get the position of different characters in UTF8 strings
« Reply #9 on: November 09, 2017, 01:10:39 pm »
Quote
Thank you jamie, I tried "unicodestring", and it works well.
It happens to work in this case because all codepoints in your example string belong to the Unicode BMP (Basic Multilingual Plane). It does not work as a general solution. Why was nobody else alarmed that false advice was given again? There is a strange consensus that buggy code for Unicode data is OK.  :(

Quote
And I also read the UTF8 documentation, and I found the following method to check the leader byte of UTF8 character.
...
According to the test results, I decided to use UTF8DiffThroughByte, Thank you all for your help! :) I will use "unicodestring := Utf8String" elsewhere.

You have written a lot of useless code. There already are helper functions and example code that you could use. See:
 http://wiki.freepascal.org/UTF8_strings_and_characters#Iterating_over_string_analysing_individual_codepoints
It is fast and grows linearly (O(n)).
The code by bylaardt that uses UTF8Length() and UTF8Copy() works but is slow: every UTF8Copy() call has to scan from the start of the string, so the total time grows quadratically.

Don't do "unicodestring := Utf8String" anywhere. It is totally useless for this particular case.
Instead you could use the enumerators provided in unit LazUnicode :
Code: Pascal
// Needs LazUnicode in the uses clause.
const
  Str1 = '一二三四五六七八九十';
  Str2 = '一二三四五六七十九十';

// Returns the common prefix of S1 and S2, codepoint by codepoint.
function IdenticalString(const S1, S2 : string): string;
var
  cIter1, cIter2: TCodePointEnumerator;
  HasData1, HasData2: Boolean;
begin
  Result := '';
  cIter1 := TCodePointEnumerator.Create(S1);
  cIter2 := TCodePointEnumerator.Create(S2);
  while True do
  begin
    HasData1 := cIter1.MoveNext;
    HasData2 := cIter2.MoveNext;
    // Stop when either string ends or the codepoints differ
    if not (HasData1 and HasData2) or (cIter1.Current <> cIter2.Current) then
      Break;
    Result := Result + cIter1.Current;
  end;
  cIter2.Free;
  cIter1.Free;
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  Memo1.Lines.Add(IdenticalString(Str1, Str2));
end;
(I tested with a Memo.) The code is not ideal because Result is built by repeated concatenation:
  Result := Result + cIter1.Current;
However, such a common case is optimized by the compiler.

Even better, you can just as easily use TUnicodeCharacterEnumerator, and the code then also works correctly with Combining Diacritical Marks:
Code: Pascal
function IdenticalString(const S1, S2 : string): string;
var
  cIter1, cIter2: TUnicodeCharacterEnumerator;
  HasData1, HasData2: Boolean;
begin
  Result := '';
  cIter1 := TUnicodeCharacterEnumerator.Create(S1);
  cIter2 := TUnicodeCharacterEnumerator.Create(S2);
  while True do
  begin
    HasData1 := cIter1.MoveNext;
    HasData2 := cIter2.MoveNext;
    if not (HasData1 and HasData2) or (cIter1.Current <> cIter2.Current) then
      Break;
    Result := Result + cIter1.Current;
  end;
  cIter2.Free;
  cIter1.Free;
end;
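
To illustrate why combining marks matter, here is a minimal sketch in plain Free Pascal. CodepointCount is a hypothetical helper, not a LazUnicode function. The same visible character "é" can be one codepoint (U+00E9) or two ('e' followed by the combining acute accent U+0301), so counting codepoints is not always the same as counting what the user sees as characters.

```pascal
program CombiningDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts codepoints by counting the bytes that are
// not 10xxxxxx continuation bytes. Assumes valid UTF-8.
function CodepointCount(const S: RawByteString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then Inc(Result);
end;

begin
  // Precomposed U+00E9 is the bytes C3 A9.
  WriteLn(CodepointCount(#$C3#$A9));
  // 'e' followed by U+0301 (bytes CC 81): two codepoints, one visible character.
  WriteLn(CodepointCount('e'#$CC#$81));
end.
```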
« Last Edit: November 09, 2017, 02:28:28 pm by JuhaManninen »

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: How to get the position of different characters in UTF8 strings
« Reply #10 on: November 09, 2017, 03:05:00 pm »
Quote
you could use the enumerators provided in unit LazUnicode :

Thank you JuhaManninen. I read the link you gave me and learned a lot. I tested your code; it is clear and works well, but it is slower than my code. The test code and the test results are as follows:

Code: Pascal
// Compares byte by byte, then converts the byte position to a character position.
function UTF8DiffThroughByte(Str1, Str2: string): integer;
var
  iEnd: integer;
  b: byte;
begin
  if Length(Str1) < Length(Str2) then
    iEnd := Length(Str1)
  else
    iEnd := Length(Str2);

  Result := 1;
  while (Result <= iEnd) and (Str1[Result] = Str2[Result]) do
    Inc(Result);

  if Result > iEnd then Result := -1;

  if Result > 0 then begin
    // Step back over trailing bytes to the lead byte of the character
    b := Ord(Str1[Result]) shr 6;
    while (b <> 3) and (b shr 1 <> 0) do begin
      Dec(Result);
      b := Ord(Str1[Result]) shr 6;
    end;
    Result := UTF8LengthFast(PChar(Str1), Result - 1) + 1;
  end;
end;

// Same search using the LazUnicode codepoint enumerator.
function DifferentPos(const S1, S2 : string): integer;
var
  cIter1, cIter2: TCodePointEnumerator;
  HasData1, HasData2: Boolean;
begin
  Result := 0;
  cIter1 := TCodePointEnumerator.Create(S1);
  cIter2 := TCodePointEnumerator.Create(S2);
  while True do
  begin
    HasData1 := cIter1.MoveNext;
    HasData2 := cIter2.MoveNext;
    Inc(Result);
    if not (HasData1 and HasData2) or (cIter1.Current <> cIter2.Current) then
      Break;
  end;
  cIter2.Free;
  cIter1.Free;

  // One of the strings ended before a difference was found
  if not (HasData1 and HasData2) then Result := -1;
end;

procedure TForm1.Button1Click(Sender: TObject);
var
  i: integer;
  Stream: TStringStream;
  Str1, Str2: string;
  TickCount: QWord;
begin
  Stream := TStringStream.Create('');
  for i := 1 to 50000 do
    Stream.WriteString('一二三四五六七八九十1234567890');
  Str1 := Stream.DataString + '尾';
  Str2 := Stream.DataString + '巴';
  Stream.Free;

  TickCount := GetTickCount64;
  writeln('UTF8DiffThroughByte: ');
  writeln(
    '  Result: ',
    UTF8DiffThroughByte(Str1, Str2),
    '  Time Used: ',
    GetTickCount64 - TickCount
  );

  TickCount := GetTickCount64;
  writeln('DifferentPos: ');
  writeln(
    '  Result: ',
    DifferentPos(Str1, Str2),
    '  Time Used: ',
    GetTickCount64 - TickCount
  );
end;

Test Result:

Code:
UTF8DiffThroughByte:
  Result: 1000001  Time Used: 7
DifferentPos:
  Result: 1000001  Time Used: 106

I also need to search backwards from somewhere in the middle of the string, so I need more flexible code than iterating from the start. Your code is still useful, though, and I will use it elsewhere.
« Last Edit: November 09, 2017, 03:24:27 pm by tomitomy »

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: How to get the position of different characters in UTF8 strings
« Reply #11 on: November 09, 2017, 03:37:58 pm »
This is the code I'm working with. I use it to work around a problem with drag-and-drop text in TMemo: dropping text does not give the correct SelStart, so I need a fast function to calculate the correct SelStart. (I haven't tested it thoroughly, but it looks like it works fine.)

Code: Pascal
uses
  LazUTF8;

// Byte-level search. First index is 1, last index is Length(Str).
// If a difference is found, Start1 and Start2 end up on the lead bytes of
// the differing characters.
procedure UTF8DiffBytePos(Str1, Str2: string; var Start1, Start2: integer; Reverse: boolean = False);
var
  b: byte;
begin
  if Reverse then begin
    while (Start1 >= 1) and (Start2 >= 1) and (Str1[Start1] = Str2[Start2]) do begin
      Dec(Start1);
      Dec(Start2);
    end;

    if Start1 >= 1 then begin
      // Step back to the lead byte of the UTF-8 character
      b := Ord(Str1[Start1]) shr 6;
      while (b <> 3) and (b shr 1 <> 0) do begin
        Dec(Start1);
        b := Ord(Str1[Start1]) shr 6;
      end;
    end;

    if Start2 >= 1 then begin
      // Step back to the lead byte of the UTF-8 character
      b := Ord(Str2[Start2]) shr 6;
      while (b <> 3) and (b shr 1 <> 0) do begin
        Dec(Start2);
        b := Ord(Str2[Start2]) shr 6;
      end;
    end;
  end else begin
    while (Start1 <= Str1.Length) and (Start2 <= Str2.Length) and (Str1[Start1] = Str2[Start2]) do begin
      Inc(Start1);
      Inc(Start2);
    end;

    if Start1 <= Str1.Length then begin
      // Step back to the lead byte of the UTF-8 character
      b := Ord(Str1[Start1]) shr 6;
      while (b <> 3) and (b shr 1 <> 0) do begin
        Dec(Start1);
        b := Ord(Str1[Start1]) shr 6;
      end;
    end;

    if Start2 <= Str2.Length then begin
      // Step back to the lead byte of the UTF-8 character
      b := Ord(Str2[Start2]) shr 6;
      while (b <> 3) and (b shr 1 <> 0) do begin
        Dec(Start2);
        b := Ord(Str2[Start2]) shr 6;
      end;
    end;
  end;
end;

// Character-level search. First index is 1, last index is UTF8Length(Str).
procedure UTF8Diff(Str1, Str2: string; var Start1, Start2: integer; Reverse: boolean = False);
begin
  // Convert the character positions to byte positions
  if not Reverse then begin
    Dec(Start1);
    Dec(Start2);
  end;

  Start1 := UTF8CharToByteIndex(PChar(Str1), Str1.Length, Start1);
  Start2 := UTF8CharToByteIndex(PChar(Str2), Str2.Length, Start2);

  if not Reverse then begin
    Inc(Start1);
    Inc(Start2);
  end;
  UTF8DiffBytePos(Str1, Str2, Start1, Start2, Reverse);

  // Convert the byte positions back to character positions
  if Start1 > 0 then Start1 := UTF8LengthFast(PChar(Str1), Start1 - 1) + 1;
  if Start2 > 0 then Start2 := UTF8LengthFast(PChar(Str2), Start2 - 1) + 1;
end;

procedure Test;
var
  Str1: string = '一二三四五六七八九十1234567890一二三四五六七八九十1234567890';
  Str2: string = '五六七八九十1234567890一二三四1234567890';
  Pos1, Pos2: integer;
  i: integer;
  Stream: TStringStream;
  TickCount: QWord;
begin
  Pos1 := 8;
  Pos2 := 4;

  UTF8Diff(Str1, Str2, Pos1, Pos2);
  writeln('Different Pos in Str1: ', Pos1);
  writeln('Different Pos in Str2: ', Pos2);

  writeln('----------');

  Pos1 := 8;
  Pos2 := 4;

  UTF8Diff(Str1, Str2, Pos1, Pos2, True);
  writeln('Different Pos in Str1: ', Pos1);
  writeln('Different Pos in Str2: ', Pos2);

  writeln('----------');

  Stream := TStringStream.Create('');
  for i := 1 to 50000 do
    Stream.WriteString('一二三四五六七八九十1234567890');
  Str1 := Stream.DataString + '尾';
  Str2 := Stream.DataString + '巴';
  Stream.Free;

  Pos1 := 1;
  Pos2 := 1;

  TickCount := GetTickCount64;
  UTF8Diff(Str1, Str2, Pos1, Pos2);
  writeln('Time Used: ', GetTickCount64 - TickCount);
  writeln('Different Pos in Str1: ', Pos1);
  writeln('Different Pos in Str2: ', Pos2);
end;

{ TForm1 }

procedure TForm1.FormCreate(Sender: TObject);
begin
  Test;
end;

Running Results:
Code:
Different Pos in Str1: 25
Different Pos in Str2: 21
----------
Different Pos in Str1: 4
Different Pos in Str2: 0
----------
Time Used: 14
Different Pos in Str1: 1000001
Different Pos in Str2: 1000001

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4695
  • I like bugs.
Re: How to get the position of different characters in UTF8 strings
« Reply #12 on: November 09, 2017, 04:45:25 pm »
Quote
Test Result:

Code:
UTF8DiffThroughByte:
  Result: 1000001  Time Used: 7
DifferentPos:
  Result: 1000001  Time Used: 106
Wow, such a big difference!
I must admit my code was not good for your purposes. It copies every codepoint/character separately to a string, which is slow. I will later add an enumerator that works with a PChar and a length, thus eliminating a copy, as BeniBela has suggested. Even that would be much slower than your customized solution. You compare byte by byte up to the point where the strings differ, which is about as fast as it gets.
I think your code is too specialized to be added to LazUnicode, but there could be an example in the wiki.
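
As a rough idea of what walking codepoints in place could look like, here is a sketch in plain Free Pascal. It is not the LazUnicode enumerator and not a planned API; CodepointSize and DiffByCodepoint are hypothetical names, and the input is assumed to be valid UTF-8. Each codepoint is compared where it sits in memory with CompareByte instead of being copied into a new string.

```pascal
program NoCopyWalk;
{$mode objfpc}{$H+}

// Hypothetical helper: byte length of the codepoint whose lead byte is at p.
// Assumes valid UTF-8; continuation bytes are not validated.
function CodepointSize(p: PChar): Integer;
begin
  case Ord(p^) of
    $00..$7F: Result := 1;
    $C0..$DF: Result := 2;
    $E0..$EF: Result := 3;
    $F0..$F7: Result := 4;
  else
    Result := 1; // stray continuation or invalid byte: advance one byte
  end;
end;

// Walks both strings one codepoint at a time without copying.
// Returns the 1-based codepoint index of the first difference, or -1 if
// the strings are equal over the length of the shorter one.
function DiffByCodepoint(const A, B: RawByteString): Integer;
var
  pa, pb, ea, eb: PChar;
  sz: Integer;
begin
  Result := 1;
  pa := PChar(A);
  ea := pa + Length(A);
  pb := PChar(B);
  eb := pb + Length(B);
  while (pa < ea) and (pb < eb) do
  begin
    sz := CodepointSize(pa);
    // Different sizes, a truncated sequence, or different bytes: stop here
    if (sz <> CodepointSize(pb)) or (pa + sz > ea) or (pb + sz > eb)
       or (CompareByte(pa^, pb^, sz) <> 0) then
      Exit;
    Inc(pa, sz);
    Inc(pb, sz);
    Inc(Result);
  end;
  Result := -1;
end;

begin
  // 'a', U+00E9, 'z' versus 'a', U+00E8, 'z': the second codepoint differs.
  WriteLn(DiffByCodepoint('a'#$C3#$A9'z', 'a'#$C3#$A8'z'));
end.
```

This still does more work per character than a straight byte comparison, so the byte-by-byte search followed by one conversion to a character index remains the faster option for long strings.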

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4695
  • I like bugs.
Re: How to get the position of different characters in UTF8 strings
« Reply #13 on: November 09, 2017, 11:51:17 pm »
@tomitomy, I like to optimize and refactor.
Your code is very optimized already but I refactored one function just for the fun of it, removing some duplicate code.

Code: Pascal
// First index is 1, last index is Length(Str)
procedure UTF8DiffBytePos(Str1, Str2: string; var Start1, Start2: integer; Reverse: boolean = False);

  procedure GoToCpStartStr1;
  var
    b: byte;
  begin  // Go to beginning of UTF8 Codepoint in Str1
    while True do begin
      b := Ord(Str1[Start1]) shr 6;
      if (b = 3) or (b shr 1 = 0) then
        break;
      Dec(Start1);
    end;
  end;

  procedure GoToCpStartStr2;
  var
    b: byte;
  begin  // Go to beginning of UTF8 Codepoint in Str2
    while True do begin
      b := Ord(Str2[Start2]) shr 6;
      if (b = 3) or (b shr 1 = 0) then
        break;
      Dec(Start2);
    end;
  end;

begin
  if Reverse then begin
    while (Start1 >= 1) and (Start2 >= 1) and (Str1[Start1] = Str2[Start2]) do begin
      Dec(Start1);
      Dec(Start2);
    end;
    if Start1 > 1 then
      GoToCpStartStr1;
    if Start2 > 1 then
      GoToCpStartStr2;
  end else begin
    while (Start1 <= Str1.Length) and (Start2 <= Str2.Length) and (Str1[Start1] = Str2[Start2]) do begin
      Inc(Start1);
      Inc(Start2);
    end;
    if Start1 <= Str1.Length then
      GoToCpStartStr1;
    if Start2 <= Str2.Length then
      GoToCpStartStr2;
  end;
end;

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: How to get the position of different characters in UTF8 strings
« Reply #14 on: November 10, 2017, 01:57:11 am »
Thank you JuhaManninen, I like clear code, your code is clear, I like it. :)

 
