Recent

Author Topic: Searching for words in a UTF8 string  (Read 12801 times)

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4689
  • I like bugs.
Re: Searching for words in a UTF8 string
« Reply #15 on: August 17, 2016, 02:47:05 pm »
@fedkad and others, please test my search function that iterates codepoints and return codepoint positions instead of byte positions.
This was an experiment. I am not sure if the approach is the best possible.
The optimizations require pointers and a C-style function strlcomp().
The code could be easily ported to Delphi by replacing UTF8CharacterLength() with CodePointSize() from unit LazUnicode.

I tested only with short text. I am interested in the performance of this code with huge amounts of text, if somebody wants to test it.
Maybe somebody finds ways to optimize it further.

Please remember that the code only deals with codepoints, not with "Unicode characters" which include combined codepoints. The combining rules are complex and do not depend on encoding.

Code: Pascal  [Select][+][-]
  1. unit uSearch;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. interface
  6.  
  7. uses
  8.   SysUtils, fgl, LazUTF8;
  9.  
  10. type
  11.   TIntList = specialize TFPGList<Integer>;
  12.  
  13. procedure SearchInWordBoundaries(const SubStr, S: string; BegOfWord, EndOfWord: Boolean;
  14.                                  Results: TIntList);
  15.  
  16. implementation
  17.  
  18. procedure SearchInWordBoundaries(const SubStr, S: string; BegOfWord, EndOfWord: Boolean;
  19.                                  Results: TIntList);
  20. const
  21.   cBlanks : set of Char = [#9,#10,#12,#13,#14,#32];
  22. var
  23.   FirstCh: Char;
  24.   i, CPPos: Integer;  // Current byte and codepoint positions.
  25.   CPSize: Integer;    // Size of one codepoint.
  26.   SubEnd: Integer;    // End of search string.
  27.   BegMatch, EndMatch: Boolean;
  28. begin
  29.   Assert(SubStr<>'', 'SearchInWordBoundaries: SubStr is empty.');
  30.   FirstCh := SubStr[1];
  31.   i := 1;
  32.   CPPos := 1;
  33.   BegMatch := True;
  34.   EndMatch := True;
  35.   while i <= length(S) do
  36.   begin
  37.     CPSize := UTF8CharacterLength(@S[i]);
  38.     // Match in beginning of word.
  39.     if BegOfWord then
  40.       BegMatch := (i = 1) or (S[i-1] in cBlanks);
  41.     // Match in end of word.
  42.     if EndOfWord then
  43.     begin
  44.       SubEnd := i + Length(SubStr);
  45.       EndMatch := (SubEnd > Length(S)) or (S[SubEnd] in cBlanks);
  46.     end;
  47.     // Compare
  48.     if BegMatch and EndMatch
  49.     and (S[i] = FirstCh) // Optimization: compare first char before calling strlcomp.
  50.     and (strlcomp(@S[i], @SubStr[1], Length(SubStr)) = 0)
  51.     then
  52.       Results.Add(CPPos);
  53.     inc(i, CPSize);
  54.     inc(CPPos);
  55.   end;
  56. end;
  57.  
  58. end.

I tested it in a form with a Memo and Checkboxes:

Code: Pascal  [Select][+][-]
  1.   SearchResult := TIntList.Create;
  2.   Memo1.Clear;
  3.   Memo1.Lines.Add('Codepoint positions:');
  4.   SearchInWordBoundaries('βγδ', 'αβγδεζηθ βγδ βγδεζηθ',
  5.             cbSearchBegOfWord.Checked, cbSearchEndOfWord.Checked, SearchResult);
  6.   for i := 0 to SearchResult.Count-1 do
  7.     Memo1.Lines.Add(' ' + IntToStr(SearchResult[i]));
  8.   SearchResult.Free;
  9.  
« Last Edit: August 17, 2016, 02:57:04 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Searching for words in a UTF8 string
« Reply #16 on: August 18, 2016, 11:52:56 pm »
I am interested in the performance of this code with huge amounts of text, if somebody wants to test it.
Maybe somebody finds ways to optimize it further.

UTF8CharacterLength returns zero for invalid UTF8 codepoints and SearchInWordBoundaries will loop forever. Since UTF8CharacterLength is used for every codepoint, it seems that replacing it with a faster code should increase the speed.

We can find the size of a UTF8 codepoint from its first byte. This means we can get rid of UTF8CharacterLength and use a simple array:
Code: Pascal  [Select][+][-]
  1. var
  2.   UTF8CPSizeArray: array[ansichar] of integer;
  3. ...
  4. procedure BuildUTF8CPSizeArray;
  5. var
  6.   i, CPSize: byte;
  7. begin
  8.  for i := 0 to 255 do
  9.  begin
  10.    CPSize := bsrbyte(not(i));
  11.    if (CPSize>6) then
  12.      UTF8CPSizeArray[ansichar(i)] := 1
  13.    else
  14.      UTF8CPSizeArray[ansichar(i)] := 7-CPSize;
  15.  end;
  16. end;

or directly:
Code: Pascal  [Select][+][-]
  1.   UTF8CPSizeArray: array[ansichar] of integer =
  2.   (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  3.    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  4.    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  5.    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  6.    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  7.    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  8.    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  9.    1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
  10.    2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4,
  11.    4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 1);

On an older computer, the new SearchInWordBoundaries is faster by around %20 to %50 (depends on BegOfWord and EndOfWord) using string from a few MB text file.
Code: Pascal  [Select][+][-]
  1. procedure SearchInWordBoundaries(const SubStr, S: string; BegOfWord, EndOfWord: Boolean;
  2.                                  Results: TIntList);
  3. const
  4.   cBlanks : set of Char = [#9,#10,#12,#13,#14,#32];
  5. var
  6.   FirstCh: Char;
  7.   i, CPPos: Integer;  // Current byte and codepoint positions.
  8.   SubEnd: Integer;    // End of search string.
  9.   BegMatch, EndMatch: Boolean;
  10. begin
  11.   Assert(SubStr<>'', 'SearchInWordBoundaries: SubStr is empty.');
  12.   FirstCh := SubStr[1];
  13.   i := 1;
  14.   CPPos := 1;
  15.   BegMatch := True;
  16.   EndMatch := True;
  17.   while i <= length(S) do
  18.   begin
  19.     // Match in beginning of word.
  20.     if BegOfWord then
  21.       BegMatch := (i = 1) or (S[i-1] in cBlanks);
  22.     // Match in end of word.
  23.     if EndOfWord then
  24.     begin
  25.       SubEnd := i + Length(SubStr);
  26.       EndMatch := (SubEnd > Length(S)) or (S[SubEnd] in cBlanks);
  27.     end;
  28.     // Compare
  29.     if BegMatch and EndMatch
  30.     and (S[i] = FirstCh) // Optimization: compare first char before calling strlcomp.
  31.     and (strlcomp(@S[i], @SubStr[1], Length(SubStr)) = 0)
  32.     then
  33.     begin
  34.       Results.Add(CPPos);
  35.     end;
  36.     inc(i, UTF8CPSizeArray[S[i]]);  //<-------
  37.     inc(CPPos);
  38.   end;
  39. end;
  40.  

Simple benchmark results where SearchInWordBoundaries[5] is using UTF8CPSizeArray:
Quote
BegMatch: FALSE, EndMatch: FALSE
SearchInWordBoundaries[0] : 711509258 ticks
SearchInWordBoundaries[5] : 287824056 ticks


BegMatch: TRUE, EndMatch: FALSE
SearchInWordBoundaries[0] : 861294774 ticks
SearchInWordBoundaries[5] : 552766946 ticks


BegMatch: FALSE, EndMatch: TRUE
SearchInWordBoundaries[0] : 1041616080 ticks
SearchInWordBoundaries[5] : 661551589 ticks


BegMatch: TRUE, EndMatch: TRUE
SearchInWordBoundaries[0] : 1211009940 ticks
SearchInWordBoundaries[5] : 1012954865 ticks

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4689
  • I like bugs.
Re: Searching for words in a UTF8 string
« Reply #17 on: August 19, 2016, 05:30:29 pm »
UTF8CharacterLength returns zero for invalid UTF8 codepoints and SearchInWordBoundaries will loop forever.

Looking at the code ... no it does not.
It returns zero only when its PChar parameter = Nil. For invalid UTF8 it returns 1.

Quote
On an older computer, the new SearchInWordBoundaries is faster by around %20 to %50 (depends on BegOfWord and EndOfWord) using string from a few MB text file.

Cool! We must get a faster function in LazUTF8 unit, too.
Either by replacing UTF8CharacterLength or by adding a new function.
« Last Edit: August 19, 2016, 05:43:21 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Ondrej Pokorny

  • Full Member
  • ***
  • Posts: 220
Re: Searching for words in a UTF8 string
« Reply #18 on: August 19, 2016, 10:31:34 pm »
AFAIR, a CASE statement is as  performant as an array and doesn't need to allocate so much memory.

Instead of
Code: [Select]
inc(i, UTF8CPSizeArray[S[i]]);
you can use:
Code: [Select]
case S[I] of
  0..127 [?]: Inc(I);
  128[?]..[?]: Inc(I, 2);
  [?]..[?]: Inc(I,  3);

  etc etc
(please set the correct ranges from UTF8CPSizeArray into the case statement.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Searching for words in a UTF8 string
« Reply #19 on: August 20, 2016, 12:39:32 am »
UTF8CharacterLength returns zero for invalid UTF8 codepoints and SearchInWordBoundaries will loop forever.

Looking at the code ... no it does not.
It returns zero only when its PChar parameter = Nil. For invalid UTF8 it returns 1.

You are right, somehow I got the wrong value.

AFAIR, a CASE statement is as  performant as an array and doesn't need to allocate so much memory.

I hope so.  :)

Instead of
Code: [Select]
inc(i, UTF8CPSizeArray[S[i]]);
you can use:
Code: [Select]
case S[I] of
  0..127 [?]: Inc(I);
  128[?]..[?]: Inc(I, 2);
  [?]..[?]: Inc(I,  3);

  etc etc
(please set the correct ranges from UTF8CPSizeArray into the case statement.

I'll test and get back with the results.

Edit:
You are RIGHT!!

Using CASE turned out to be faster than using the array. The following code is equivalent to using the array:
Code: Pascal  [Select][+][-]
  1.     case S[i] of
  2.       #0  ..#191,#255: Inc(I);
  3.       #192..#223: Inc(I, 2);
  4.       #224..#239: Inc(I,  3);
  5.       #240..#247: Inc(I,  4);
  6.       #248..#251: Inc(I,  5);
  7.       #252,#253: Inc(I,  6);
  8.       #254: Inc(I,  7);
  9.     end;
« Last Edit: August 20, 2016, 01:05:57 am by engkin »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4689
  • I like bugs.
Re: Searching for words in a UTF8 string
« Reply #20 on: August 21, 2016, 06:42:01 pm »
Using CASE turned out to be faster than using the array. The following code is equivalent to using the array:
Code: Pascal  [Select][+][-]
  1.     case S[i] of
  2.       #0  ..#191,#255: Inc(I);
  3.       #192..#223: Inc(I, 2);
  4.       #224..#239: Inc(I,  3);
  5.       #240..#247: Inc(I,  4);
  6.       #248..#251: Inc(I,  5);
  7.       #252,#253: Inc(I,  6);
  8.       #254: Inc(I,  7);
  9.     end;
When using it for a function, it should return 0 for char #0. Am I right?
I committed such a function named UTF8CharacterLengthFast in r52853. It is inlined, thus very fast. Please test.
This is how it looks like:
Code: Pascal  [Select][+][-]
  1. function UTF8CharacterLengthFast(p: PChar): integer;
  2. begin
  3.   case p^ of
  4.     #0           : Result := 0;
  5.     #1..#191,#255: Result := 1;
  6.     #192..#223   : Result := 2;
  7.     #224..#239   : Result := 3;
  8.     #240..#247   : Result := 4;
  9.     #248..#251   : Result := 5;
  10.     #252, #253   : Result := 6;
  11.     #254         : Result := 7;
  12.   end;
  13. end;
« Last Edit: August 21, 2016, 07:16:19 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Searching for words in a UTF8 string
« Reply #21 on: August 21, 2016, 08:12:11 pm »
When using it for a function, it should return 0 for char #0. Am I right?

Not according to the source code of UTF8CharacterLength ;-) :
Code: Pascal  [Select][+][-]
  1. function UTF8CharacterLength(p: PChar): integer;
  2. begin
  3.   if p<>nil then begin
  4.     if ord(p^)<%11000000 then begin
  5.       // regular single byte character (#0 is a character, this is pascal ;)
  6.       Result:=1;

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4689
  • I like bugs.
Re: Searching for words in a UTF8 string
« Reply #22 on: August 21, 2016, 09:50:28 pm »
Not according to the source code of UTF8CharacterLength ;-) :

Right, I did not pay attention. I fixed it in r52856.
Thanks.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

 

TinyPortal © 2005-2018