Recent

Author Topic: Searching for words in a UTF8 string  (Read 12891 times)

fedkad

  • Full Member
  • ***
  • Posts: 178
Searching for words in a UTF8 string
« on: August 15, 2016, 11:46:45 am »
I want to search a UTF8 string within another UTF8 string with word boundaries (the search string must match beginning of word, end of word, or both).

SearchBuf was some options close to some of my needs, but it does NOT work with UTF8 strings. (Try to search the string 'βγδ' with the option [soWholeWord,soDown] in a string like 'αβγδεζηθ βγδ αβγδεζηθ', and it will match the first (left-most) occurrence.)

Is there a reliable solution to this?
Lazarus 4.0 / FPC 3.2.2 on x86_64-linux-gtk2 (Ubuntu/GNOME) and x86_64-win64-win32/win64 (Windows 11)

Thaddy

  • Hero Member
  • *****
  • Posts: 18911
  • Glad to be alive.
Re: Searching for words in a UTF8 string
« Reply #1 on: August 15, 2016, 12:41:54 pm »
It is called Pos()
Code: Pascal  [Select][+][-]
  1. program untitled;
  2. var
  3.   a,b:UTF8String;
  4. begin
  5.   a := 'Thaddy';b:='dd';
  6.   writeln(Pos(b,a));
  7. end.
« Last Edit: August 15, 2016, 01:04:29 pm by Thaddy »
Recovered from removal of tumor in tongue following tongue reconstruction with a part from my leg.

balazsszekely

  • Guest
Re: Searching for words in a UTF8 string
« Reply #2 on: August 15, 2016, 02:13:48 pm »
What about multiple occurrences? I would go with PosEx instead:
Code: Pascal  [Select][+][-]
  1. program untitled;
  2. uses strutils;
  3. var
  4.   a, b: UTF8String;
  5.   p, k: Integer;
  6. begin
  7.   a := 'ThaddyThaddy';
  8.   b := 'dd';
  9.   k := 1;
  10.   repeat
  11.     p := PosEx(b, a, k);
  12.     if p <> 0 then
  13.     begin
  14.       writeln(p);
  15.       k := p + Length(b);
  16.     end;
  17.   until p = 0;
  18.   readln;
  19. end.

Thaddy

  • Hero Member
  • *****
  • Posts: 18911
  • Glad to be alive.
Re: Searching for words in a UTF8 string
« Reply #3 on: August 15, 2016, 02:28:34 pm »
Yes, good idea. Wasn't clear from the question to me, though.
Recovered from removal of tumor in tongue following tongue reconstruction with a part from my leg.

Fungus

  • Sr. Member
  • ****
  • Posts: 354
Re: Searching for words in a UTF8 string
« Reply #4 on: August 15, 2016, 03:32:30 pm »
In order to find words beginning or ending with a phrase you must use regular expressions: http://wiki.freepascal.org/Regexpr Pos[Ex] will return any matches - also in the middle of words.

Thaddy

  • Hero Member
  • *****
  • Posts: 18911
  • Glad to be alive.
Re: Searching for words in a UTF8 string
« Reply #5 on: August 15, 2016, 03:39:36 pm »
In order to find words beginning or ending with a phrase you must use regular expressions: http://wiki.freepascal.org/Regexpr Pos[Ex] will return any matches - also in the middle of words.
Correction: you CAN - not must - use Regular Expressions. If the delimiter is defined you can also do it with pos, but anyway: provide us with the RegExpr ;)
Recovered from removal of tumor in tongue following tongue reconstruction with a part from my leg.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4709
  • I like bugs.
Re: Searching for words in a UTF8 string
« Reply #6 on: August 15, 2016, 03:53:41 pm »
In order to find words beginning or ending with a phrase you must use regular expressions: http://wiki.freepascal.org/Regexpr Pos[Ex] will return any matches - also in the middle of words.
Not really. This code handles the word boundaries:
Code: Pascal  [Select][+][-]
  1. procedure SearchInWordBoundaries(const SubStr, S: string);
  2. var
  3.   p, k: Integer;
  4. begin
  5.   k := 1;
  6.   repeat
  7.     p := PosEx(SubStr, S, k);
  8.     if p <> 0 then
  9.     begin
  10.       k := p + Length(SubStr);
  11.       if (p = 1) or (S[p-1] = ' ') or (k > Length(S)) or (S[k] = ' ') then
  12.         writeln(p);  // or maybe add it to a list
  13.     end;
  14.   until p = 0;
  15. end;
  16.  
Then call it like:
Code: Pascal  [Select][+][-]
  1. SearchInWordBoundaries('βγδ', 'αβγδεζηθ βγδ αβγδεζηθ');

The found values are byte offsets. If you need codepoint offsets, use the codepoint specific functions instead.

GetMem and Thaddy, you should not use UTF8String type although the encoding is UTF-8. It is explained here:
 http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#Why_not_use_UTF8String_in_Lazarus.3F
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Thaddy

  • Hero Member
  • *****
  • Posts: 18911
  • Glad to be alive.
Re: Searching for words in a UTF8 string
« Reply #7 on: August 15, 2016, 04:03:30 pm »
@Juha
I know but I am still not quite on solid ground regarding the string type that is actually used in Lazarus  O:-)
I'll get over it ;)
Recovered from removal of tumor in tongue following tongue reconstruction with a part from my leg.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4709
  • I like bugs.
Re: Searching for words in a UTF8 string
« Reply #8 on: August 15, 2016, 04:31:58 pm »
I know but I am still not quite on solid ground regarding the string type that is actually used in Lazarus  O:-)

Yes, it is not perfect and breaks code when you need the old Windows system codepages.
The good news is that everything works as magic by using String type when all your text is Unicode.
With little effort (and LazUnicode unit) it is also possible to write fully portable robust code between Lazarus and Delphi.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Fungus

  • Sr. Member
  • ****
  • Posts: 354
Re: Searching for words in a UTF8 string
« Reply #9 on: August 15, 2016, 09:09:25 pm »
This could be spiced up a bit:

Code: Pascal  [Select][+][-]
  1. procedure SearchInWordBoundaries(const SubStr, S: string);
  2. const cBlanks : set of Char = [#9,#10,#12,#13,#14,#32];
  3. var
  4.   p, k: Integer;
  5. begin
  6.   k := 1;
  7.   repeat
  8.     p := PosEx(SubStr, S, k);
  9.     if p <> 0 then
  10.     begin
  11.       k := p + Length(SubStr);
  12.       if (p = 1) or (S[p-1] in cBlanks) or (k > Length(S)) or (S[k] in cBlanks) then
  13.         writeln(p);  // or maybe add it to a list
  14.     end;
  15.   until p = 0;
  16. end;

You could also use regular expressions - it has it's advantages and it's drawbacks:

Code: Pascal  [Select][+][-]
  1. procedure FindWordsWithRegExpr;
  2. var S, T, R: string;
  3. begin
  4.  
  5.   //This is what we are looking for
  6.   S:= 'th';
  7.  
  8.   //This is what we are looking in
  9.   T:= 'This is a test string with no math that will yield three plus two matches!';
  10.  
  11.   //This will become our result
  12.   R:= '';
  13.  
  14.   //Words starting with S
  15.   with TRegExpr.Create('(^|[ \t\r\n\v\f])' + S) do try
  16.      ModifierG:= True; //Global / find all
  17.      ModifierI:= True; //Case insensitive
  18.      if Exec(T) then repeat
  19.        R:= R + 'Word starting with "' + S + '" found: ' + IntToStr(MatchPos[0]) + ' (' + Match[0] + ')' + #13#10;
  20.      until not ExecNext;
  21.   finally
  22.     Free;
  23.   end;
  24.  
  25.   //Words ending with S
  26.   with TRegExpr.Create(S + '($|[ \t\r\n\v\f])') do try
  27.      ModifierG:= True; //Global / find all
  28.      ModifierI:= True; //Case insensitive
  29.      if Exec(T) then repeat
  30.        R:= R + 'Word ending with "' + S + '" found: ' + IntToStr(MatchPos[0]) + ' (' + Match[0] + ')' + #13#10;
  31.      until not ExecNext;
  32.   finally
  33.     Free;
  34.   end;
  35.  
  36.   ShowMessage(R);
  37. end;

Major drawback is that the positions returned must be processed to get accurate string positions.

fedkad

  • Full Member
  • ***
  • Posts: 178
Re: Searching for words in a UTF8 string
« Reply #10 on: August 16, 2016, 10:35:26 am »
Thank all for your replies!

My code was already using utf8pos. It was something like this (very similar to the suggested one):

Code: Pascal  [Select][+][-]
  1. wordchecks := 0;
  2. pos := utf8Pos(ss, s, pos+1); // pos: last position where we left off...
  3. found := false;
  4. while pos > 0 do
  5. begin
  6.   begOK := true;
  7.   if cbSearchBegOfWord.Checked then
  8.   begin
  9.     wordchecks := wordchecks+1; //??
  10.     if pos>1 then
  11.       begOK := is_delimiter (utf8copy(s,pos-1,1));
  12.   end;
  13.   endOK := true;
  14.   if cbSearchEndOfWord.Checked then
  15.   begin
  16.     wordchecks := wordchecks+1; //??
  17.     if pos+utf8Length(ss) < utf8Length(s) then
  18.       endOK := is_delimiter (utf8copy(s,pos+utf8Length(ss),1));
  19.   end;
  20.   if begOK and endOK then
  21.   begin
  22.     found := true;
  23.     break
  24.   end;
  25.  
  26.   if wordchecks > 10000 then  //??
  27.   begin
  28.     showmessage('Too many word boundary checks ('+inttostr(wordchecks)+')!');
  29.     break
  30.   end;
  31.  
  32.   pos := utf8Pos(ss, s, pos+1);
  33. end; // while
  34.  
  35. if found then
  36.   ...
  37.  


However, this has a serious drawback. Assume that are a million (non-word-boundary) matches, but only one word boundary match. In such a scenario, the if checks in the above code will run millions of times, but only once they will return "found=true", which is sooo slow.

I will try also Fungus' suggestion, provided it supports UTF8 strings.

« Last Edit: August 16, 2016, 03:43:12 pm by fedkad »
Lazarus 4.0 / FPC 3.2.2 on x86_64-linux-gtk2 (Ubuntu/GNOME) and x86_64-win64-win32/win64 (Windows 11)

Thaddy

  • Hero Member
  • *****
  • Posts: 18911
  • Glad to be alive.
Re: Searching for words in a UTF8 string
« Reply #11 on: August 16, 2016, 03:01:12 pm »
What you need is a little knowledge about wheels, they happen to be round after all ;) : Boyer-Moore
There's plenty of code on the internet that does implement Boyer-Moore search for Delphi or Freepascal.
https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
Here's one but not necessaraly the best, just as example:
http://delphidabbler.com/tips/41
« Last Edit: August 16, 2016, 03:07:10 pm by Thaddy »
Recovered from removal of tumor in tongue following tongue reconstruction with a part from my leg.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4709
  • I like bugs.
Re: Searching for words in a UTF8 string
« Reply #12 on: August 16, 2016, 04:11:00 pm »
What you need is a little knowledge about wheels, they happen to be round after all ;) : Boyer-Moore
There's plenty of code on the internet that does implement Boyer-Moore search for Delphi or Freepascal.

Boyer-Moore may not help when he wants the word boundary check. Or maybe an "extended Boyer-Moore" algorithm should be created.
Also, Boyer-Moore may not help if a codepoint position is needed.

However, this has a serious drawback. Assume that are a million (non-word-boundary) matches, but only one word boundary match. In such a scenario, the if checks in the above code will run millions of times, but only once they will return "found=true", which is sooo slow.

I believe it is slow because you used utf8Length(), utf8Copy() and utf8Pos() shamelessly inside a loop! They are inherently slow.

Quote
I will try also Fungus' suggestion, provided it supports UTF8 strings.

You mean the function SearchInWordBoundaries which was improved from GetMem's (and my) code? Yes, it fully supports Unicode.
I hope you will get a similar "Heureka" moment than I got, realizing how often byte offsets can be used with UTF-8 code. Or, in more general terms, realizing how often codeunit offsets can be used with Unicode aware code. The same concept applies to UTF-16 as well.
Please see the examples here again and think why the fast Pos(), Copy() and Length() work so well:
 http://wiki.freepascal.org/UTF8_strings_and_characters
The "secret" is to use String type also for single codepoints.

A question is, do you need codepoint offsets or are the byte offsets enough? For example if you copy text near the found patterns then byte offsets should be enough. The code is very fast.
If you really need codepoint offsets then you should rewrite the code to iterate codepoints but still use the fast Pos(), Copy() and Length() as much as possible.
That is a nice optimization challenge. :)
« Last Edit: August 16, 2016, 05:06:02 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Searching for words in a UTF8 string
« Reply #13 on: August 16, 2016, 05:22:18 pm »
However, this has a serious drawback. Assume that are a million (non-word-boundary) matches, but only one word boundary match. In such a scenario, the if checks in the above code will run millions of times, but only once they will return "found=true", which is sooo slow.

I believe it is slow because you used utf8Length(), utf8Copy() and utf8Pos() shamelessly inside a loop! They are inherently slow.

Using cbSearchBegOfWord.Checked & cbSearchEndOfWord.Checked inside that loop will make it slow as well.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4709
  • I like bugs.
Re: Searching for words in a UTF8 string
« Reply #14 on: August 16, 2016, 05:37:37 pm »
Using cbSearchBegOfWord.Checked & cbSearchEndOfWord.Checked inside that loop will make it slow as well.

Yes, try to separate logic and GUI from each other.
fedkad, your code is a good example of how it should not be done. :)
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

 

TinyPortal © 2005-2018