What you need is a little knowledge about wheels; they happen to be round, after all: Boyer-Moore.
There's plenty of code on the internet that implements Boyer-Moore search for Delphi or Free Pascal.
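For reference, here is a minimal Boyer-Moore-Horspool sketch (the simplified Boyer-Moore variant) in Free Pascal. The name BMHSearch and the test strings are mine, not taken from any of those libraries. It works purely on byte offsets, so it can be used on UTF-8 data unchanged:

program BmhDemo;
{$mode objfpc}{$H+}

// Returns the 1-based byte offset of the first occurrence of Pattern
// in Text, or 0 if there is no match.
function BMHSearch(const Text, Pattern: AnsiString): SizeInt;
var
  Shift: array[0..255] of SizeInt;
  i, j, PatLen, TxtLen: SizeInt;
begin
  Result := 0;
  PatLen := Length(Pattern);
  TxtLen := Length(Text);
  if (PatLen = 0) or (TxtLen < PatLen) then Exit;
  // Bad-character table: how far the window may shift when the byte
  // under the pattern's last position does not lead to a match.
  for i := 0 to 255 do
    Shift[i] := PatLen;
  for i := 1 to PatLen - 1 do
    Shift[Ord(Pattern[i])] := PatLen - i;
  i := 1;
  while i + PatLen - 1 <= TxtLen do
  begin
    // Compare the window right to left.
    j := PatLen;
    while (j >= 1) and (Text[i + j - 1] = Pattern[j]) do
      Dec(j);
    if j = 0 then
      Exit(i);  // full match at byte offset i
    Inc(i, Shift[Ord(Text[i + PatLen - 1])]);
  end;
end;

begin
  // Prints 15: the match starts at byte 15.
  WriteLn(BMHSearch('abacadabrabracabracadabrabrabracad', 'abracadabra'));
end.

The bad-character table lets the window skip up to Length(Pattern) bytes per mismatch, which is what makes this family of algorithms sublinear in practice.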
Boyer-Moore may not help when he wants the word-boundary check. Or maybe an "extended Boyer-Moore" algorithm would have to be created.
Also, Boyer-Moore may not help if a codepoint position is needed.
However, this has a serious drawback. Assume there are a million (non-word-boundary) matches but only one word-boundary match. In such a scenario the if checks in the above code will run millions of times but return found=true only once, which is sooo slow.
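For what it's worth, the boundary test itself can be made O(1) per candidate match when it works on bytes. A sketch, with the simplifying assumption that only ASCII letters, digits and the underscore count as word characters; the function names are mine:

program BoundaryDemo;
{$mode objfpc}{$H+}

// Assumption: a "word byte" is an ASCII letter, digit or underscore.
// Every byte of a multi-byte UTF-8 codepoint is >= #$80, so non-ASCII
// codepoints are treated as boundaries here; that is a simplification.
function IsWordByte(c: Char): Boolean;
begin
  Result := c in ['A'..'Z', 'a'..'z', '0'..'9', '_'];
end;

// O(1) per candidate: test the single byte before and after the match.
function MatchAtWordBoundary(const S: string; BytePos, ByteLen: SizeInt): Boolean;
begin
  Result := ((BytePos = 1) or not IsWordByte(S[BytePos - 1])) and
            ((BytePos + ByteLen > Length(S)) or not IsWordByte(S[BytePos + ByteLen]));
end;

begin
  WriteLn(MatchAtWordBoundary('a cat sat', 3, 3));  // TRUE:  "cat" is a word
  WriteLn(MatchAtWordBoundary('scatter', 2, 3));    // FALSE: "cat" inside a word
end.

With a test like this, even a million rejected candidates cost a million byte comparisons, not a million UTF8 scans.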
I believe it is slow because you used utf8Length(), utf8Copy() and utf8Pos() shamelessly inside a loop! They are inherently slow: each call has to decode the string from the first byte again, because UTF-8 allows no random access by codepoint index, so a loop over the whole text becomes quadratic.
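A sketch of the linear alternative: walk the text once with a PChar and handle each codepoint as a short String. This assumes the LazUTF8 unit and a source file saved as UTF-8; UTF8CodepointSize is the current LazUTF8 name (older versions call it UTF8CharacterLength):

program Utf8WalkDemo;
{$mode objfpc}{$H+}
uses LazUTF8;
var
  S, Ch: string;
  P: PChar;
  CpLen: Integer;
begin
  S := 'abc äöü €';
  P := PChar(S);
  while P^ <> #0 do
  begin
    CpLen := UTF8CodepointSize(P);  // byte length of the codepoint at P
    SetString(Ch, P, CpLen);        // one codepoint, kept in a String
    WriteLn(Ch);
    Inc(P, CpLen);                  // advance by bytes: O(n) over the text
  end;
end.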
I will also try Fungus' suggestion, provided it supports UTF8 strings.
You mean the function SearchInWordBoundaries, which was improved from GetMem's (and my) code? Yes, it fully supports Unicode.
I hope you will get a similar "Heureka" moment to the one I got, realizing how often byte offsets can be used with UTF-8 code. Or, in more general terms, realizing how often code-unit offsets can be used in Unicode-aware code; the same concept applies to UTF-16 as well.
Please see the examples here again and think about why the fast Pos(), Copy() and Length() work so well:
http://wiki.freepascal.org/UTF8_strings_and_characters
The "secret" is to use the String type also for single codepoints.
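A small demonstration of that secret (a sketch; the sample strings are mine, and the source file is assumed to be saved as UTF-8): UTF-8 is self-synchronizing, meaning the encoding of one codepoint can never occur inside the encoding of another, so the plain byte-based Pos(), Copy() and Length() already give correct results:

program ByteOffsetDemo;
{$mode objfpc}{$H+}
var
  S, Needle: string;
  BytePos: SizeInt;
begin
  S := 'Größe: 10 µm';            // UTF-8 encoded source text
  Needle := 'µ';                  // a single codepoint, stored in a String
  BytePos := Pos(Needle, S);      // plain byte-based Pos, no UTF8Pos needed
  if BytePos > 0 then
    WriteLn(Copy(S, BytePos, Length(Needle)));  // prints 'µ' again
end.

Note that BytePos and Length(Needle) are byte counts, not codepoint counts; as long as you only feed them back into Pos() and Copy(), that never matters.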
The question is: do you need codepoint offsets, or are byte offsets enough? For example, if you copy text near the found patterns, then byte offsets are enough. The code is very fast.
If you really need codepoint offsets, then you should rewrite the code to iterate over codepoints while still using the fast Pos(), Copy() and Length() as much as possible.
That is a nice optimization challenge.
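One possible starting point, as a sketch (the names and sample strings are mine, not from the thread): keep searching with the byte-based Pos(), and convert each byte offset to a codepoint offset by running UTF8Length() only over the slice since the previous match, so the whole pass stays linear. Assumes the LazUTF8 unit and FPC 3.0+ for the three-argument Pos():

program CpOffsetDemo;
{$mode objfpc}{$H+}
uses LazUTF8;
var
  S, Needle: string;
  BytePos, PrevByte, PrevCp, CpPos: SizeInt;
begin
  S := 'αβγ cat δε cat ζ';
  Needle := 'cat';
  PrevByte := 1;   // byte offset where the previous slice ended
  PrevCp := 0;     // codepoints consumed before PrevByte
  BytePos := Pos(Needle, S);
  while BytePos > 0 do
  begin
    // Count codepoints only in the slice since the previous match,
    // so the whole scan stays O(n) overall.
    CpPos := PrevCp + UTF8Length(Copy(S, PrevByte, BytePos - PrevByte)) + 1;
    WriteLn('byte offset ', BytePos, ' -> codepoint offset ', CpPos);
    PrevCp := CpPos - 1;
    PrevByte := BytePos;
    BytePos := Pos(Needle, S, BytePos + 1);  // resume from a byte offset
  end;
end.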
