Recent

Author Topic: A faster alternative to PosEx?  (Read 7191 times)

Solstice

  • New member
  • *
  • Posts: 8
A faster alternative to PosEx?
« on: September 17, 2020, 01:54:41 pm »
Greetings!

I've written a general exact pattern matching algorithm that beats PosEx pretty easily,
and I've been wondering if there are faster implementations of PosEx out there.

My code isn't restricted to letters etc, but PosEx is the only thing I know that's actually fast.
I have nothing else to compare my code's performance against.

Everything I've found is inferior to PosEx.

I've tried regexpr,
StrStr (winapi),
FindMatchesBoyerMooreCaseSensitive, (which is implemented in FPC by default)
and some apparently rather badly written Rabin-Karp implementation, which is slow.

Is there anything out there that's faster than PosEx and easy enough to use so I can compare it to my own code?
I know there's plenty of C code on github related to the topic, but I don't know C at all.
The time spent with this might be far too much than it's worth.

What I need is a way to benchmark my code in comparison to others.

Thanks!
« Last Edit: September 17, 2020, 02:36:39 pm by Solstice »

Bi0T1N

  • Jr. Member
  • **
  • Posts: 85
Re: A faster alternative to PosEx?
« Reply #1 on: September 17, 2020, 04:44:56 pm »
If you're searching for other PosEx implementations to compare with, check out mORMot's implementation or the FastcodePosExUnit. Maybe also compare different code pages.
For RegEx you should have a look at FLRE.
Otherwise you could compare against CompareMem where mORMot also offers a fast version.

Solstice

  • New member
  • *
  • Posts: 8
Re: A faster alternative to PosEx?
« Reply #2 on: September 17, 2020, 04:48:23 pm »
Thanks, I'll check them out!

Thaddy

  • Hero Member
  • *****
  • Posts: 18503
  • Here stood a man who saw the Elbe and jumped it.
Re: A faster alternative to PosEx?
« Reply #3 on: September 17, 2020, 04:50:37 pm »
Pascal strings allow for embedded zero's.....You did not test that, did you?
« Last Edit: September 17, 2020, 04:52:49 pm by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

Solstice

  • New member
  • *
  • Posts: 8
Re: A faster alternative to PosEx?
« Reply #4 on: September 17, 2020, 05:03:30 pm »
I'm not sure what you mean or why you bring it up.
Could you explain?

The Needle I'm searching for is a regular String for both my own code and PosEx.

The time-difference on my machine, between my code and PosEx,
for an array of two gigabytes in size, is roughly 400 milliseconds.

For short Haystacks, like 1024 elements, it's twice as fast as PosEx.
I haven't made a proper benchmark yet,
exactly because I hoped to find something that's faster than PosEx that I could benchmark my code against.

Considering the difference, do you think it matters?

I admit, I might be doing it all wrong. I've written a general exact-pattern-matching routine
and PosEx is limited to only ASCII (I think), but I really don't know what other function (library?) I could use instead.

PosEx is simply the only thing I know (or can find) that's for pascal, easy to use and fast.
« Last Edit: September 17, 2020, 05:07:12 pm by Solstice »

Solstice

  • New member
  • *
  • Posts: 8
Re: A faster alternative to PosEx?
« Reply #5 on: September 17, 2020, 05:26:47 pm »
If you're searching for other PosEx implementations to compare with, check out mORMot's implementation or the FastcodePosExUnit. Maybe also compare different code pages.
For RegEx you should have a look at FLRE.
Otherwise you could compare against CompareMem where mORMot also offers a fast version.

Wow, people write messy code.

The FastcodePos... needs to be rewritten, because it doesn't compile for 64bit.
I had a look at both codes and ... it's a mess :D ... but that doesn't matter.

I believe the sole reason, why PosEx is so fast, is because it uses the heavily optimized IndexByte function,
asm/SSE, so these can't actually keep up. FastCode... has some 32bit assembly functions,
but they're all riddled with branches. This isn't going to be faster than the current PosEx in FPC,
which means it's also not going to be faster than my own code.

Thanks, though!
« Last Edit: September 17, 2020, 05:30:48 pm by Solstice »

Thaddy

  • Hero Member
  • *****
  • Posts: 18503
  • Here stood a man who saw the Elbe and jumped it.
Re: A faster alternative to PosEx?
« Reply #6 on: September 17, 2020, 05:58:13 pm »
I'm not sure what you mean or why you bring it up.
Could you explain?
Of course.
'whathever'#0'never'#0'mind' is a valid Pascal string, so using C derived functions - like strstr, strlen and family - will fail.
Consider this full program:
Code: Pascal  [Select][+][-]
  1. begin
  2.    writeln('whathever'#0'never'#0'mind');
  3.    writeln(strlen('whaterever'#0'never'#0'mind'));
  4. end.
Where is 'never' or for that matter 'mind'?
C'sims are highly dangerous and pure pascal is often faster with strings.
« Last Edit: September 17, 2020, 06:11:48 pm by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

Solstice

  • New member
  • *
  • Posts: 8
Re: A faster alternative to PosEx?
« Reply #7 on: September 17, 2020, 06:09:29 pm »
I've found this site.

http://rvelthuis.blogspot.com/2018/01/strings-on-other-platforms-than-32-bit.html,

It says ...

Quote
One very obvious example of this is the Pos function, which searches if a certain string (I call that the Needle) can be found in a larger one (the Haystack). The Win32 implementation is in highly optimized assembler, written by Aleksandr Sharahov from the FastCode project, and licensed by CodeGear. The Win64 implementation is in plain Pascal (PUREPASCAL). But the implementation for UnicodeString is not the same, or even similar, to the implementation for AnsiString!

Apparently it's this function: https://www.delphipraxis.net/605526-post9.html

But that's not faster than FPC's own PosEx, which I believe is actually written by the same guy,
but I don't know. PosEx uses IndexByte, which can be found here: https://github.com/graemeg/freepascal/blob/master/rtl/x86_64/x86_64.inc

PosEx: https://github.com/graemeg/freepascal/blob/master/packages/rtl-objpas/src/inc/strutils.pp

I'm beginning to wonder if there isn't actually anything faster out there. I mean,
if there would, why would it not be implemented?

Solstice

  • New member
  • *
  • Posts: 8
Re: A faster alternative to PosEx?
« Reply #8 on: September 17, 2020, 06:10:43 pm »
I'm not sure what you mean or why you bring it up.
Could you explain?
Of course.
'whatherever'#0'never'#0'mind' is a valid Pascal string, so using C derived functions - like strstr, strlen and family - will fail.
Consider this full program:
Code: Pascal  [Select][+][-]
  1. begin
  2.    writeln('whatherever'#0'never'#0'mind');
  3.    writeln(strlen('whatherever'#0'never'#0'mind'));
  4. end.
Where is 'never' or for that matter 'mind'?
C'sims are highly dangerous and pure pascal is often faster with strings.

Ah! I didn't know, thanks for bringing it up!
The Haystack is made up of [A-Z] only, because that's what PosEx likes.

Also ... what's a C'sim and how do I pronounce it?
« Last Edit: September 17, 2020, 06:12:42 pm by Solstice »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12563
  • FPC developer.
Re: A faster alternative to PosEx?
« Reply #9 on: September 17, 2020, 06:10:48 pm »
The fastcode stuff seems pretty old, all before Core-2.

Solstice

  • New member
  • *
  • Posts: 8
Re: A faster alternative to PosEx?
« Reply #10 on: September 17, 2020, 06:16:58 pm »
The fastcode stuff seems pretty old, all before Core-2.

FastCode struck me as really weird. I've looked through some code there and I can't say that I would have ever consider those I've looked at as *fast*,
and worse, the website doesn't actually make it easy or obvious to compare their code with ones own.

I admit, though, that I didn't even spend an hour on it.

Thaddy

  • Hero Member
  • *****
  • Posts: 18503
  • Here stood a man who saw the Elbe and jumped it.
Re: A faster alternative to PosEx?
« Reply #11 on: September 17, 2020, 06:20:28 pm »
It is usually the algorithm that is fast. And for particular CPU's you can examine the assembler output.
You will be surprised when tinkering with all the possible compiler settings for a certain CPU.
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

Solstice

  • New member
  • *
  • Posts: 8
Re: A faster alternative to PosEx?
« Reply #12 on: September 17, 2020, 07:21:22 pm »
It is usually the algorithm that is fast. And for particular CPU's you can examine the assembler output.

That's fair and you're right, of course. Still,

Quote
You will be surprised when tinkering with all the possible compiler settings for a certain CPU.

I usually write my own assembly code and often wonder about the compilers' choices in regards to the pascal code around it,
but I don't write code "for the compiler" so sometimes I wonder why it does what it does when I apply the -O flags,
because often enough my code runs slower when I do so.

Anyhow, PosEx.

Is it an implementation of the gold standard of substring searching/pattern matching algorithms?

Is there nothing that's faster and easily available?
Google is giving me a hard time finding a fast String library for pascal.

I've been looking at KMP and Rabin-Karp, but the implementations I have don't beat PosEx,
or I'm doing something wrong.

All I really need is something that's faster running than my own pattern matching technique.
I don't want source code, just something to benchmark against, so I can beat that one too.

The fact that there's apparently no widely known library everyone uses is a bit perplexing,
further fueling my suspicion that everyone uses PosEx anyway.
« Last Edit: September 17, 2020, 07:25:02 pm by Solstice »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12563
  • FPC developer.
Re: A faster alternative to PosEx?
« Reply #13 on: September 17, 2020, 08:19:53 pm »
It is usually the algorithm that is fast.

This is posex. Basically is "search for byte in a memory block" and if the first char matches then "compare block" in memory.

Both are assembler level primitives.

BobDog

  • Sr. Member
  • ****
  • Posts: 394
Re: A faster alternative to PosEx?
« Reply #14 on: September 17, 2020, 11:27:42 pm »

I can alter my tally function (with a flag), to return the first position.
The position is zero based, is PosEx one based?, I don't know, never used it, you may have a difference of 1.
String nearly 59 million digits.
Code: Pascal  [Select][+][-]
  1.  
  2.  
  3. program tally;
  4.  
  5.  
  6.  function tally(position:longword;somestring:pchar;partstring:pchar;flag:integer):longword;
  7. var
  8. i,j,ln,lnp,count:longword;
  9. label
  10. skip;
  11. begin
  12. ln:=length(somestring);
  13. lnp:=length(partstring);
  14. count:=0;
  15. for i:=position to ln-1 do
  16. begin
  17.    if somestring[i] <> partstring[0] then goto skip ;
  18.      if somestring[i] = partstring[0] then
  19.      begin
  20.      for j:=0 to lnp-1 do
  21.      begin
  22.      if somestring[j+i]<>partstring[j] then goto skip;
  23.      end;
  24.      if flag<>0 then exit(i);
  25.       count+=1;
  26.      end ;
  27.    skip:
  28. end;
  29.   tally:=count;
  30. end; {tally}
  31.  
  32.  
  33.  
  34.  var
  35.  p:pchar;
  36.  s:ansistring;
  37.  i:integer;
  38.  
  39.  begin
  40.  s:='abababa' ;        // create a string
  41.  for i:=1 to 23 do
  42.  begin
  43.  if (i=19) then   s+='Q';
  44.  s+=s;
  45.  end;
  46.  
  47.  p:=pchar(s);        // cast
  48.  
  49.   writeln('string length  =  ',length(p));
  50.  
  51.  
  52.  writeln('Number of occurencies of character Q = ',tally(0,p,'Q',0));   // 0 flag
  53.  
  54.  writeln;
  55.  writeln('First occurence of character Q = ',tally(0,p,'Q',1) ) ;  // non zero flag
  56.  
  57.  
  58.  writeln('Press enter to end');
  59.  readln;
  60.  
  61.  end.
  62.  
  63.  
  64.      

 

TinyPortal © 2005-2018