Recent

Author Topic: Code-Review Request: Alternative to PosSetEx from StrUtils  (Read 1565 times)

Zvoni

  • Hero Member
  • *****
  • Posts: 2690
Code-Review Request: Alternative to PosSetEx from StrUtils
« on: March 27, 2023, 03:24:40 pm »
Hi Folks,
for the last 20 years (VB6/VBA), 2 of my favorite (Windows-) API's are "StrSpn" and "StrCSpn", which are "classics" for tokenizing strings.
For this thread, i'm concentrating on "StrCSpn"

The C Non-Unicode (single-byte) String-Version uses some clever mapping with the extended ASCII-Set
(see here: https://codebrowser.dev/glibc/glibc/string/strcspn.c.html),
but nowadays that's obviously not enough.
So i've looked at alternative implementations (like here: https://android.googlesource.com/platform/bionic/+/a27d2baa/libc/string/strcspn.c)

Bleh, they are basically throwing each character of Source at the "Rejects", checking if it apears in Rejects.

And that, ladies and gentlemen, rubs me the wrong way.
Why? Imagine a really, really long string (say, some 10K or more characters), and the first Reject appears somewhere to the end, or worse, no Reject in Source at all.

Now, i've been looking for something along those lines in Freepascal, and found PosSetEx from StrUtils to come closest to it.
Looking at the Source-Code of PosSetEx, color me surprised: It's the same implementation!!
Walking character by character through the Source checking if it exists in the (converted to a) Set of Chars we want to stop at.

So i wrote my own "StrCSpn", and as the thread-title says, i'm requesting code-review

Note 1: I've chosen "Longword" for the simple reason that i was looking for an unsigned Integer.
Note 2: There're 2 major differences between StrCSpn and PosSetEx
Difference 1) StrCSpn returns how many characters of Source appear BEFORE a Reject-Character. PosSetEx returns the Position of the first Reject-Character
Difference 2) StrCSpn returns the Length of Source in case no Reject-Character has been found. PosSetEx returns 0 in this case

Improvements and/or alternatives are welcome.
If anyone has a Testsuite to throw against, i'd welcome any performance- and/or bug-reports

i hope my comments are sufficient

Code: Pascal  [Select][+][-]
  1. //The "original" StrCSpn-C-Implementation --> corresponds to "PosSetEx" in StrUtils
  2. //throws the whole Source-String char-by-char at "Reject", checking if it appears there
  3. //if yes, return that found position --> WTF??
  4. //Looking at the Source-code of PosSetEx in strutils.pp it's doing the same
  5. //If the Source-String is really long, and the first occurence of a "Reject"-Character
  6. //is very late........ talk about wasted performance
  7. //Note: StrCSpn is "one off" compared to PosSetEx
  8. function StrCSpn(const Source:String;const Reject:String):LongWord;
  9. Var
  10.   i:LongWord;
  11.   f:LongWord=0;  //Counter for "found rejects"
  12.   a:LongWord;
  13.   ls:LongWord;
  14.   lr:LongWord;
  15.   r:Array Of LongWord;
  16.  
  17.   Function Min(Const AArray:Array Of LongWord):LongWord;
  18.   Var
  19.     i:LongWord;
  20.   Begin
  21.     //If there is only one Member, return it immediately
  22.     If Length(AArray)=1 Then Exit(AArray[0]);
  23.     //Setting Result to failure
  24.     //To conform with PosSetEx the next line should read:
  25.     //Result:=ls;
  26.     Result:=ls+1;
  27.     For i:=Low(AArray) To High(AArray) Do
  28.       If AArray[i]<Result Then Result:=AArray[i];
  29.   end;
  30.  
  31. begin
  32.   //Sanity-Check: Whyever would you pass an empty Source or Reject-String??
  33.   ls:=Length(Source);
  34.   lr:=Length(Reject);
  35.   If ls=0 Then Exit(0);
  36.   If lr=0 Then Exit(ls);
  37.   //We're only interested in the first occurence of each "Reject"-Character
  38.   //So the Max for "first occurences" is the length of Reject -->
  39.   //each character in reject appears at least once
  40.   SetLength(r,lr);
  41.   For i:=1 To lr Do
  42.     Begin
  43.       //Getting the first Position of each "Reject"-Character
  44.       //There might be "0" in there if a "Reject"-Character wasn't found.
  45.       //We're excluding them
  46.       a:=Pos(Reject[i],Source);
  47.       If a>0 Then
  48.         Begin
  49.           r[f]:=a;
  50.           Inc(f);
  51.         end;
  52.     end;
  53.   //Cutting off unneeded members, since they're initialized with "0"
  54.   SetLength(r,f);
  55.   //Looking for the lowest Value, We already excluded the "not found"
  56.   //Only makes sense to send at "Min" if at least 1 Reject-Character has been found
  57.   //Otherwise return length of Source
  58.   //To conform with PosSetEx the next line should read:
  59.   //If Length(r)>0 Then a:=Min(r) Else a:=0;
  60.   If Length(r)>0 Then a:=Min(r)-1 Else a:=ls;
  61.   Result:=a;
  62. end;
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 10305
  • Debugger - SynEdit - and more
    • wiki
Re: Code-Review Request: Alternative to PosSetEx from StrUtils
« Reply #1 on: March 27, 2023, 04:10:16 pm »
Have you considered a "set of char" or "array [char] of boolean" ?
Map every reject into the set/array, and the iterating the string once.

Currently if you have 50 reject chars, you have to do 50 times "pos()" (and if that has to go over a 10k string each time)...
I don't know how well optimized "pos" is compared to a custom loop, but as the reject count grows it will become more expensive.

Zvoni

  • Hero Member
  • *****
  • Posts: 2690
Re: Code-Review Request: Alternative to PosSetEx from StrUtils
« Reply #2 on: March 27, 2023, 04:51:50 pm »
Have you considered a "set of char" or "array [char] of boolean" ?
Map every reject into the set/array, and the iterating the string once.

Currently if you have 50 reject chars, you have to do 50 times "pos()" (and if that has to go over a 10k string each time)...
I don't know how well optimized "pos" is compared to a custom loop, but as the reject count grows it will become more expensive.
Thx Martin.
Honestly? I’m not that firm in Sets to „judge“ it.
And the array of Boolean is basically the way the „original“ StrCSpn was implemented.
Would have to think about it.

Yes, correct with the times to call Pos, but OTOH, how is 2345 times calling the In-Operator against the Set better until you hit a „positive“?
« Last Edit: March 27, 2023, 04:57:00 pm by Zvoni »
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 10305
  • Debugger - SynEdit - and more
    • wiki
Re: Code-Review Request: Alternative to PosSetEx from StrUtils
« Reply #3 on: March 27, 2023, 05:50:14 pm »
Have you considered a "set of char" or "array [char] of boolean" ?
Map every reject into the set/array, and the iterating the string once.

Currently if you have 50 reject chars, you have to do 50 times "pos()" (and if that has to go over a 10k string each time)...
I don't know how well optimized "pos" is compared to a custom loop, but as the reject count grows it will become more expensive.
Thx Martin.
Honestly? I’m not that firm in Sets to „judge“ it.
And the array of Boolean is basically the way the „original“ StrCSpn was implemented.
Would have to think about it.

Yes, correct with the times to call Pos, but OTOH, how is 2345 times calling the In-Operator against the Set better until you hit a „positive“?

The array and the set are basically the same.
- The set is 1 bit per char (and with that the need to calculate the bit pos).
- The array (if you do "array [char] of bytebool") is one byte, and you can use the char as index.

Since a 256 byte array easily fits in the cpu cache, this may even be faster, but no idea / not tested.

Using the array, is ONE loop, exactly like calling "pos" once and once only.
The diff between the array, and "pos" is that
- pos does "text[n] = chr"
- array does  "array[text[n]]" (returns bool)
All else should be the same.

So technically the array does an extra memory lookup (from cpu cache). I would guess that most modern cpu will execute this in the same time (since they do multiple instructions per tick, and the time killer is not the lookup but the loop condition).
But even if that takes 10% longer (which I figure to be an extreme high value), then the overall result is still faster unless you are finding the reject in the first run of pos for at least 90% of the calls.

Of course, that all assumes that
- pos is not faster because it is coded in asm
- your loop is as well optimized by the compiler as pos may be.
- you can spare 256 bytes on the stack for the time of the function running (or spare them on the heap or in the data segment)

Zvoni

  • Hero Member
  • *****
  • Posts: 2690
Re: Code-Review Request: Alternative to PosSetEx from StrUtils
« Reply #4 on: March 27, 2023, 06:26:44 pm »
How would that 256 byte array work with multi byte strings?
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 10305
  • Debugger - SynEdit - and more
    • wiki
Re: Code-Review Request: Alternative to PosSetEx from StrUtils
« Reply #5 on: March 27, 2023, 06:54:59 pm »
How would that 256 byte array work with multi byte strings?

Well, his current code (from his post) searches each char (byte sized char) using pos. (Of course, if he changes to widestring, then an array for dealing with widechar would not work so well (64 KB of data, or bitwise still 8KB).

Depending on the use case, search the first byte of each of the given byte-sequences. Then check all sequences starting with that byte. Though that is not optimal, especially if you search for substrings (where there may be more than only a 2nd byte).

For widestring you could also go and be optimistic, and have TWO 256 byte arrays (for the 2 bytes of each widechar), and then if both bytes are in some-reject char, verify if it is a valid combination. That will work for a small list of reject chars, where you can expect few false positives. But if you have lots of "reject chars" then you may hit almost every widechar, and it becomes really inefficient.

You can use "tries" https://en.wikipedia.org/wiki/Trie.  And there is some info on how to avoid backtracing https://en.wikipedia.org/wiki/String-searching_algorithm

jamie

  • Hero Member
  • *****
  • Posts: 6577
Re: Code-Review Request: Alternative to PosSetEx from StrUtils
« Reply #6 on: March 27, 2023, 11:38:21 pm »
I don't see what's wrong with PosSetEx  except that the parameters are backwards verses the C version.
The only true wisdom is knowing you know nothing

Zvoni

  • Hero Member
  • *****
  • Posts: 2690
Re: Code-Review Request: Alternative to PosSetEx from StrUtils
« Reply #7 on: March 28, 2023, 08:33:56 am »
I don't see what's wrong with PosSetEx  except that the parameters are backwards verses the C version.
I didn't say there is something wrong with PosSetEx.
I said it rubs me the wrong way how it's implemented, sequentially throwing each character of Source against the Rejects checking if it's in there.

Say i have a Source-String with 2846 Characters, and i throw it against Rejects with, say, 8 Characters, and the first Reject out of the 8 is at Position 2235.
PosSetEx does 2234 compares against the rejects, while my algorithm does 8 "Pos"-Calls with 8 compares in the Min-Function (worst case).

I'm still looking for a suitable testsuite with such strings
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

Zvoni

  • Hero Member
  • *****
  • Posts: 2690
Re: Code-Review Request: Alternative to PosSetEx from StrUtils
« Reply #8 on: March 28, 2023, 08:49:11 am »
I don't know how well optimized "pos" is compared to a custom loop, but as the reject count grows it will become more expensive.
Just found the Source for Pos.
Basically it runs through the string comparing the character, albeit as a pointer
You might be right that Pos is "expensive"
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11771
  • FPC developer.
Re: Code-Review Request: Alternative to PosSetEx from StrUtils
« Reply #9 on: March 28, 2023, 09:03:02 am »
Say i have a Source-String with 2846 Characters, and i throw it against Rejects with, say, 8 Characters, and the first Reject out of the 8 is at Position 2235.
PosSetEx does 2234 compares against the rejects, while my algorithm does 8 "Pos"-Calls with 8 compares in the Min-Function (worst case).

Always check best and worst case scenarios for each implementation. If you have ten characters in the set, and the only match is with the last one on pos 2235, your routine examines 2846*9+2235 characters with pos.

When walking a string multiple times (the various POS() for each set item) Things get even worse as soon as the string is very long, and when walking the string, the first bits go out of the cache.

The existing set does only 2235 more involved comparisons with minimal memory caching effects.


Jamie: the function originates in Modula2. But I think I styled the parameter order after existing pascal/delphi, which generally does substr,str,index

Zvoni

  • Hero Member
  • *****
  • Posts: 2690
Re: Code-Review Request: Alternative to PosSetEx from StrUtils
« Reply #10 on: March 28, 2023, 09:28:07 am »
Say i have a Source-String with 2846 Characters, and i throw it against Rejects with, say, 8 Characters, and the first Reject out of the 8 is at Position 2235.
PosSetEx does 2234 compares against the rejects, while my algorithm does 8 "Pos"-Calls with 8 compares in the Min-Function (worst case).

Always check best and worst case scenarios for each implementation. If you have ten characters in the set, and the only match is with the last one on pos 2235, your routine examines 2846*9+2235 characters with pos.

When walking a string multiple times (the various POS() for each set item) Things get even worse as soon as the string is very long, and when walking the string, the first bits go out of the cache.

The existing set does only 2235 more involved comparisons with minimal memory caching effects.
Damn!
Thx marco for clearing that up.
Though my algorithm looks nice, it's way more expensive, if explained that way.
Thank you.
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

 

TinyPortal © 2005-2018