Forum > General
Code-Review Request: Alternative to PosSetEx from StrUtils
Zvoni:
Hi Folks,
for the last 20 years (VB6/VBA), 2 of my favorite (Windows-) API's are "StrSpn" and "StrCSpn", which are "classics" for tokenizing strings.
For this thread, i'm concentrating on "StrCSpn"
The C Non-Unicode (single-byte) String-Version uses some clever mapping with the extended ASCII-Set
(see here: https://codebrowser.dev/glibc/glibc/string/strcspn.c.html),
but nowadays that's obviously not enough.
So i've looked at alternative implementations (like here: https://android.googlesource.com/platform/bionic/+/a27d2baa/libc/string/strcspn.c)
Bleh, they are basically throwing each character of Source at the "Rejects", checking if it apears in Rejects.
And that, ladies and gentlemen, rubs me the wrong way.
Why? Imagine a really, really long string (say, some 10K or more characters), and the first Reject appears somewhere to the end, or worse, no Reject in Source at all.
Now, i've been looking for something along those lines in Freepascal, and found PosSetEx from StrUtils to come closest to it.
Looking at the Source-Code of PosSetEx, color me surprised: It's the same implementation!!
Walking character by character through the Source checking if it exists in the (converted to a) Set of Chars we want to stop at.
So i wrote my own "StrCSpn", and as the thread-title says, i'm requesting code-review
Note 1: I've chosen "Longword" for the simple reason that i was looking for an unsigned Integer.
Note 2: There're 2 major differences between StrCSpn and PosSetEx
Difference 1) StrCSpn returns how many characters of Source appear BEFORE a Reject-Character. PosSetEx returns the Position of the first Reject-Character
Difference 2) StrCSpn returns the Length of Source in case no Reject-Character has been found. PosSetEx returns 0 in this case
Improvements and/or alternatives are welcome.
If anyone has a Testsuite to throw against, i'd welcome any performance- and/or bug-reports
i hope my comments are sufficient
--- Code: Pascal [+][-]window.onload = function(){var x1 = document.getElementById("main_content_section"); if (x1) { var x = document.getElementsByClassName("geshi");for (var i = 0; i < x.length; i++) { x[i].style.maxHeight='none'; x[i].style.height = Math.min(x[i].clientHeight+15,306)+'px'; x[i].style.resize = "vertical";}};} ---//The "original" StrCSpn-C-Implementation --> corresponds to "PosSetEx" in StrUtils//throws the whole Source-String char-by-char at "Reject", checking if it appears there//if yes, return that found position --> WTF??//Looking at the Source-code of PosSetEx in strutils.pp it's doing the same//If the Source-String is really long, and the first occurence of a "Reject"-Character//is very late........ talk about wasted performance//Note: StrCSpn is "one off" compared to PosSetExfunction StrCSpn(const Source:String;const Reject:String):LongWord;Var i:LongWord; f:LongWord=0; //Counter for "found rejects" a:LongWord; ls:LongWord; lr:LongWord; r:Array Of LongWord; Function Min(Const AArray:Array Of LongWord):LongWord; Var i:LongWord; Begin //If there is only one Member, return it immediately If Length(AArray)=1 Then Exit(AArray[0]); //Setting Result to failure //To conform with PosSetEx the next line should read: //Result:=ls; Result:=ls+1; For i:=Low(AArray) To High(AArray) Do If AArray[i]<Result Then Result:=AArray[i]; end; begin //Sanity-Check: Whyever would you pass an empty Source or Reject-String?? ls:=Length(Source); lr:=Length(Reject); If ls=0 Then Exit(0); If lr=0 Then Exit(ls); //We're only interested in the first occurence of each "Reject"-Character //So the Max for "first occurences" is the length of Reject --> //each character in reject appears at least once SetLength(r,lr); For i:=1 To lr Do Begin //Getting the first Position of each "Reject"-Character //There might be "0" in there if a "Reject"-Character wasn't found. //We're excluding them a:=Pos(Reject[i],Source); If a>0 Then Begin r[f]:=a; Inc(f); end; end; //Cutting off unneeded members, since they're initialized with "0" SetLength(r,f); //Looking for the lowest Value, We already excluded the "not found" //Only makes sense to send at "Min" if at least 1 Reject-Character has been found //Otherwise return length of Source //To conform with PosSetEx the next line should read: //If Length(r)>0 Then a:=Min(r) Else a:=0; If Length(r)>0 Then a:=Min(r)-1 Else a:=ls; Result:=a;end;
Martin_fr:
Have you considered a "set of char" or "array [char] of boolean" ?
Map every reject into the set/array, and the iterating the string once.
Currently if you have 50 reject chars, you have to do 50 times "pos()" (and if that has to go over a 10k string each time)...
I don't know how well optimized "pos" is compared to a custom loop, but as the reject count grows it will become more expensive.
Zvoni:
--- Quote from: Martin_fr on March 27, 2023, 04:10:16 pm ---Have you considered a "set of char" or "array [char] of boolean" ?
Map every reject into the set/array, and the iterating the string once.
Currently if you have 50 reject chars, you have to do 50 times "pos()" (and if that has to go over a 10k string each time)...
I don't know how well optimized "pos" is compared to a custom loop, but as the reject count grows it will become more expensive.
--- End quote ---
Thx Martin.
Honestly? I’m not that firm in Sets to „judge“ it.
And the array of Boolean is basically the way the „original“ StrCSpn was implemented.
Would have to think about it.
Yes, correct with the times to call Pos, but OTOH, how is 2345 times calling the In-Operator against the Set better until you hit a „positive“?
Martin_fr:
--- Quote from: Zvoni on March 27, 2023, 04:51:50 pm ---
--- Quote from: Martin_fr on March 27, 2023, 04:10:16 pm ---Have you considered a "set of char" or "array [char] of boolean" ?
Map every reject into the set/array, and the iterating the string once.
Currently if you have 50 reject chars, you have to do 50 times "pos()" (and if that has to go over a 10k string each time)...
I don't know how well optimized "pos" is compared to a custom loop, but as the reject count grows it will become more expensive.
--- End quote ---
Thx Martin.
Honestly? I’m not that firm in Sets to „judge“ it.
And the array of Boolean is basically the way the „original“ StrCSpn was implemented.
Would have to think about it.
Yes, correct with the times to call Pos, but OTOH, how is 2345 times calling the In-Operator against the Set better until you hit a „positive“?
--- End quote ---
The array and the set are basically the same.
- The set is 1 bit per char (and with that the need to calculate the bit pos).
- The array (if you do "array [char] of bytebool") is one byte, and you can use the char as index.
Since a 256 byte array easily fits in the cpu cache, this may even be faster, but no idea / not tested.
Using the array, is ONE loop, exactly like calling "pos" once and once only.
The diff between the array, and "pos" is that
- pos does "text[n] = chr"
- array does "array[text[n]]" (returns bool)
All else should be the same.
So technically the array does an extra memory lookup (from cpu cache). I would guess that most modern cpu will execute this in the same time (since they do multiple instructions per tick, and the time killer is not the lookup but the loop condition).
But even if that takes 10% longer (which I figure to be an extreme high value), then the overall result is still faster unless you are finding the reject in the first run of pos for at least 90% of the calls.
Of course, that all assumes that
- pos is not faster because it is coded in asm
- your loop is as well optimized by the compiler as pos may be.
- you can spare 256 bytes on the stack for the time of the function running (or spare them on the heap or in the data segment)
Zvoni:
How would that 256 byte array work with multi byte strings?
Navigation
[0] Message Index
[#] Next page