Hi Folks,
for the last 20 years (VB6/VBA), 2 of my favorite (Windows-) API's are "StrSpn" and "StrCSpn", which are "classics" for tokenizing strings.
For this thread, i'm concentrating on "StrCSpn"
The C Non-Unicode (single-byte) String-Version uses some clever mapping with the extended ASCII-Set
(see here:
https://codebrowser.dev/glibc/glibc/string/strcspn.c.html),
but nowadays that's obviously not enough.
So i've looked at alternative implementations (like here:
https://android.googlesource.com/platform/bionic/+/a27d2baa/libc/string/strcspn.c)
Bleh, they are basically throwing each character of Source at the "Rejects", checking if it apears in Rejects.
And that, ladies and gentlemen, rubs me the wrong way.
Why? Imagine a really, really long string (say, some 10K or more characters), and the first Reject appears somewhere to the end, or worse, no Reject in Source at all.
Now, i've been looking for something along those lines in Freepascal, and found PosSetEx from StrUtils to come closest to it.
Looking at the Source-Code of PosSetEx, color me surprised: It's the same implementation!!
Walking character by character through the Source checking if it exists in the (converted to a) Set of Chars we want to stop at.
So i wrote my own "StrCSpn", and as the thread-title says, i'm requesting code-review
Note 1: I've chosen "Longword" for the simple reason that i was looking for an unsigned Integer.
Note 2: There're 2 major differences between StrCSpn and PosSetEx
Difference 1) StrCSpn returns how many characters of Source appear BEFORE a Reject-Character. PosSetEx returns the Position of the first Reject-Character
Difference 2) StrCSpn returns the Length of Source in case no Reject-Character has been found. PosSetEx returns 0 in this case
Improvements and/or alternatives are welcome.
If anyone has a Testsuite to throw against, i'd welcome any performance- and/or bug-reports
i hope my comments are sufficient
//The "original" StrCSpn-C-Implementation --> corresponds to "PosSetEx" in StrUtils
//throws the whole Source-String char-by-char at "Reject", checking if it appears there
//if yes, return that found position --> WTF??
//Looking at the Source-code of PosSetEx in strutils.pp it's doing the same
//If the Source-String is really long, and the first occurence of a "Reject"-Character
//is very late........ talk about wasted performance
//Note: StrCSpn is "one off" compared to PosSetEx
function StrCSpn(const Source:String;const Reject:String):LongWord;
Var
i:LongWord;
f:LongWord=0; //Counter for "found rejects"
a:LongWord;
ls:LongWord;
lr:LongWord;
r:Array Of LongWord;
Function Min(Const AArray:Array Of LongWord):LongWord;
Var
i:LongWord;
Begin
//If there is only one Member, return it immediately
If Length(AArray)=1 Then Exit(AArray[0]);
//Setting Result to failure
//To conform with PosSetEx the next line should read:
//Result:=ls;
Result:=ls+1;
For i:=Low(AArray) To High(AArray) Do
If AArray[i]<Result Then Result:=AArray[i];
end;
begin
//Sanity-Check: Whyever would you pass an empty Source or Reject-String??
ls:=Length(Source);
lr:=Length(Reject);
If ls=0 Then Exit(0);
If lr=0 Then Exit(ls);
//We're only interested in the first occurence of each "Reject"-Character
//So the Max for "first occurences" is the length of Reject -->
//each character in reject appears at least once
SetLength(r,lr);
For i:=1 To lr Do
Begin
//Getting the first Position of each "Reject"-Character
//There might be "0" in there if a "Reject"-Character wasn't found.
//We're excluding them
a:=Pos(Reject[i],Source);
If a>0 Then
Begin
r[f]:=a;
Inc(f);
end;
end;
//Cutting off unneeded members, since they're initialized with "0"
SetLength(r,f);
//Looking for the lowest Value, We already excluded the "not found"
//Only makes sense to send at "Min" if at least 1 Reject-Character has been found
//Otherwise return length of Source
//To conform with PosSetEx the next line should read:
//If Length(r)>0 Then a:=Min(r) Else a:=0;
If Length(r)>0 Then a:=Min(r)-1 Else a:=ls;
Result:=a;
end;