Recent

Author Topic: [SOLVED] strpos() Function results 2x slower than Perl's index() Function  (Read 6740 times)

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11449
  • FPC developer.
https://www.freepascal.org/docs-html/rtl/strutils/findmatchesboyermoorecasesensitive.html

Thanks, the stringreplace was fixed, I fixed the other one, but probably that won't make 3.2.0 docs anymore

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
Indexbyte will only be worthwhile for longer strings. You use it to find the first matching char and then do a comparebyte to compare the found ptrs.

Thinking of this Documentation:
https://www.man7.org/linux/man-pages/man3/memchr.3.html

I was wondering if the IndexByte() Function could be useful to write a faster strlen() Function just like in this Example:
Quote
The following call is a fast means of locating a string's terminating null byte:

           char *p = rawmemchr(s, '\0');

So I developed this little Function:
Code: Pascal  [Select][+][-]
  1. {$IFOPT D+}
  2.   {$NOTE debug mode is active. disabling inline}
  3.   {$DEFINE debug_on}
  4. {$ENDIF}
  5.  
  6. function PCharLength(const ssource: PChar): Cardinal; {$IFNDEF debug_on} inline; {$ENDIF}
  7. var
  8.   pssrc: PChar;
  9.   imtchps: Integer;
  10.   isrhrng: Cardinal;
  11. begin
  12.   Result := 0;
  13.  
  14.   pssrc := ssource;
  15.   isrhrng := 1024;
  16.  
  17.   repeat  //until imtchps <> -1;
  18.     imtchps := IndexByte(pssrc^, isrhrng, 0);
  19.  
  20.     if imtchps = -1 then
  21.     begin
  22.       inc(Result, isrhrng);
  23.       inc(pssrc, isrhrng);
  24.  
  25.       isrhrng := isrhrng * 2;
  26.     end;  //if imtchps = -1 then
  27.   until imtchps <> -1;
  28.  
  29.   inc(Result, imtchps);
  30. end;
  31.  

I made a little benchmark to see if there is any measurable improvement:

-------
Big Source 1 (length: '33044') - strlen (count: '1000') - Start - Now in millisecs since midnight : 44566167
Big Source 1 - strlen - End - Now in millisecs since midnight : 44566191
Big Source 1: strlen completed in '23.9996938034892' ms.
Big Source 1 (length: '33044') - PCharLength (count: '1000') - Start - Now in millisecs since midnight : 44566191
Big Source 1 - PCharLength - End - Now in millisecs since midnight : 44566212
Big Source 1: PCharLength completed in '21.0004393011332' ms.
-------
Big Source 1 (length: '66088') - strlen (count: '1000') - Start - Now in millisecs since midnight : 44566212
Big Source 1 - strlen - End - Now in millisecs since midnight : 44566250
Big Source 1: strlen completed in '37.9995675757527' ms.
Big Source 1 (length: '66088') - PCharLength (count: '1000') - Start - Now in millisecs since midnight : 44566250
Big Source 1 - PCharLength - End - Now in millisecs since midnight : 44566283
Big Source 1: PCharLength completed in '32.9999718815088' ms.


In strings > 32KB you can notice an advantage of 2 - 3 ms on 1000 Operations
and on strings > 64 KB you can observe an advantage of 3 - 5 ms on 1000 Operations

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
StrPos() is a C support function and indeed is much slower than Pascal native Pos() family.
StrPos() also FAILS on true Pascal type strings. (Since length is known and embedded #0's are supported in Pascal string types)
It is geared towards PChars. (Which is famously not the brightest ideafrom curly bracket language designers....).
If you have the need for StrPos(), you are not using Pascal features, just a rude version to make some Pascal interact with the above silly design of C like "strings". Technically C as a language does not even know strings. Do away with any of the curly brackets and you are fine as far as "strings" are concerned.

Basically: Pascal does not need to check for buffer overruns.
Algorithmic, walking a Pascal string or a C style buffer should be the same in O notation: O(n). But technically, true Pascal stringtypes - should and do -  have an advantage.
And Strpos() is not really Pascal.....

Actually in Praxis PChar in FreePascal is the missing String Byte Iterator to get access to the Data and avoid Copying. Widely found in the FCL.

As shown in this little Snypplet:
Code: Pascal  [Select][+][-]
  1. procedure TestPCharFromString;
  2. var
  3.   stream: TStringStream;
  4.   sdata: String;
  5.   psdata: PAnsiString;
  6.   pcdata: PChar;
  7.   idatalength: Integer;
  8. begin
  9. stream := TStringStream.Create('');
  10.  
  11. stream.WriteString('My unchangeable Source String. Is it really unchangeable?');
  12.  
  13. stream.WriteByte(0);
  14.  
  15. pcdata := @stream.DataString[1];
  16. idatalength := stream.Position - 1;
  17.  
  18. writeln('stream 0 (length: ', chr(39), idatalength, chr(39), '): ', chr(39), stream.DataString, chr(39));
  19.  
  20. pcdata^ := 'b';
  21. inc(pcdata);
  22. pcdata^ := 'l';
  23. inc(pcdata);
  24. pcdata^ := 'a';
  25. inc(pcdata);
  26. pcdata^ := ' ';
  27.  
  28. writeln('stream 1 (length: ', chr(39), idatalength, chr(39), '): ', chr(39), stream.DataString, chr(39));
  29.  
  30. pcdata := PChar(stream.DataString);
  31.  
  32. pcdata[0] := 'f';
  33. pcdata[1] := 'o';
  34. pcdata[2] := 'o';
  35.  
  36. writeln('stream 2 (length: ', chr(39), idatalength, chr(39), '): ', chr(39), stream.DataString, chr(39));
  37.  
  38. sdata := stream.DataString;
  39.  
  40. sdata[1] := 'I';
  41. sdata[2] := ' ';
  42. sdata[3] := 'a';
  43. sdata[4] := 'm';
  44. sdata[5] := ' ';
  45. sdata[6] := 'a';
  46. sdata[7] := ' ';
  47. sdata[8] := 'C';
  48. sdata[9] := 'o';
  49. sdata[10] := 'p';
  50. sdata[11] := 'y';
  51. sdata[12] := ' ';
  52.  
  53. writeln('stream 3 (length: ', chr(39), idatalength, chr(39), '): ', chr(39), stream.DataString, chr(39));
  54. writeln('data 3 (length: ', chr(39), Length(sdata), chr(39), '): ', chr(39), sdata, chr(39));
  55.  
  56. psdata := @stream.DataString;
  57.  
  58. psdata^[1] := 'I';
  59. psdata^[2] := ' ';
  60. psdata^[3] := 'a';
  61. psdata^[4] := 'm';
  62. psdata^[5] := ' ';
  63. psdata^[6] := 'o';
  64. psdata^[7] := 'r';
  65. psdata^[8] := 'i';
  66. psdata^[9] := 'g';
  67. psdata^[10] := 'i';
  68. psdata^[11] := 'n';
  69. psdata^[12] := 'a';
  70. psdata^[13] := 'l';
  71. psdata^[14] := ' ';
  72.  
  73. writeln('stream 4 (length: ', chr(39), Length(psdata^), chr(39), '): ', chr(39), stream.DataString, chr(39));
  74.  
  75. stream.Free;
  76. stream := Nil;
  77. end;
  78.  

Which produces:

stream 0 (length: '57'): 'My unchangeable Source String. Is it really unchangeable?'
stream 1 (length: '57'): 'bla nchangeable Source String. Is it really unchangeable?'
stream 2 (length: '57'): 'foo nchangeable Source String. Is it really unchangeable?'
stream 3 (length: '57'): 'foo nchangeable Source String. Is it really unchangeable?'
data 3 (length: '58'): 'I am a Copy ble Source String. Is it really unchangeable?'
stream 4 (length: '58'): 'I am original e Source String. Is it really unchangeable?'


The PChar psdata is a Pointer to the Bytes in stream.DataString.
While the String sdata is a Copy of the whole String stream.DataString.
« Last Edit: June 19, 2020, 01:49:21 pm by domibay_hugo »

PascalDragon

  • Hero Member
  • *****
  • Posts: 5469
  • Compiler Developer
Actually in Praxis PChar in FreePascal is the missing String Byte Iterator to get access to the Data and avoid Copying. Widely found in the FCL.

The PChar psdata is a Pointer to the Bytes in stream.DataString.
While the String sdata is a Copy of the whole String stream.DataString.

Not quite. As long as you don't change sdata it will point to the same string data as the internal field of TStringStream. But once you change the string it will be made unique. This is the Copy-On-Write mechanism of managed strings. To see this in action you should adjust your example like this:

Code: Pascal  [Select][+][-]
  1. //...
  2. writeln('stream 2 (length: ', chr(39), idatalength, chr(39), '): ', chr(39), stream.DataString, chr(39));
  3.  
  4. sdata := stream.DataString;
  5. Writeln(StringRefCount(sdata));
  6.  
  7. sdata[1] := 'I';
  8. Writeln(StringRefCount(sdata));
  9. sdata[2] := ' ';
  10. sdata[3] := 'a';
  11. //...

Notice the two calls to StringRefCount which will return the current reference count of the string. The output will then be:

Code: [Select]
PS E:\fpc\git> .\testoutput\tstrstrm.exe
stream 0 (length: '57'): 'My unchangeable Source String. Is it really unchangeable? '
stream 1 (length: '57'): 'bla nchangeable Source String. Is it really unchangeable? '
stream 2 (length: '57'): 'foo nchangeable Source String. Is it really unchangeable? '
2
1
stream 3 (length: '57'): 'foo nchangeable Source String. Is it really unchangeable? '
data 3 (length: '58'): 'I am a Copy ble Source String. Is it really unchangeable? '
stream 4 (length: '58'): 'I am original e Source String. Is it really unchangeable? '

As you can see the reference count is first 2, because both the stream and sdata reference the string data, but upon modifying the string the reference count becomes one, because the string has been made unique. This can also be seen in the assembly code:

Code: [Select]
# [45] sdata := stream.DataString;
movl -4(%ebp),%eax
movl 4(%eax),%edx
leal -8(%ebp),%eax
call fpc_ansistr_assign
# [46] Writeln(StringRefCount(sdata));
call fpc_get_output
movl %eax,%ebx
movl -8(%ebp),%eax
call SYSTEM_$$_STRINGREFCOUNT$RAWBYTESTRING$$LONGINT
movl %eax,%ecx
movl %ebx,%edx
movl $0,%eax
call fpc_write_text_sint
call FPC_IOCHECK
movl %ebx,%eax
call fpc_writeln_end
call FPC_IOCHECK
# [48] sdata[1] := 'I';
leal -8(%ebp),%eax
call fpc_ansistr_unique
movb $73,(%eax)

Thus, as long as you don't change the string you don't need to access it as a PChar.

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
As long as you don't change sdata it will point to the same string data as the internal field of TStringStream. But once you change the string it will be made unique. This is the Copy-On-Write mechanism of managed strings.
Thus, as long as you don't change the string you don't need to access it as a PChar.

This becomes ever more intriguing because I understand to convert a PascalString into a valid PChar you need to add the NULL Byte.

So if I change my example to:
Code: Pascal  [Select][+][-]
  1. stream := TStringStream.Create('');
  2.  
  3. stream.WriteString('My unchangeable Source String. Is it really unchangeable?');
  4.  
  5. writeln('stream 0 (length: ', chr(39), stream.Size, chr(39), '): ', chr(39), stream.DataString, chr(39));
  6.  
  7. pcdata := PChar(stream.DataString);
  8. idatalength := stream.Position;
  9.  
  10. pcdata[0] := 'f';
  11. pcdata[1] := 'o';
  12. pcdata[2] := 'o';
  13. pcdata[3] := ' ';
  14.  
  15. writeln('stream 1 (length: ', chr(39), idatalength, chr(39), '): ', chr(39), stream.DataString, chr(39));
  16.  
  17. stream.WriteByte(0);
  18.  
  19. pcdata := @stream.DataString[1];
  20. idatalength := stream.Position - 1;
  21.  
  22. writeln('stream 2 (length: ', chr(39), idatalength, chr(39), '): ', chr(39), stream.DataString, chr(39));
  23.  
  24. pcdata^ := 'b';
  25. inc(pcdata);
  26. pcdata^ := 'l';
  27. inc(pcdata);
  28. pcdata^ := 'a';
  29. inc(pcdata);
  30. pcdata^ := ' ';
  31.  
  32. writeln('stream 3 (length: ', chr(39), idatalength, chr(39), '): ', chr(39), stream.DataString, chr(39));
  33.  
  34.  

I get an Output like:

stream 0 (length: '57'): 'My unchangeable Source String. Is it really unchangeable?'
stream 1 (length: '57'): 'foo nchangeable Source String. Is it really unchangeable?'
stream 2 (length: '57'): 'foo nchangeable Source String. Is it really unchangeable?'
stream 3 (length: '57'): 'bla nchangeable Source String. Is it really unchangeable?'


At the Operation:
Code: Pascal  [Select][+][-]
  1. pcdata := PChar(stream.DataString);
  2.  

Where is the NULL Byte added since the PChar pcdata always points to the internal data of TStringStream.DataString and at this point does not have any ?

PascalDragon

  • Hero Member
  • *****
  • Posts: 5469
  • Compiler Developer
Where is the NULL Byte added since the PChar pcdata always points to the internal data of TStringStream.DataString and at this point does not have any ?

Pascal strings always end with a NUL to simplify conversion to PChar. Though of course that NUL won't have any effect if a previous NUL is encountered. ;)

Just take a look at the generated assembly for the string.

Take this code for example:

Code: Pascal  [Select][+][-]
  1. program tstrtest;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. var
  6.   s: AnsiString;
  7. begin
  8.   s := 'Test1234';
  9. end.

The assignment will be code like this:

Code: ASM  [Select][+][-]
  1. # [8] s := 'Test1234';
  2.         leaq    _$TSTRTEST$_Ld1(%rip),%rax
  3.         movq    %rax,%rdx
  4.         leaq    U_$P$TSTRTEST_$$_S(%rip),%rcx
  5.         call    fpc_ansistr_assign

So we'll look at the _$TSTRTEST$_Ld1 symbol:

Code: ASM  [Select][+][-]
  1. .section .rodata.n__$TSTRTEST$_Ld1,"d"
  2.         .balign 8
  3.         .short  0,1
  4.         .long   0
  5.         .quad   -1,8
  6. .globl  _$TSTRTEST$_Ld1
  7. _$TSTRTEST$_Ld1:
  8.         .ascii  "Test1234\000"

And there is the NUL at the end. ;)

Please note that this applies to AnsiString, WideString and UnicodeString, but not to ShortString.

 

TinyPortal © 2005-2018