Recent

Author Topic: [SOLVED] strpos() Function results 2x slower than Perl's index() Function  (Read 6674 times)

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
I tried to replace a Perl Script with a FreePascal Application.
The Goal of this Application is to read a Log File and find matching Log Entries for a specific time window by parsing the Time References.
I got 2 files as Samples:

$ ls -lah *.log
-rw-r--r--. 1 usr15 usr15 8,0M jun  1 11:20 small_firewall.log
-rw-r--r--. 1 usr15 usr15 115M jun  1 11:20 big_firewall.log

They contain Entries like:

Aug  2 14:53:43 <server_name> kernel: IN=eth0 OUT= MAC=40:40:c9:aa:57:d3:c4:64:13:43:69:ff:08:00 SRC=193.111.152.234 DST=<server_ip> LEN=40 TOS=0x00 PREC=0x00 TTL=75 ID=39000 DF PROTO=TCP SPT=60703 DPT=21 WINDOW=29200 RES=0x00 SYN URGP=0
Aug  2 14:53:43 <server_name> kernel: IN=eth0 OUT= MAC=40:40:c9:aa:57:d3:c4:64:13:43:69:ff:08:00 SRC=193.111.152.234 DST=<server_ip>  LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=6107 DF PROTO=TCP SPT=51567 DPT=21 WINDOW=29200 RES=0x00 SYN URGP=0

and

May 11 08:00:37 <server_name> kernel: Firewall:zn12zn2:DROP:IN=eth0 OUT= MAC= SRC=213.166.70.66 DST=<server_ip> LEN=40 TOS=0x00 PREC=0x00 TTL=243 ID=21562 PROTO=TCP SPT=52760 DPT=36692 WINDOW=1024 RES=0x00 SYN URGP=0
May 11 08:00:38 <server_name> kernel: Firewall:zn12zn2:DROP:IN=eth0 OUT= MAC= SRC=195.54.160.105 DST=<server_ip> LEN=40 TOS=0x00 PREC=0x00 TTL=239 ID=32820 PROTO=TCP SPT=57781 DPT=73 WINDOW=1024 RES=0x00 SYN URGP=0


On the Small Log File it is already a bit slower than the Perl Script
The FreePascal performs like this:

$ time ./runstringsearch
file: './small_firewall.log': read do ...
Expression '^.* 11:48:[0-2][0-9] .*[:'][^:']*(udp in|drop)[^:']*[:'].*$': Match Count '15'
BuildPacketList - finished with [ 0 ]!

real      0m0.018s
user     0m0.016s
sys       0m0.002s

While the Perl does the same job like this:

$ time ./udp_watch.pl -a=l0:10:50 -f small_firewall.log -s vpn --dir=./ -s drop -t 5
Check: [ ok ]
Check finished with [ 0 ]

real   0m0.013s
user   0m0.010s
sys           0m0.002s

On the bigger Log File it becomes even more drastical:
The FreePascal Application:

$ time ./runstringsearch
file: './big_firewall.log': read do ...
Expression '^.* 11:48:[0-2][0-9] .*[:'][^:']*(udp in|drop)[^:']*[:'].*$': Match Count '0'
BuildPacketList - finished with [ 0 ]!

real   0m0.400s
user   0m0.372s
sys          0m0.028s

While the Perl Script does the same Job:

$ time ./udp_watch.pl -a=l0:10:50 -f big_firewall.log -s vpn --dir=./ -s drop -t 5
Check: [ ok ]
Check finished with [ 0 ]

real   0m0.200s
user   0m0.181s
sys           0m0.019s


I tried to replace the strpos() Function by an own Implementation but it did not change anything or made it even worse.
My FreePascal Prototype is this:
Code: Pascal  [Select][+][-]
  1. program runstringsearch;
  2.  
  3. uses
  4.   Classes, SysUtils
  5.   , RegExpr;
  6.  
  7.  
  8.  
  9. //==============================================================================
  10. // Auxiliary Functions
  11.  
  12.  
  13. function PCharAtPos(const ssource: String; iposition: Integer): PChar;
  14. begin
  15.   if (iposition > -1)
  16.     and (iposition < Length(ssource)) then
  17.   begin
  18.     Result := PChar(ssource) + iposition;
  19.   end
  20.   else
  21.     Result := Nil;
  22. end;
  23.  
  24.  
  25. function PCharStartsWith(const ssource: PChar; isourcelength: Cardinal
  26.   ; const ssearch: PChar; isearchlength: Cardinal): Boolean; (*inline;*)
  27. var
  28.   pssrc, psend, pssrh: PChar;
  29. begin
  30.   pssrc := ssource;
  31.   pssrh := ssearch;
  32.   psend := pssrc + isourcelength;
  33.  
  34.   //Shorter Srouce String doesn't match
  35.   Result := (isourcelength >= isearchlength);
  36.  
  37.   //Empty Strings don't match
  38.   Result := Result
  39.     and ((pssrc^ <> #0)
  40.       or (pssrc < psend)
  41.       or (pssrh^ <> #0));
  42.  
  43.   //First Byte must match
  44.   Result := Result and (pssrc^ = pssrh^);
  45.  
  46.   while Result
  47.       and (pssrh^ <> #0)
  48.       and (pssrc < psend)
  49.       and (pssrc^ <> #0) do
  50.   begin
  51.     if pssrc^ <> pssrh^ then
  52.       Result := False
  53.     else
  54.     begin
  55.       inc(pssrc);
  56.       inc(pssrh);
  57.     end;
  58.   end;  //while Result do
  59. end;
  60.  
  61.  
  62. function PCharSubStringRByChar(const ssource: PChar; const ssearch: Char; const ssearchend: PChar = Nil): PChar;
  63. var
  64.   pssrh: PChar;
  65. begin
  66.   pssrh := ssearchend;
  67.   Result := Nil;
  68.  
  69.   if pssrh = Nil then
  70.   begin
  71.     pssrh := ssource + strlen(ssource);
  72.   end;
  73.  
  74.   while (Result = Nil)
  75.     and (pssrh >= ssource) do
  76.   begin
  77.     if pssrh^ = ssearch then
  78.       Result := pssrh
  79.     else if pssrh > ssource then
  80.       dec(pssrh);
  81.  
  82.   end;  //while (Result = Nil) and (pssrh >= ssource) do
  83. end;
  84.  
  85.  
  86. function PCharSubStringByString(const ssource: PChar; isourcelength: Cardinal
  87.   ; const ssearch: PChar; isearchlength: Cardinal): PChar;
  88. var
  89.   pssrc, psend: PChar;
  90.   isrhps(*, icmpln, imtch*): Cardinal;
  91.   bsrhgo: Boolean;
  92. begin
  93.   Result := Nil;
  94.  
  95.   pssrc := ssource;
  96.   psend := pssrc + isourcelength;
  97.   isrhps := 0;
  98.   //icmpln := isourcelength;
  99.  
  100.   //Skip on Empty Strings
  101.   bsrhgo := (pssrc < psend);
  102.  
  103.   while bsrhgo
  104.     and (pssrc < psend) do
  105.   begin
  106.     if pssrc^ = ssearch^ then
  107.     begin
  108.       (* icmpln := isourcelength - isrhps;
  109.  
  110.       if icmpln > isearchlength then
  111.         icmpln := isearchlength; *)
  112.  
  113.       //if pssrc[0..(icmpln - 1)] = ssearch[0..(isearchlength - 1)] then
  114.       //imtch := ;
  115.       //if CompareByte(ssearch^, pssrc^, isearchlength) = 0 then
  116.       if PCharStartsWith(pssrc, isourcelength - isrhps, ssearch, isearchlength) then
  117.       begin
  118.         Result := pssrc;
  119.         bsrhgo := False;
  120.       end;
  121.  
  122.  
  123.     end;  //if pssrc^ = ssearch^ then
  124.  
  125.     inc(pssrc);
  126.     inc(isrhps);
  127.  
  128.     //String End reached
  129.     //bsrhgo := ;
  130.  
  131.   end;  //while bsrhgo do
  132. end;
  133.  
  134.  
  135. function StringRPosByChar(const ssource: String; const ssearch: Char): Integer;
  136. var
  137.   psstrt, pssrh: PChar;
  138.   isrhps: Integer;
  139. begin
  140.   psstrt := PChar(ssource);
  141.   isrhps := Length(ssource);
  142.  
  143.   pssrh := psstrt + isrhps;
  144.   Result := -1;
  145.  
  146.   while (Result = -1)
  147.     and (pssrh >= psstrt) do
  148.   begin
  149.     if pssrh^ = ssearch then
  150.       Result := isrhps
  151.     else if pssrh > psstrt then
  152.     begin
  153.       dec(pssrh);
  154.       dec(isrhps);
  155.     end; //if pssrh^ = ssearch then
  156.   end;  //while (Result = -1) and (pssrh >= psstrt) do
  157. end;
  158.  
  159.  
  160. function BuildPacketList: Integer;
  161. var
  162.   chkexp: TRegExpr;
  163.   stmfl: TFileStream;
  164.   //stmchkpkg: TStringStream;
  165.   lstchklns: TStringList;
  166.   schkpkg: String;
  167.   sflcnk: String;
  168.   pschkpkg: PChar;
  169.   schklnsrh: String;
  170.   pslstln, pschktm, pschktmps, pschktmstrt: PChar;
  171.   ichkpkgln, ischktmln(*, ichktmps*), ilstlnps: Integer;
  172.   icnksz: Integer;
  173.   bchkstrt, bchkend: Boolean;
  174.   bdbg: Boolean;
  175. begin
  176.   //------------------------------------
  177.   //Read the Firewall Log
  178.  
  179.   Result := 0;
  180.  
  181.   bdbg := false;
  182.  
  183.   try
  184.     //stmfl := TFileStream.Create('./small_firewall.log', (fmOpenRead or fmShareDenyWrite));
  185.     stmfl := TFileStream.Create('./big_firewall.log', (fmOpenRead or fmShareDenyWrite));
  186.   except
  187.     on e: Exception do
  188.     begin
  189.       WriteLn('File ', chr(39), stmfl.FileName, chr(39), ': File Open failed with [', e.HelpContext, ']!');
  190.       WriteLn('Message: ', chr(39), e.Message, chr(39));
  191.  
  192.       Result := 2;
  193.     end
  194.     else
  195.     begin
  196.       WriteLn('File ', chr(39), stmfl.FileName, chr(39), ': File Open failed!');
  197.       WriteLn('Message [-1]: ', chr(39), 'Unknown Error', chr(39));
  198.  
  199.       Result := 2;
  200.     end;
  201.   end;  //except
  202.  
  203.   lstchklns := TStringList.Create;
  204.  
  205.   icnksz := 32768;
  206.  
  207.   sflcnk := '';
  208.   schkpkg := '';
  209.  
  210.   SetLength(sflcnk, icnksz);
  211.   SetLength(schkpkg, icnksz);
  212.  
  213.   //stmchkpkg := TStringStream.Create('');
  214.  
  215.   //stmchkpkg.Size := Self.fl.ChunkSize;
  216.  
  217.   bchkstrt := False;
  218.   bchkend := not (Result = 0);
  219.  
  220.   pschktm := PChar(' 11:48:');
  221.   ischktmln := Length(' 11:48:') + 1;
  222.  
  223.   schklnsrh := '^.* 11:48:[0-2][0-9] .*[:''][^:'']*(udp in|drop)[^:'']*[:''].*$';
  224.  
  225.   try  //except
  226.     chkexp := TRegExpr.Create;
  227.  
  228.     chkexp.Expression := schklnsrh;
  229.     chkexp.ModifierI := True;
  230.     chkexp.ModifierM := True;  //allow to work with ^$
  231.  
  232.     chkexp.ModifierS:= False; //don't catch all text by .*
  233.     chkexp.ModifierX:= False; //don't ingore spaces
  234.  
  235.     //Build the Regular Expression Control Structure
  236.     chkexp.Compile;
  237.   except
  238.     on e: ERegExpr do
  239.     begin
  240.       WriteLn('Expression ', chr(39), chkexp.Expression, chr(39), ': Compile failed with [', e.ErrorCode, ']!');
  241.       WriteLn('Error at Position ', chr(39), e.CompilerErrorPos, chr(39), ': ', chr(39), e.Message, chr(39));
  242.  
  243.       if Result < 1 then
  244.         Result := 1;
  245.  
  246.       //Stop Reading
  247.       bchkend:= True;
  248.     end;
  249.   end;  //except
  250.  
  251.   if not bchkend then
  252.   begin
  253.     WriteLn('file: ', chr(39), stmfl.FileName, chr(39), ': read do ...');
  254.   end;  //if not bchkend then
  255.  
  256.   try  //except
  257.     while not bchkend
  258.       and (stmfl.Read(sflcnk[1], icnksz - 1) > 0) do
  259.     begin
  260.       pslstln := Nil;
  261.  
  262.       if sflcnk <> '' then
  263.       begin
  264.         //Skip the Last Line
  265.         ilstlnps := StringRPosByChar(sflcnk, chr(10));
  266.  
  267.         if ilstlnps <> -1 then
  268.         begin
  269.           schkpkg := schkpkg + sflcnk[1..ilstlnps] + chr(0);
  270.           //stmchkpkg.Write(sflcnk[1], ilstlnps);
  271.  
  272.           //Terminate the String for PChar
  273.           //stmchkpkg.WriteByte(0);
  274.  
  275.           inc(ilstlnps);
  276.  
  277.           pslstln := PCharAtPos(sflcnk, ilstlnps);
  278.         end
  279.         else  //The Chunk doesn't have a Linebreak
  280.         begin
  281.           schkpkg := schkpkg + sflcnk;
  282.           //stmchkpkg.WriteString(sflcnk);
  283.         end;  //if pslstln <> nil then
  284.  
  285.         pschkpkg := PChar(schkpkg);
  286.         //pschkpkg := PChar(stmchkpkg.DataString);
  287.         ichkpkgln := strlen(pschkpkg);
  288.         //ichkpkgln := stmchkpkg.Position - 1;
  289.  
  290.         if bdbg then
  291.           WriteLn('chk cnk 1: ', chr(39), pschkpkg, chr(39));
  292.  
  293.         //ichktmps := Pos(pschktm, pschkpkg);
  294.         pschktmps := strpos(pschkpkg, pschktm);
  295.         //pschktmps := PCharSubStringByString(pschkpkg, ichkpkgln, pschktm, ischktmln);
  296.         pschktmps := Nil;
  297.  
  298.         if pschktmps <> Nil then
  299.         begin
  300.           pschktmstrt := PCharSubStringRByChar(pschkpkg, chr(10), pschktmps);
  301.  
  302.           if pschktmstrt <> Nil then
  303.             inc(pschktmstrt)
  304.           else
  305.             pschktmstrt := pschktmps;
  306.  
  307.           //Start Time Check
  308.           bchkstrt := True;
  309.  
  310.           if bdbg then
  311.           begin
  312.             WriteLn('chk cnk 2: ', chr(39), pschktmstrt, chr(39));
  313.             Write('exp: ', chr(39), chkexp.Expression, chr(39), ' match - ');
  314.           end;
  315.  
  316.           try  //except
  317.             (*
  318.             if chkexp.Exec(pschktmstrt) then
  319.             begin
  320.               if bdbg then
  321.                 WriteLn('HIT');
  322.  
  323.               repeat
  324.                 //Add all Matches for Checking
  325.                 lstchklns.Add(chkexp.Match[0]);
  326.               until not chkexp.ExecNext;
  327.  
  328.               if bdbg then
  329.                 WriteLn('lst chk lns (cnt: ', chr(39), lstchklns.Count, chr(39), '): '
  330.                   , chr(39), lstchklns.CommaText, chr(39));
  331.  
  332.             end
  333.             else  //Regular Expression does not match
  334.             begin
  335.               if bdbg then
  336.                 WriteLn('MISS');
  337.  
  338.             end;  //if chkexp.Exec(pschktmstrt) then
  339.             *)
  340.           except
  341.             on e: ERegExpr do
  342.             begin
  343.               if bdbg then
  344.                 WriteLn('CRASH');
  345.  
  346.               WriteLn('Expression ', chr(39), schklnsrh, chr(39), ': Match failed with [', e.ErrorCode, ']!');
  347.               WriteLn('Error at Position ', chr(39), e.CompilerErrorPos, chr(39), ': ', chr(39), e.HelpContext, chr(39));
  348.  
  349.               if Result < 1 then
  350.                 Result := 1;
  351.  
  352.               //Stop Reading
  353.               bchkend:= True;
  354.             end;
  355.             on e: Exception do
  356.             begin
  357.               WriteLn('File ', chr(39), stmfl.FileName, chr(39), ': File Read failed with [', e.HelpContext, ']!');
  358.               WriteLn('Message: ', chr(39), e.Message, chr(39));
  359.  
  360.               if Result < 1 then
  361.                 Result := 1;
  362.  
  363.               //Stop Reading
  364.               bchkend:= True;
  365.             end;  //on e: ERegExpr do
  366.           end;  //except
  367.  
  368.         end
  369.         else if bchkstrt then
  370.         begin
  371.           //Stop Time Check
  372.           bchkend := True;
  373.         end;  //if strpos(pschkpkg, pschktm) <> Nil then
  374.       end;  //if psrdcnk^ <> '' then
  375.  
  376.       //Reset the Check String
  377.       //stmchkpkg.Position := 0;
  378.  
  379.       if pslstln <> Nil then
  380.       begin
  381.         //Readd the Last Line
  382.         schkpkg := pslstln;
  383.         //stmchkpkg.WriteString(pslstln);
  384.       end
  385.       else
  386.       begin
  387.         schkpkg := '';
  388.  
  389.       end;  //if pslstln <> nil then
  390.     end;  //while not bchkend
  391.           //  and (stmfl.Read(sflcnk[1], 32768 - 1) > 0) do
  392.   except
  393.     on e: Exception do
  394.     begin
  395.       WriteLn('File ', chr(39), stmfl.FileName, chr(39), ': File Read failed with [',  e.HelpContext, ']!');
  396.       WriteLn('Message [', e.HelpContext, ']: ', e.Message);
  397.  
  398.       Result := 1;
  399.     end
  400.     else
  401.     begin
  402.       WriteLn('File ', chr(39), stmfl.FileName, chr(39), ': File Read failed!');
  403.       WriteLn('Message [-1]: ', chr(39), 'Unknown Error', chr(39));
  404.  
  405.       Result := 1;
  406.     end;  //on e: Exception do
  407.   end;  //except
  408.  
  409.   WriteLn('Expression ', chr(39), schklnsrh, chr(39), ': Match Count ', chr(39), lstchklns.Count, chr(39));
  410.  
  411.   //------------------------
  412.   //Clean-Up
  413.  
  414.   chkexp.Free;
  415.   //stmchkpkg.Free;
  416.  
  417.   lstchklns.Free;
  418.  
  419.   stmfl.Free;
  420. end;
  421.  
  422.  
  423.  
  424. //==============================================================================
  425. // Executing Section
  426.  
  427.  
  428. const
  429.   SMETHOD = 'BuildPacketList';
  430. var
  431.   ierr: Integer;
  432. begin
  433.     try
  434.       ierr := BuildPacketList;
  435.     except
  436.       on e: Exception do
  437.       begin
  438.           WriteLn(SMETHOD, ' - failed!');
  439.           WriteLn('Message [', e.HelpContext, ']: ', e.Message);
  440.       end
  441.       else
  442.       begin
  443.           WriteLn(SMETHOD, ' - failed!');
  444.           WriteLn('Message [-1]: ', chr(39), 'Unknown Error', chr(39));
  445.       end;  //on e: Exception do
  446.  
  447.       ierr := 1;
  448.     end;
  449.  
  450.   WriteLn(SMETHOD, ' - finished with [', ierr, ']!');
  451.  
  452. end.
  453.  

The Regex Match does not make any difference.
Even commenting it out the Run Time stays the same:

$ time ./runstringsearch
file: './big_firewall.log': read do ...
Expression '^.* 11:48:[0-2][0-9] .*[:'][^:']*(udp in|drop)[^:']*[:'].*$': Match Count '0'
BuildPacketList - finished with [ 0 ]!

real   0m0.448s
user   0m0.432s
sys           0m0.015s


But the time drops drastically on disabling the strpos() Call:

$ time ./runstringsearch
file: './big_firewall.log': read do ...
Expression '^.* 11:48:[0-2][0-9] .*[:'][^:']*(udp in|drop)[^:']*[:'].*$': Match Count '0'
BuildPacketList - finished with [ 0 ]!

real   0m0.130s
user   0m0.109s
sys           0m0.021s


So I was looking into the Perl Source Code and found that they use the C Programming Language Function memchr()
FreePascal has a Function that works a bit similar the CompareByte()  but it does not really do the same and even results slower than Function PCharStartsWith() from the Prototype.

I wonder whether there is anything similar to the memchr() Function or if it can be developped:
https://www.man7.org/linux/man-pages/man3/memchr.3.html
« Last Edit: October 13, 2020, 11:57:55 am by domibay_hugo »

julkas

  • Guest
Compile your Pascal program with -O2 or -O3 option.

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
Compile your Pascal program with -O2 or -O3 option.
Yes, all measurements were done with the Release Version with -O3 enabled.
Unless it would not be a fair comparison.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
So how does the Perl index function work? Is it a simple loop with a character compare too, or does it use special primitives?

FPC also has some basic primitives, so a possible improvement could be using primitives in FPC too, the indexbyte() function that most FPC character search routines in the RTL are based on.

Note that anything regex based will always be a tad slow, since that is an interpretive solution. Though some regex engines optimize some simple cases with dedicated code.
« Last Edit: June 01, 2020, 04:21:26 pm by marcov »

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
I supposed that the TStringStream could efficiently handle the String Data but it resulted slower than a bare String Variable.
I also observed that the TStringStream seams to drop its Memory and than reallocate it as shown in the strace log:

read(3, "3 05:48:20 lon08 kernel: IN=eth0"..., 32767) = 32767
read(3, "3 05:48:34 lon08 kernel: IN=eth0"..., 32767) = 32767
munmap(0x7fed5b148000, 32768)           = 0
read(3, "Aug  3 05:48:48 lon08 kernel: IN"..., 32767) = 32767
read(3, "RGP=0 \nAug  3 05:49:03 lon08 ker"..., 32767) = 32767


read(3, "kernel: IN=eth0 OUT= MAC=40:40:c"..., 32767) = 32767
mmap(NULL, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fed5b148000
read(3, "0 \nAug  3 06:14:25 lon08 kernel:"..., 32767) = 32767

Which does not occur when using the String Variable

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
So how does the Perl index function work? Is it a simple loop with a character compare too, or does it use special primitives?

FPC also has some basic primitives, so a possible improvement could be using primitives in FPC too, the indexbyte() function that most FPC character search routines in the RTL are based on.

As far as I understood they compare 2 bytes at once with the memchr Function.

The IndexByte() Function seems to me very promissing.

Quote
Note that anything regex based will always be a tad slow, since that is an interpretive solution. Though some regex engines optimize some simple cases with dedicated code.

I am aware that Regex Search is known to be slower. Therefore I am trying to speed up the processing with a previous substr() check.
« Last Edit: June 01, 2020, 04:34:47 pm by domibay_hugo »

Thaddy

  • Hero Member
  • *****
  • Posts: 14213
  • Probably until I exterminate Putin.
StrPos() is a C support function and indeed is much slower than Pascal native Pos() family.
StrPos() also FAILS on true Pascal type strings. (Since length is known and embedded #0's are supported in Pascal string types)
It is geared towards PChars. (Which is famously not the brightest ideafrom curly bracket language designers....).
If you have the need for StrPos(), you are not using Pascal features, just a rude version to make some Pascal interact with the above silly design of C like "strings". Technically C as a language does not even know strings. Do away with any of the curly brackets and you are fine as far as "strings" are concerned.

Basically: Pascal does not need to check for buffer overruns.
Algorithmic, walking a Pascal string or a C style buffer should be the same in O notation: O(n). But technically, true Pascal stringtypes - should and do -  have an advantage.
And Strpos() is not really Pascal.....
« Last Edit: June 01, 2020, 04:44:31 pm by Thaddy »
Specialize a type, not a var.

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
StrPos() is a C support function and indeed is much slower than Pascal native Pos() family.
StrPos() also FAILS on true Pascal type strings. (Since length is known and embedded #0's are supported in Pascal string types)
It is geared towards PChars. (Which is famously not the brightest ideafrom curly bracket language designers....).
If you have the need for StrPos(), you are not using Pascal features, just a rude version to make some Pascal interact with the above silly design of C like "strings". Technically C as a language does not even know strings. Do away with any of the curly brackets and you are fine as far as "strings" are concerned.

Obviously it would be nicer to be able work with native String Variables. But I'm unsure if the Syntax "pascalstring[ipos] = searchbyte" is the same as fast as "pcharstring^ = searchbyte". Or does it not activate serveral Range Checks which slow down the processing even more?

Further more most String has the unconvienent habit to copy strings in Memory which know as another slow-down.

So with PChar I can access the String without copying it over and over again.

Even nice would be the access to String Slices without copying them.

Thaddy

  • Hero Member
  • *****
  • Posts: 14213
  • Probably until I exterminate Putin.
Then you are fooled by similarities in syntax.
You can run similar code with {$pointermath on}, which gives results that are closer to C-style  [] indexing, but slightly different.
A PChar buffer can simply be indexed like in C, but using Pointer syntax.
In theory the same speed. Do not confuse language syntax with feature support, because you can NOT assume that to be the case. You should investigate what the syntax is for the languages that you want to compare the same thing.... That is a common mistake to make...
Specialize a type, not a var.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
As far as I understood they compare 2 bytes at once with the memchr Function.

So Perl doesn't use a naive own implemented function, but optimized functions. Then you must do the same with FPC.

Quote
The IndexByte() Function seems to me very promissing.

For two byte characters there is IndexWord.


BeniBela

  • Hero Member
  • *****
  • Posts: 905
    • homepage
The real fast functions probably use AVX

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
The Perl Interpreter is just simple C Language Code. There is no Black Magic Assembly Code in there.
https://www.mkssoftware.com/docs/perl/pod/perlapi.asp#item_fbm_instr
This Documentation states they implemented the "Boyer-Moore algorithm"
https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm
Quote
The key features of the algorithm are to match on the tail of the pattern rather than the head, and to skip along the text in jumps of multiple characters rather than searching every single character in the text.

The Black Magic consists simply in not matching every single Character of the Search against every single Character in the Source
like Pos() and all developed approaches do.

In deed when debugging I found that the PCharStartsWith() Function was triggered tens of thousands of times on every Text Chunk.
« Last Edit: June 02, 2020, 03:46:50 pm by domibay_hugo »

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
FPC also has some basic primitives, so a possible improvement could be using primitives in FPC too, the indexbyte() function that most FPC character search routines in the RTL are based on.

I implemented 2 Function to test the Performance on the given Test Data
Comparing 2 Bytes with the IndexByte() Function
Code: Pascal  [Select][+][-]
  1. function PCharSubStringByString_v2(const ssource: PChar; isourcelength: Cardinal
  2.   ; const ssearch: PChar; isearchlength: Cardinal): PChar;
  3. var
  4.   pssrc, psend: PChar;
  5.   isrhps: Cardinal;
  6.   imtchps: Integer;
  7.   bsrhgo: Boolean;
  8. begin
  9.   Result := Nil;
  10.  
  11.   pssrc := ssource;
  12.   psend := pssrc + isourcelength;
  13.   isrhps := 0;
  14.   //icmpln := isourcelength;
  15.  
  16.   //Skip on Empty Strings
  17.   bsrhgo := (pssrc < psend);
  18.  
  19.   while bsrhgo
  20.     and (pssrc < psend) do
  21.   begin
  22.     if pssrc[1] <> #0 then
  23.     begin
  24.       //More than 1 Byte left
  25.       imtchps := IndexByte(pssrc^, 2, Byte(ssearch^));
  26.  
  27.       if imtchps <> -1 then
  28.       begin
  29.         inc(pssrc, imtchps);
  30.         inc(isrhps, imtchps);
  31.  
  32.         if PCharStartsWith(pssrc, isourcelength - isrhps, ssearch, isearchlength) then
  33.         begin
  34.           Result := pssrc;
  35.           bsrhgo := False;
  36.         end;
  37.  
  38.         inc(pssrc, 2 - imtchps);
  39.         inc(isrhps, 2 - imtchps);
  40.       end
  41.       else  //Bytes do not match
  42.       begin
  43.         inc(pssrc, 2);
  44.         inc(isrhps, 2);
  45.       end;  //if imtchps <> -1 then
  46.     end
  47.     else  //Last Byte
  48.     begin
  49.       if pssrc^ = ssearch^ then
  50.       begin
  51.         if PCharStartsWith(pssrc, isourcelength - isrhps, ssearch, isearchlength) then
  52.         begin
  53.           Result := pssrc;
  54.           bsrhgo := False;
  55.         end;
  56.       end;  //if pssrc^ = ssearch^ then
  57.  
  58.       inc(pssrc);
  59.       inc(isrhps);
  60.     end;  //if pssrc[1] = #0 then
  61.   end;  //while bsrhgo do
  62. end;
  63.  

And comparing 4 Bytes with the IndexByte() Function
Code: Pascal  [Select][+][-]
  1. function PCharSubStringByString_v4(const ssource: PChar; isourcelength: Cardinal
  2.   ; const ssearch: PChar; isearchlength: Cardinal): PChar;
  3. var
  4.   pssrc, psend: PChar;
  5.   icmpln: Cardinal;
  6.   imtchps: Integer;
  7.   bsrhgo: Boolean;
  8. begin
  9.   Result := Nil;
  10.  
  11.   pssrc := ssource;
  12.   psend := pssrc + isourcelength;
  13.   icmpln := 4;
  14.  
  15.   //Skip on Empty Strings
  16.   bsrhgo := (pssrc < psend);
  17.  
  18.   while bsrhgo
  19.     and (pssrc < psend) do
  20.   begin
  21.     if icmpln > (psend - pssrc) then
  22.       icmpln := psend - pssrc;
  23.  
  24.     imtchps := IndexByte(pssrc^, icmpln, Byte(ssearch^));
  25.  
  26.     if imtchps <> -1 then
  27.     begin
  28.       inc(pssrc, imtchps);
  29.  
  30.       if PCharStartsWith(pssrc, psend - pssrc, ssearch, isearchlength) then
  31.       begin
  32.         Result := pssrc;
  33.         bsrhgo := False;
  34.       end;
  35.  
  36.       inc(pssrc, icmpln - imtchps);
  37.     end
  38.     else  //Bytes do not match
  39.     begin
  40.       inc(pssrc, icmpln);
  41.     end;  //if imtchps <> -1 then
  42.   end;  //while bsrhgo do
  43. end;
  44.  

The Performance dropped with the PCharSubStringByString_v2() Function
and for the PCharSubStringByString_v4() Function it was just the same as for the Initial Version.

I suppose that the Assembly Code of IndexByte() does just the same as the PCharSubStringByString() Function.

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
StrPos() is a C support function and indeed is much slower than Pascal native Pos() family.
StrPos() also FAILS on true Pascal type strings. (Since length is known and embedded #0's are supported in Pascal string types)
It is geared towards PChars. (Which is famously not the brightest ideafrom curly bracket language designers....).
If you have the need for StrPos(), you are not using Pascal features, just a rude version to make some Pascal interact with the above silly design of C like "strings". Technically C as a language does not even know strings. Do away with any of the curly brackets and you are fine as far as "strings" are concerned.

Basically: Pascal does not need to check for buffer overruns.
Algorithmic, walking a Pascal string or a C style buffer should be the same in O notation: O(n). But technically, true Pascal stringtypes - should and do -  have an advantage.
And Strpos() is not really Pascal.....

I did also a try with replacing the PChar Search Functions with the native Pos() Function in:
Code: Pascal  [Select][+][-]
  1.         if Pos(pschktm, pschkpkg) > 0 then
  2.         begin
  3.           pschktmstrt := pschkpkg;          
  4.        
  5.           // [...] Do the Regex Match
  6.  
  7.         end;
  8.  

As a result the Code slowed down another 3x than the original Version:

$ time ./runstringsearch
file: './small_firewall.log': read do ...
Expression '^.* 11:48:[0-2][0-9] .*[:'][^:']*(udp in|drop)[^:']*[:'].*$': Match Count '15'
BuildPacketList - finished with [ 0 ]!

real   0m0.031s
user   0m0.030s
sys   0m0.001s

Thaddy

  • Hero Member
  • *****
  • Posts: 14213
  • Probably until I exterminate Putin.
You also did not compile everything with all compiler settings:
E.g. fpc <yourprogram or library> -Cp<whatever core you want> etc...
Those noob script languages usually rely on optimizations from an external library written in a real compiled language. So if you want to compare that, you need to understand the optimization levels of what the imported code really does...
« Last Edit: June 02, 2020, 03:48:08 pm by Thaddy »
Specialize a type, not a var.

 

TinyPortal © 2005-2018