Recent

Author Topic: [SOLVED] strpos() Function results 2x slower than Perl's index() Function  (Read 6737 times)

Thaddy

  • Hero Member
  • *****
  • Posts: 14367
  • Sensorship about opinions does not belong here.
Compile your Pascal program with -O2 or -O3 option.
Not only that. You have to choose your bottom line CPU and FPU settings (and -O4)
You usually want the complete RTL compiled with these settings.
Depending on application the speed difference can be moderate or frighteningly faster.
« Last Edit: June 02, 2020, 04:00:33 pm by Thaddy »
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
You also did not compile everything with all compiler settings:
E.g. fpc <yourprogram or library> -Cp<whatever core you want> etc...
I activated compiler settings as suggested
No Compiler Optimization:

$ hyperfine --warmup 3 -r 100 ./runstringsearch
Benchmark #1: ./runstringsearch
  Time (mean ± σ):       9.6 ms ±   0.6 ms    [User: 9.1 ms, System: 0.6 ms]
  Range (min … max):     8.8 ms …  12.4 ms    100 runs

Compiler Optimization -CfAVX2 :

$ hyperfine --warmup 3 -r 100 ./runstringsearch
Benchmark #1: ./runstringsearch
  Time (mean ± σ):       9.5 ms ±   0.4 ms    [User: 8.6 ms, System: 0.8 ms]
  Range (min … max):     9.0 ms …  10.9 ms    100 runs

Compiler Optimizations -CfAVX2 -CpCOREAVX2:

$ hyperfine --warmup 3 -r 100 ./runstringsearch
Benchmark #1: ./runstringsearch
  Time (mean ± σ):       9.5 ms ±   0.4 ms    [User: 8.8 ms, System: 0.8 ms]
  Range (min … max):     9.0 ms …  11.1 ms    100 runs


I conclude that there is no measurable improvement.

So the solution is not on the Assembly Code Level.

Thaddy

  • Hero Member
  • *****
  • Posts: 14367
  • Sensorship about opinions does not belong here.
I conclude that there is no measurable improvement.
Because you did not recompile the rtl with proper settings.
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
Compile your Pascal program with -O2 or -O3 option.
Not only that. You have to choose your bottom line CPU and FPU settings (and -O4)
You usually want the complete RTL compiled with these settings.
Depending on application the speed difference can be moderate or frighteningly faster.

I activated the -O4(use with care) Setting:

$ ./runstringsearch
file: './small_firewall.log': read do ...
Expression '^.* 11:48:[0-2][0-9] .*[:'][^:']*(udp in|drop)[^:']*[:'].*$': Match Count '15'
BuildPacketList - finished with [ 0 ]!

$ hyperfine --warmup 3 -r 100 ./runstringsearch
Benchmark #1: ./runstringsearch
  Time (mean ± σ):       9.5 ms ±   0.4 ms    [User: 8.8 ms, System: 0.8 ms]
  Range (min … max):     9.0 ms …  10.8 ms    100 runs


Again no measurable improvement.

I think recompiling the RTL with unsafe Settings is not a good idea since it might break my development environment

Thaddy

  • Hero Member
  • *****
  • Posts: 14367
  • Sensorship about opinions does not belong here.
I conclude that there is no measurable improvement.
Because you did not recompile the rtl with proper settings.
You need to compile the rtl with optimal settings... Again the Hives: https://www.youtube.com/watch?v=Uz1Jwyxd4tE

I know this is not easy for most....but it is easy...
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9864
  • Debugger - SynEdit - and more
    • wiki
If it was just to test the performance of "strpos" under -O4 => copy the code, and compile the copy with O4

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
I rather think that the solution lies in the Search Algorithm
Comparing the common Brute Force approach:
https://en.wikipedia.org/wiki/Brute-force_search
Against the documented Algorithm of the Perl engine:
https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm

As a side Remark:
Perl performs like this right out of the Box
without recompiling the whole Interpreter Engine

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
If it was just to test the performance of "strpos" under -O4 => copy the code, and compile the copy with O4
to find a real difference you need to feed enough Test Data into the Algorithm
like explained in the documentation
https://en.wikipedia.org/wiki/Brute-force_search
Quote
While a brute-force search is simple to implement, and will always find a solution if it exists, its cost is proportional to the number of candidate solutions – which in many practical problems tends to grow very quickly as the size of the problem increases (§Combinatorial explosion).
on little data you won't find a difference.
but as the documentation explains the solution is not scalable

The best way to feed enough data into the Algorithm like in a Real-World Production Tool is with a file.

But still any improvements on the code are welcomed.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9864
  • Debugger - SynEdit - and more
    • wiki
Perl performs like this right out of the Box
without recompiling the whole Interpreter Engine
Well we do not know if it is a question of O4 or not.
The suggestion was to test that, to see if it is.

avk

  • Hero Member
  • *****
  • Posts: 752
I think it makes sense to compare implementations of the Boyer-Moore algorithm(strutils/FindMatchesBoyerMoore***).

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11447
  • FPC developer.
The Performance dropped with the PCharSubStringByString_v2() Function
and for the PCharSubStringByString_v4() Function it was just the same as for the Initial Version.

I suppose that the Assembly Code of IndexByte() does just the same as the PCharSubStringByString() Function.

Indexbyte will only be worthwhile for longer strings. You use it to find the first matching char and then do a comparebyte to compare the found ptrs.


domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
I got some new benchmarks for the given Test Data:

$ grep " 11:48:" small_firewall.log|wc -l
32
$ grep " 11:48:" small_firewall.log|grep -ioE "^.* 11:48:[0-2][0-9] .*[:'][^:']*(udp in|drop)[^:']*[:'].*\$"|wc -l
15
$ time grep " 11:48:" big_firewall.log |grep -ioE "^.* 11:48:[0-2][0-9] .*[:'][^:']*(udp in|drop)[^:']*[:'].*\$"|wc -l
0

real   0m0.100s
user   0m0.079s
sys   0m0.025s

the grep is the best performant tool for the search.

Also I implemented the same logic with the Rust Programming Language.
With the big file it achieves the same speed as Perl

$ time target/release/string_search
Expression '(?m)^.* 11:48:[0-2][0-9] .*[:'][^:']*(?i:udp in|drop)[^:']*[:'].*$': Match Count: '0'

real   0m0.205s
user   0m0.142s
sys   0m0.063s


And also in PHP
It performs not so good on the small file but becomes better with the bigger one

$ time ./udp_watch.php
Expression '^.* 11:48:[0-2][0-9] .*[:'][^:']*(udp in|drop)[^:']*[:'].*$': Match Count '15'

real   0m0.053s
user   0m0.035s
sys   0m0.019s
$ time ./udp_watch.php
Expression '^.* 11:48:[0-2][0-9] .*[:'][^:']*(udp in|drop)[^:']*[:'].*$': Match Count '0'

real   0m0.206s
user   0m0.160s
sys   0m0.046s

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
I think it makes sense to compare implementations of the Boyer-Moore algorithm(strutils/FindMatchesBoyerMoore***).

I searched around and found an Unit at:
https://forum.lazarus.freepascal.org/index.php/topic,23136.0.html

And the Procedure FindMatchesBoyerMooreCaseSensitive()
https://www.freepascal.org/docs-html/rtl/strutils/findmatchesboyermoorecasesensitive.html

The Unit from the Post performs really badly.
The Procedure from strutils did not work when I just included the strutils Unit.
but It started to work when I copied the sourcecode into a new file.

Now the FreePascal Application catches up with the other Implementations:
On the small file:

$ time ./runstringsearch
file: './small_firewall.log': read do ...
Expression '^.* 11:48:[0-2][0-9] .*[:'][^:']*(udp in|drop)[^:']*[:'].*$': Match Count '15'
BuildPacketList - finished with [ 0 ]!

real   0m0.009s
user   0m0.007s
sys   0m0.002s


And on the big file:

$ time ./runstringsearch
file: './big_firewall.log': read do ...
Expression '^.* 11:48:[0-2][0-9] .*[:'][^:']*(udp in|drop)[^:']*[:'].*$': Match Count '0'
BuildPacketList - finished with [ 0 ]!

real   0m0.229s
user   0m0.204s
sys   0m0.026s


I see a clear advantage at the sys time consumtion.

Still looking at the Source Code I noticed that the Character Mapping Table is recreated on each search.
On repetitive search an Search Object could give some advantage in processing.

domibay_hugo

  • New Member
  • *
  • Posts: 37
  • Site Reliabilty / DevOps Engineer at Domibay S.L.
    • GitHub Profile
I rather think that the solution lies in the Search Algorithm
Comparing the common Brute Force approach:
https://en.wikipedia.org/wiki/Brute-force_search
Against the documented Algorithm of the Perl engine:
https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm

I found this explanation on this issue which confirms my impression:
https://stackoverflow.com/questions/25965841/what-algorithm-is-used-for-pos-function-in-pascal

trev

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2020
  • Former Delphi 1-7, 10.2 user
https://www.freepascal.org/docs-html/rtl/strutils/findmatchesboyermoorecasesensitive.html

The last two lines contain typos:

Quote
FindMatchesBoyerMooreCaseInSensitive Find case-sensitive matches of a string using a Boyer-Moore algorithm
SringReplace

 

TinyPortal © 2005-2018