Lazarus

Title: string.StartsWith vs. UTF8StartsText
Post by: stoffman on April 02, 2023, 10:05:56 am
I have a TStringList populated with valid UTF-8 strings, and I need to find a string that begins with a given text. Both string.StartsWith and UTF8StartsText seem to do what I want, but UTF8StartsText is about an order of magnitude slower. As this is a hot loop, I need it to be as fast as possible.

So is it safe to use string.StartsWith? (My text is not just Latin.) And if so, why does UTF8StartsText even exist?

Thanks,
Title: Re: string.StartsWith vs. UTF8StartsText
Post by: howardpc on April 02, 2023, 10:42:40 am
If you check the sources you will see that .StartsWith uses CompareStr, which works on AnsiStrings.
There is a note in the source as follows:
"   CompareStr compares S1 and S2, the result is the based on
    substraction of the ascii values of the characters in S1 and S2"

Comparing only single-byte values explains why the routine is order(s) of magnitude faster than a UTF-8-specific comparison, which has to consider multibyte codepoints and not just single-byte ASCII values.
Title: Re: string.StartsWith vs. UTF8StartsText
Post by: Martin_fr on April 02, 2023, 10:44:48 am
Do you need to be case sensitive?  Does 'aBcd' start with 'abc'?

UTF8StartsText is case insensitive.
string.StartsWith has an optional ignore-case argument, which defaults to false.

A case-sensitive search is faster.
However, even if StartsWith is used case insensitively, it may still be faster, as it may apply case folding only to a..z and not compare accented chars or other languages case insensitively (afaik / not 100% sure).
Title: Re: string.StartsWith vs. UTF8StartsText
Post by: JuhaManninen on April 02, 2023, 01:31:50 pm
howardpc wrote: "If you check the sources you will see that .StartsWith uses CompareStr, which works on ansistrings."
It also calls Copy() and thus is not optimized.
If case-sensitive comparison is enough, you can also safely use StartsStr() from StrUtils or LazStartsStr() from LazStringUtils.
Case-insensitive comparison involves complex Unicode rules and is indeed much slower. Still, UTF8StartsText is optimized for cases where the text is all-ASCII.
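
A minimal sketch of the case-sensitive options, for illustration only. BytesStartWith is a hypothetical helper (not part of the RTL or LazUtils) that compares raw bytes without Copy() or any allocation; treating an empty prefix as a match is an assumption of this sketch. A byte-wise prefix test is enough for case-sensitive matching when both strings are valid UTF-8, because UTF-8 is self-synchronizing: a complete encoded prefix cannot match in the middle of a multibyte sequence. It does not handle case folding or different normalization forms of the same text.

```pascal
program PrefixSketch;
{$mode objfpc}{$H+}

uses
  SysUtils, StrUtils;

// Hypothetical helper for illustration; not part of the RTL or LazUtils.
// Case-sensitive: compares raw bytes, no Copy() and no allocation.
function BytesStartWith(const APrefix, AText: string): Boolean;
var
  PrefixLen: SizeInt;
begin
  PrefixLen := Length(APrefix);
  if PrefixLen = 0 then
    Exit(True);   // assumption of this sketch: an empty prefix matches
  if PrefixLen > Length(AText) then
    Exit(False);
  Result := CompareByte(AText[1], APrefix[1], PrefixLen) = 0;
end;

var
  Line, Prefix: string;
begin
  // #$C3#$A9 is the UTF-8 encoding of U+00E9, written as bytes so the
  // example does not depend on the source file encoding.
  Line := 'caf'#$C3#$A9' au lait';
  Prefix := 'caf'#$C3#$A9;

  WriteLn(BytesStartWith(Prefix, Line));   // hypothetical helper above
  WriteLn(StartsStr(Prefix, Line));        // StrUtils, argument order (ASubText, AText)
  WriteLn(Line.StartsWith(Prefix));        // SysUtils string helper, case-sensitive
end.
```

None of these three calls does case-insensitive matching; for that, UTF8StartsText (or StartsWith with its ignore-case argument, within its limits) is still needed.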
Title: Re: string.StartsWith vs. UTF8StartsText
Post by: stoffman on April 02, 2023, 03:31:27 pm
So, after some testing I found out the following:
1. For non-English text, I tested with two languages that use multibyte characters, and string.StartsWith works as well as UTF8StartsText. Whether that is just luck, and there are corner cases where UTF8StartsText works correctly while string.StartsWith doesn't, I don't know.

2. For English I have no doubt that they both work correctly.

3. It looks like some functions in LazUTF8 are missing optimization opportunities. In C/Rust and FreePascal's StartsWith, everything resolves to fast methods of comparing bytes in memory without copies or new allocations.

4. Having 4 options to check if a string starts with another string is just  :( but I understand how we came to this.

Thank you all for the help
Title: Re: string.StartsWith vs. UTF8StartsText
Post by: JuhaManninen on April 02, 2023, 05:05:49 pm
@stoffman, you clearly missed the case-insensitive part of the discussion.
UTF8StartsText is case-insensitive. Converting UTF-8 text to lowercase or uppercase involves complex rules, some of which depend on locale. In one country / language an uppercase version of a character can differ from another country / language.
Title: Re: string.StartsWith vs. UTF8StartsText
Post by: stoffman on April 02, 2023, 10:10:46 pm
@JuhaManninen you are right. The languages that I tested don't have upper/lower case.

string.StartsWith does have an "ignore case" option, but I didn't test that.