
Author Topic: string.StartsWith vs. UTF8StartsText  (Read 880 times)

stoffman

  • Jr. Member
  • **
  • Posts: 67
string.StartsWith vs. UTF8StartsText
« on: April 02, 2023, 10:05:56 am »
I have a TStringList populated with valid UTF-8 strings, and I need to find a string that begins with a given text. Both string.StartsWith and UTF8StartsText seem to do what I want, BUT UTF8StartsText is about an order of magnitude slower. As this is a hot loop, I need it to work as fast as possible.

So is it safe to use string.StartsWith? (My text is not just Latin.) And if so, why does UTF8StartsText even exist?

Thanks,
« Last Edit: April 02, 2023, 10:25:28 am by stoffman »

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: string.StartsWith vs. UTF8StartsText
« Reply #1 on: April 02, 2023, 10:42:40 am »
If you check the sources you will see that .StartsWith uses CompareStr, which works on ansistrings.
There is a note in the source as follows:
"   CompareStr compares S1 and S2, the result is the based on
    substraction of the ascii values of the characters in S1 and S2"

Limiting the comparison only to ascii values shows why the routine is order(s) of magnitude faster than a specific utf8 comparison, which has to consider multibyte codepoints, and not just single byte ascii values.
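To illustrate the byte-comparison idea for the case-sensitive case, here is a minimal sketch. It assumes both strings are valid UTF-8 in the same normalization form, so that equal text means equal bytes. StartsWithBytes is a hypothetical helper, not something from the RTL or LazUtils, and it is not how StartsWith itself is implemented; CompareByte comes from the System unit.

```pascal
program PrefixBytes;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of the RTL or LazUtils): case-sensitive
// prefix test that compares raw bytes and allocates no temporary string.
// Assumption: an empty prefix counts as a match.
function StartsWithBytes(const APrefix, AText: string): Boolean;
var
  PrefixLen: SizeInt;
begin
  PrefixLen := Length(APrefix);
  if PrefixLen = 0 then
    Exit(True);
  if PrefixLen > Length(AText) then
    Exit(False);
  // Compare the first PrefixLen bytes of both strings in place.
  Result := CompareByte(AText[1], APrefix[1], PrefixLen) = 0;
end;

var
  Cafe: string;
begin
  // "café" written as raw UTF-8 bytes, so the source file encoding
  // does not matter.
  Cafe := 'caf' + #$C3#$A9;

  WriteLn(StartsWithBytes('abc', 'abcd'));            // TRUE
  WriteLn(StartsWithBytes('abc', 'aBcd'));            // FALSE (case differs)
  WriteLn(StartsWithBytes('caf' + #$C3#$A9, Cafe + ' au lait')); // TRUE
end.
```

Because nothing is copied, this kind of check suits a hot loop, but it says nothing about case-insensitive matching.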

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9791
  • Debugger - SynEdit - and more
    • wiki
Re: string.StartsWith vs. UTF8StartsText
« Reply #2 on: April 02, 2023, 10:44:48 am »
Do you need to be case sensitive?  Does 'aBcd' start with 'abc'?

UTF8StartsText is case-insensitive.
string.StartsWith has an optional argument for ignoring case, which defaults to false.

A case-sensitive search is faster.
However, if StartsWith is used case-insensitively, it may still be faster, as it may only apply this to a..z and may not compare accented chars or other languages case-insensitively (afaik / not 100% sure).
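A small program along these lines can be used to check the behaviour on your own FPC version. It is only a sketch: the test strings are made up, and the last line deliberately has no expected result, since whether an accented capital matches its lowercase form is exactly the open question here.

```pascal
program IgnoreCaseDemo;
{$mode objfpc}{$H+}

uses
  SysUtils; // provides the StartsWith string helper

var
  Ascii, Accented: string;
begin
  Ascii := 'aBcd';
  // U+00C9 (capital E with acute) followed by 'a', written as raw UTF-8
  // bytes so the result does not depend on the source file encoding.
  Accented := #$C3#$89 + 'a';

  WriteLn(Ascii.StartsWith('abc'));        // case-sensitive: FALSE
  WriteLn(Ascii.StartsWith('abc', True));  // ignore case:    TRUE
  // Prefix is U+00E9 (small e with acute). Run it to see what your
  // RTL version does with non-ASCII letters.
  WriteLn(Accented.StartsWith(#$C3#$A9, True));
end.
```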






JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: string.StartsWith vs. UTF8StartsText
« Reply #3 on: April 02, 2023, 01:31:50 pm »
Quote from howardpc: "If you check the sources you will see that .StartsWith uses CompareStr, which works on ansistrings."
It also calls Copy() and thus is not optimized.
If a case-sensitive comparison is enough, you can also safely use StartsStr() from StrUtils or LazStartsStr() from LazStringUtils.
Case-insensitive comparison involves complex Unicode rules and is indeed much slower. Still, UTF8StartsText is optimized for cases where the text is all-ASCII.
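For the case-sensitive route, a lookup over a TStringList with StartsStr might look like the sketch below. IndexOfPrefix is a hypothetical helper and the list contents are made up for illustration. Note that StartsStr takes the prefix as its first argument and the text to search as its second.

```pascal
program FindPrefix;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, StrUtils;

// Hypothetical helper: index of the first item that starts with APrefix
// (case-sensitive), or -1 if no item matches.
function IndexOfPrefix(AList: TStrings; const APrefix: string): Integer;
var
  i: Integer;
begin
  for i := 0 to AList.Count - 1 do
    if StartsStr(APrefix, AList[i]) then
      Exit(i);
  Result := -1;
end;

var
  List: TStringList;
begin
  List := TStringList.Create;
  try
    List.Add('alpha');
    List.Add('beta');
    List.Add('betamax');
    WriteLn(IndexOfPrefix(List, 'bet')); // 1
    WriteLn(IndexOfPrefix(List, 'Bet')); // -1, the comparison is case-sensitive
  finally
    List.Free;
  end;
end.
```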
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

stoffman

  • Jr. Member
  • **
  • Posts: 67
Re: string.StartsWith vs. UTF8StartsText
« Reply #4 on: April 02, 2023, 03:31:27 pm »
So, after some testing I found out the following:
1. For non-English text, I tested with 2 multibyte languages, and string.StartsWith works as well as UTF8StartsText. Whether that is just luck, and there are corner cases that UTF8StartsText handles correctly while string.StartsWith doesn't, I don't know.

2. For English I have no doubt that they both work correctly.

3. It looks like some functions in LazUTF8 are missing optimization opportunities. In C/Rust, and in FreePascal's StartsWith, everything resolves to fast methods of comparing bytes in memory without copies or new allocations.

4. Having 4 options to check if a string starts with another string is just :( but I understand how we came to this.

Thank you all for the help

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: string.StartsWith vs. UTF8StartsText
« Reply #5 on: April 02, 2023, 05:05:49 pm »
@stoffman, you clearly missed the case-insensitive part of the discussion.
UTF8StartsText is case-insensitive. Converting UTF-8 text to lowercase or uppercase involves complex rules, some of which depend on locale. The uppercase version of a character in one country / language can differ from its uppercase version in another.

stoffman

  • Jr. Member
  • **
  • Posts: 67
Re: string.StartsWith vs. UTF8StartsText
« Reply #6 on: April 02, 2023, 10:10:46 pm »
@JuhaManninen you're right. The languages that I tested don't have upper/lower case.

string.StartsWith does have an "ignore case" option, but I didn't test that.

 
