Too bad the iterator syntax does not allow two variables
Now I am starting to worry, is it faster to use pchar + length or pchar + end pchar?
...
The second saves us one decrement, but the comparison needs an implicit subtraction
It is irrelevant! We are talking about maybe ±1 clock cycle, less than a nanosecond.
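For reference, the two loop shapes being compared could look roughly like this. It is only a sketch: the function names CountWithLength and CountWithEnd are made up for illustration and are not taken from LazUTF8 or LazUnicode. Both count codepoints by skipping UTF-8 continuation bytes.

```pascal
program LoopStyles;
{$mode objfpc}{$H+}

// Hypothetical helper, style A: pointer plus remaining length.
// Counts every byte that is not a UTF-8 continuation byte (10xxxxxx).
function CountWithLength(p: PChar; len: SizeInt): SizeInt;
begin
  Result := 0;
  while len > 0 do
  begin
    if (Ord(p^) and $C0) <> $80 then
      Inc(Result);
    Inc(p);
    Dec(len);          // the extra decrement per byte
  end;
end;

// Hypothetical helper, style B: pointer plus end pointer.
function CountWithEnd(p, pEnd: PChar): SizeInt;
begin
  Result := 0;
  while p < pEnd do    // pointer comparison instead of a counter
  begin
    if (Ord(p^) and $C0) <> $80 then
      Inc(Result);
    Inc(p);
  end;
end;

var
  s: String;
begin
  // 'B', U+3042, 'C', U+0301, U+0332 written as raw UTF-8 bytes
  s := 'B'#$E3#$81#$82'C'#$CC#$81#$CC#$B2;
  WriteLn(CountWithLength(PChar(s), Length(s)));
  WriteLn(CountWithEnd(PChar(s), PChar(s) + Length(s)));
end.
```

Both calls should print the same count. Any speed difference between the two shapes would have to be measured, and it is likely lost in the noise next to the string allocation done for every returned codepoint.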
That could be skipped by preallocating strings of 1, 2, 3, 4 byte length and reusing them.
Actually that is a good idea. It does not scale well with the combining codepoints (TUnicodeCharacterEnumerator) but the most common cases, maybe lengths 1..3, could be optimized.
About TUnicodeCharacterEnumerator:
That will probably be even slower.
Yes, combining many codepoints together is obviously slower than just taking one codepoint.
We need to think about the use case of this. Why would people need to iterate over a string? One character at a time in one string.
That is the whole point of an iterator and it is needed sometimes. For example it was a valid solution for a question in this thread, as an alternative to a regexpr.
If somebody does not want to iterate over a string then he will not use the iterator obviously.
That is too limited for most use cases. They probably need something else.
It is most useful for low-level implementations like getting a UTF-8 string length, and there it needs to be as fast as possible.
I am not sure what you mean.
Taking care of a whole Unicode "character" including its combining diacritical marks is better and less limited than dealing with just a codepoint.
Actually I cannot think of any situation where I would want to handle a letter 'a' and its accent mark '´' separately. They belong together. That's why combining codepoints were invented! They are meant to combine.
The only worry is that the rules for combining codepoints are more complex than just diacritical marks. TUnicodeCharacterEnumerator could give a user a false feeling of security. He could think it always works when it may not. That's why I asked here about the ramifications.
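To show what "more complex than diacritical marks" means in practice, here is a deliberately naive check. This is a sketch under a stated assumption: it recognizes only the Combining Diacritical Marks block (U+0300..U+036F) in UTF-8, and the helper name IsCombiningDiacritical is hypothetical. It is not how TUnicodeCharacterEnumerator is implemented. Combining codepoints in other blocks would slip through a check like this.

```pascal
program CombiningCheck;
{$mode objfpc}{$H+}

// Hypothetical helper: True if the 2-byte UTF-8 sequence at p encodes a
// codepoint in U+0300..U+036F. Other combining ranges are NOT covered.
function IsCombiningDiacritical(p: PChar): Boolean;
begin
  // U+0300..U+033F is $CC $80..$BF, U+0340..U+036F is $CD $80..$AF
  Result := ((p[0] = #$CC) and (p[1] in [#$80..#$BF]))
         or ((p[0] = #$CD) and (p[1] in [#$80..#$AF]));
end;

var
  s: String;
  i, marks: Integer;
begin
  s := 'C'#$CC#$81#$CC#$B2;   // 'C' followed by U+0301 and U+0332
  marks := 0;
  i := 1;
  while i <= Length(s) do
  begin
    if IsCombiningDiacritical(@s[i]) then
    begin
      Inc(marks);
      Inc(i, 2);              // skip both bytes of the mark
    end
    else
      Inc(i);
  end;
  WriteLn('Combining marks found: ', marks);
end.
```

The two marks after 'C' are found here because both fall inside that single block. A user who relies on a check of this kind for arbitrary text gets exactly the false feeling of security described above.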
Let's have an example of combining codepoints. Try with a button and a memo on a form.
First, the for...in iterator currently only goes through codepoints.
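// Needs unit LazUnicode in the uses clause for the string enumerator.
procedure TForm1.Button1Click(Sender: TObject);
const
  Combining = 'ÓÓỐỐỚỚÒÒỒỒỎỎỔỔỞỞỌỌBあC'#$CC#$81#$CC#$B2;
var
  ch: String;
begin
  // One memo line per codepoint
  for ch in Combining do
    Memo1.Lines.Add(ch);
end;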
Not good! Now let's try with TUnicodeCharacterEnumerator explicitly.
procedure TForm1.Button1Click(Sender: TObject);
const
  Combining = 'ÓÓỐỐỚỚÒÒỒỒỎỎỔỔỞỞỌỌBあC'#$CC#$81#$CC#$B2;
var
  ucIter: TUnicodeCharacterEnumerator;
begin
  ucIter := TUnicodeCharacterEnumerator.Create(Combining);
  try
    // One memo line per "character", combining codepoints included
    while ucIter.MoveNext do
      Memo1.Lines.Add(ucIter.Current);
  finally
    ucIter.Free;  // release the enumerator even if Add raises
  end;
end;
Better!
Note: the last 'C' has 2 extra codepoints connected. There could be more; Unicode does not limit the number.
Note2: SynEdit is not able to show the text correctly but that is another bug.
Note3: UTF-8 / UTF-16 encodings make no difference here. Combining codepoints go beyond encodings.
Note4: The enumerators and helper functions in unit LazUnicode are encoding agnostic. They work equally well with UTF-8 and UTF-16.
If I don't get any well justified objections, I will change the iterator to use TUnicodeCharacterEnumerator by default.