Topic: How to search UTF8 characters with regular expressions

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: How to search UTF8 characters with regular expressions
« Reply #30 on: October 28, 2017, 03:50:06 pm »
I use "LazUnicode" and "for ... in" to traverse a string, but it doesn't work:

Code: Pascal
uses
  LazUnicode;

procedure TForm1.Button1Click(Sender: TObject);
var
  Str: String;
  ch: Char;
begin
  Str := '一二三四五六七八九十';
  for ch in Str do begin
    writeln(ch);
  end;
end;

howardpc

  • Hero Member
  • *****
  • Posts: 4144
« Reply #31 on: October 28, 2017, 04:23:41 pm »
Juha gave you the code for this earlier.
Change the type of ch from Char to String.
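
For reference, the suggested change could look like the following console program. It is only a minimal sketch: it assumes the LazUtils package (which contains unit LazUnicode) is available to the project and that the source file is saved as UTF-8. CountItems is a hypothetical helper added for illustration.

```pascal
program CodePointLoop;
{$mode objfpc}{$H+}

uses
  LazUnicode;  // supplies the enumerator used by "for ... in" on a String

// Hypothetical helper: counts the items the LazUnicode enumerator yields.
function CountItems(const S: String): Integer;
var
  ch: String;  // String, not Char: one item may occupy several bytes
begin
  Result := 0;
  for ch in S do
    Inc(Result);
end;

var
  Str, ch: String;
begin
  Str := '一二三四五六七八九十';
  for ch in Str do
    WriteLn(ch);             // one whole codepoint per line
  WriteLn(CountItems(Str));
end.
```

The only difference from the code in the question is the type of the loop variable. A Char can hold just one byte of a multi-byte UTF-8 sequence, while a String can hold the whole sequence.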

tomitomy

« Reply #32 on: October 28, 2017, 04:53:07 pm »
Quote
Juha gave you the code for this earlier.
Change the type of ch from Char to String.

I'm sorry, I did not read the code carefully before. Thank you for reminding me.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4574
  • I like bugs.
« Reply #33 on: October 29, 2017, 09:40:48 am »
Quote
A better and cleaner and more readable solution is to use an iterator defined in unit LazUnicode.
I must confess that I have never used this unit. What confused me is the "Unicode" in its name, which makes it appear to be something like a helper unit for Delphi's kind of Unicode (UTF-16), but actually it is for UTF-8.
No. It is encoding agnostic. It also works with UTF-16.

Quote
Why didn't you put these routines into LazUTF8?
It uses LazUTF8 when the encoding is UTF-8 and LazUTF16 when it is UTF-16.
@wp, did you look at the unit? I guess not. Things are explained in its comment header. They are also explained in the wiki:
 http://wiki.freepascal.org/Unicode_Support_in_Lazarus#CodePoint_functions_for_encoding_agnostic_code
I have advertised the unit on the mailing list and in this forum, but apparently not enough.

Quote
I know that the term "Unicode" contains UTF8 as well as UTF16 as well as UTF32, but when people talk of "unicode" they usually mean the UTF-16 of Delphi only. Damn confusing...
Who talks of "unicode" that way? Then you must correct him.
The confusing part is naming a type UnicodeString instead of UTF16String but most people know the history and the facts now. I don't believe it is a big source of confusion any more.
« Last Edit: October 29, 2017, 10:05:50 am by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

JuhaManninen

« Reply #34 on: October 29, 2017, 10:03:24 am »
Regarding this syntax:
Code: Pascal
for ch in S do
  ...
Quote
Then it becomes a codepoint comparison ch = $4e94. It is faster than a string comparison, too.
No. It is not faster if you must convert the UTF-8 codepoint into an integer.
Besides, where do you get a number like $4e94 from? You must look it up on some Unicode web page. It is less intuitive than using the textual representation.

Quote
Alternatively the iterator could use pchar + length. Then it does not need to allocate a new string and can be used with a pos for pchars. I try to replace all temporary string usages with pchar + length.
Yes, that is faster. However it requires two variables instead of one.
The "for ... in" iterator calls SetLength() for each "character" and copies the bytes. It provides a very intuitive syntax and is fast enough for most purposes.
« Last Edit: October 29, 2017, 10:33:31 am by JuhaManninen »

JuhaManninen

« Reply #35 on: October 29, 2017, 10:12:47 am »
Quote
I am confused. I do not know the difference between String, WideString and UnicodeString, or what "{$CODEPAGE UTF8}", "uses LazUTF8" and "uses LazUnicode" change. Why is it so complicated?
It is so complicated because of decisions made in 3 projects: Delphi, FPC and Lazarus.
Please read this carefully and you will understand the decisions made for Lazarus better:
 http://wiki.freepascal.org/Unicode_Support_in_Lazarus

Quote
I use "LazUnicode" and "for ... in" to traverse a string, but it doesn't work:
With "ch: Char;" it actually works, but it iterates codeunits, which means 8-bit chars (single bytes) with UTF-8.
That could be useful if you are interested in ASCII chars only.
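
To illustrate the codeunit behaviour, here is a small sketch in plain Free Pascal, without LazUnicode. The function names are hypothetical, and the test string is written with byte escapes so the example does not depend on the encoding of the source file.

```pascal
program CodeUnitLoop;
{$mode objfpc}{$H+}

// Counts codeunits: with a Char loop variable every byte is one item.
function CountCodeUnits(const S: String): Integer;
var
  c: Char;
begin
  Result := 0;
  for c in S do
    Inc(Result);
end;

// Counts ASCII chars only. This is safe at codeunit level because bytes
// below $80 never occur inside a multi-byte UTF-8 sequence.
function CountAscii(const S: String): Integer;
var
  c: Char;
begin
  Result := 0;
  for c in S do
    if Ord(c) < $80 then
      Inc(Result);
end;

const
  // 'a', then U+4E00 encoded as three UTF-8 bytes, then 'b'
  Sample = 'a'#$E4#$B8#$80'b';
begin
  WriteLn(CountCodeUnits(Sample));  // number of bytes
  WriteLn(CountAscii(Sample));      // number of ASCII chars
end.
```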

Unicode experts, what do you say about using the TUnicodeCharacterEnumerator by default for the "for ... in" syntax?
Then LazUnicode would have:
Code: Pascal
operator Enumerator(A: String): TUnicodeCharacterEnumerator;
instead of
Code: Pascal
operator Enumerator(A: String): TCodePointEnumerator;
It would take care of combining diacritical marks, accent marks etc.
Would it cause harm?
My knowledge of Unicode is also limited. I know codeunits, codepoints and combining codepoints to some extent, but there are more complex rules which I don't know well.
My understanding is that support for combining diacritical marks would cover the rules used for most western languages.
« Last Edit: October 29, 2017, 10:39:16 am by JuhaManninen »

BeniBela

  • Hero Member
  • *****
  • Posts: 927
    • homepage
« Reply #36 on: October 31, 2017, 04:05:17 pm »
Quote
No. It is not faster if you must convert the UTF-8 codepoint into an integer.
Besides, where do you get a number like $4e94 from? You must look it up on some Unicode web page. It is less intuitive than using the textual representation.

Memory allocation is always the slowest part.

When you check a range, (cp >= $4e94) and (cp <= $4e9f) is clearer than (s >= '五') and (s <= '亟').
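
A numeric range check of that kind could be sketched as follows. This is not code from LazUTF8 or LazUnicode: DecodeUtf8 and InRange are hypothetical helpers, and the decoder assumes well-formed UTF-8 and does no validation.

```pascal
program CodePointRange;
{$mode objfpc}{$H+}

// Hypothetical helper: decodes the UTF-8 sequence starting at P into a
// codepoint number. Assumes well-formed input; no validation is done.
function DecodeUtf8(P: PChar): Cardinal;
var
  b: Cardinal;
begin
  b := Ord(P[0]);
  if b < $80 then
    Result := b                                   // 1 byte (ASCII)
  else if (b and $E0) = $C0 then
    Result := ((b and $1F) shl 6)                 // 2 bytes
              or (Cardinal(Ord(P[1])) and $3F)
  else if (b and $F0) = $E0 then
    Result := ((b and $0F) shl 12)                // 3 bytes
              or ((Cardinal(Ord(P[1])) and $3F) shl 6)
              or (Cardinal(Ord(P[2])) and $3F)
  else
    Result := ((b and $07) shl 18)                // 4 bytes
              or ((Cardinal(Ord(P[1])) and $3F) shl 12)
              or ((Cardinal(Ord(P[2])) and $3F) shl 6)
              or (Cardinal(Ord(P[3])) and $3F);
end;

// True if the first codepoint of S lies in U+4E94..U+4E9F.
function InRange(const S: String): Boolean;
var
  cp: Cardinal;
begin
  cp := DecodeUtf8(PChar(S));
  Result := (cp >= $4E94) and (cp <= $4E9F);
end;

begin
  WriteLn(InRange(#$E4#$BA#$94));  // U+4E94, the lower bound
  WriteLn(InRange(#$E4#$B8#$80));  // U+4E00, below the range
end.
```

No temporary string is allocated here. The cost is the decoding step, which is the conversion the quoted objection refers to.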

Quote
Alternatively the iterator could use pchar + length. Then it does not need to allocate a new string and can be used with a pos for pchars. I try to replace all temporary string usages with pchar + length.
Yes, that is faster. However it requires two variables instead of one.

Too bad the iterator syntax does not allow two variables.

Now I am starting to worry: is it faster to use pchar + length or pchar + end pchar?

Code:
while len > 0 do begin
  dec(len);
  inc(p);
end;

or

Code:
while p < endpchar do begin
  inc(p);
end;

The second saves us one decrement, but the comparison needs an implicit subtraction.

Perhaps the first is better with local variables and the second with an iterator?
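
For comparison, both loop shapes can be written out as complete functions that count codepoints without allocating any temporary string. This is only a sketch with hypothetical function names, and it makes no claim about which variant is faster.

```pascal
program PCharLoops;
{$mode objfpc}{$H+}

// Variant 1: pointer plus remaining length.
function CountByLength(const S: String): SizeInt;
var
  p: PChar;
  len: SizeInt;
begin
  Result := 0;
  p := PChar(S);
  len := Length(S);
  while len > 0 do
  begin
    // A byte that is not a continuation byte (10xxxxxx) starts a codepoint.
    if (Ord(p^) and $C0) <> $80 then
      Inc(Result);
    Dec(len);
    Inc(p);
  end;
end;

// Variant 2: pointer plus end pointer.
function CountByEndPointer(const S: String): SizeInt;
var
  p, pEnd: PChar;
begin
  Result := 0;
  p := PChar(S);
  pEnd := p + Length(S);
  while p < pEnd do
  begin
    if (Ord(p^) and $C0) <> $80 then
      Inc(Result);
    Inc(p);
  end;
end;

const
  // 'a', then U+4E00 encoded as three UTF-8 bytes, then 'b'
  Sample = 'a'#$E4#$B8#$80'b';
begin
  WriteLn(CountByLength(Sample), ' ', CountByEndPointer(Sample));
end.
```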

Quote
The "for ... in" iterator calls SetLength() for each "character"

That could be skipped by preallocating strings of 1, 2, 3 and 4 bytes in length and reusing them.



Quote
Unicode experts, what do you say about using the TUnicodeCharacterEnumerator by default for the "for ... in" syntax?
Then LazUnicode would have:
Code: Pascal
operator Enumerator(A: String): TUnicodeCharacterEnumerator;
instead of
Code: Pascal
operator Enumerator(A: String): TCodePointEnumerator;
It would take care of combining diacritical marks, accent marks etc.
Would it cause harm?
My knowledge of Unicode is also limited. I know codeunits and codepoints and combining codepoints somehow but there are more complex rules which I don't know well.
My understanding is that support for combining diacritical marks would cover the rules used for most western languages.

That will probably be even slower.

We need to think about the use case for this. Why would people need to iterate over a string, one character at a time, in one string?

That is too limited for most use cases. They probably need something else.

It is most useful for low-level implementations such as getting a UTF-8 string length, and there it needs to be as fast as possible.

JuhaManninen

« Reply #37 on: October 31, 2017, 08:31:15 pm »
Quote
Too bad the iterator syntax does not allow two variables.
Now I am starting to worry: is it faster to use pchar + length or pchar + end pchar?
...
The second saves us one decrement, but the comparison needs an implicit subtraction.
It is irrelevant! We are talking about maybe +-1 clock cycle, less than a nanosecond.

Quote
That could be skipped by preallocating strings of 1, 2, 3, 4 byte length and reusing them.
Actually that is a good idea. It does not scale well with the combining codepoints (TUnicodeCharacterEnumerator) but the most common cases, maybe lengths 1..3, could be optimized.

About TUnicodeCharacterEnumerator:
Quote
That will probably be even slower.
Yes, combining many codepoints together is obviously slower than just taking one codepoint.

Quote
We need to think about the use case for this. Why would people need to iterate over a string, one character at a time, in one string?
That is the whole point of an iterator and it is needed sometimes. For example it was a valid solution for a question in this thread, as an alternative to a regexpr.
If somebody does not want to iterate over a string then he will not use the iterator obviously.

Quote
That is too limited for most use cases. They probably need something else.
It is most useful for low-level implementations such as getting a UTF-8 string length, and there it needs to be as fast as possible.
I am not sure what you mean.
Taking care of a whole Unicode "character" including its combining diacritical marks is better and less limited than dealing with just a codepoint.
Actually I cannot think of any situation where I would want to handle a letter 'a' and its accent mark '´' separately. They belong together. That's why combining codepoints were invented! They are meant to combine.
The only worry is that the rules for combining codepoints are more complex than just diacritical marks. TUnicodeCharacterEnumerator could give a user a false sense of security: they could think it always works, while it may not. That's why I asked here about the ramifications.

Let's have an example of combining codepoints. Try with a button and a memo on a form.
First, the for...in iterator currently only goes through codepoints.
Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
const
  Combining = 'ÓÓỐỐỚỚÒÒỒỒỎỎỔỔỞỞỌỌBあC'#$CC#$81#$CC#$B2;
var
  ch: String;
begin
  for ch in Combining do
    Memo1.Lines.Add(ch);
end;
Not good! Now let's try with TUnicodeCharacterEnumerator explicitly.
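Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
const
  // The trailing bytes encode U+0301 and U+0332, two combining codepoints attached to 'C'.
  Combining = 'ÓÓỐỐỚỚÒÒỒỒỎỎỔỔỞỞỌỌBあC'#$CC#$81#$CC#$B2;
var
  ucIter: TUnicodeCharacterEnumerator;
begin
  ucIter := TUnicodeCharacterEnumerator.Create(Combining);
  try
    // Each Current value is a base codepoint together with its combining codepoints.
    while ucIter.MoveNext do
      Memo1.Lines.Add(ucIter.Current);
  finally
    ucIter.Free;  // released even if adding a line raises an exception
  end;
end;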
Better!
Note, the last 'C' has 2 extra codepoints connected. There could be more, Unicode does not limit the number.
Note2: SynEdit is not able to show the text correctly but that is another bug.
Note3: UTF-8 / UTF-16 encodings make no difference here. Combining codepoints go beyond encodings.
Note4: The enumerators and helper functions in unit LazUnicode are encoding agnostic. They work equally well with UTF-8 and UTF-16.
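
To make the grouping idea concrete, here is a deliberately simplified sketch in plain Free Pascal. It is not the implementation used by TUnicodeCharacterEnumerator. It only recognises the Combining Diacritical Marks block U+0300..U+036F, assumes well-formed UTF-8, and uses hypothetical function names. The full Unicode rules for what belongs to one "character" are broader than this.

```pascal
program CombiningSketch;
{$mode objfpc}{$H+}

// True if P points at the lead byte of a codepoint in U+0300..U+036F.
// In UTF-8 that block is encoded as CC 80..CC BF and CD 80..CD AF.
function IsCombiningMark(P: PChar): Boolean;
begin
  Result := (P[0] = #$CC) or ((P[0] = #$CD) and (P[1] <= #$AF));
end;

// Counts "characters": a base codepoint plus any directly following
// marks from that block count as one item.
function CountCharacters(const S: String): Integer;
var
  p, pEnd: PChar;
begin
  Result := 0;
  p := PChar(S);
  pEnd := p + Length(S);
  while p < pEnd do
  begin
    // Count a byte only if it starts a codepoint (is not 10xxxxxx)
    // and that codepoint is not a combining mark.
    if ((Ord(p^) and $C0) <> $80) and not IsCombiningMark(p) then
      Inc(Result);
    Inc(p);
  end;
end;

const
  // 'B', then 'C' followed by U+0301 and U+0332, as in the example above
  Sample = 'BC'#$CC#$81#$CC#$B2;
begin
  WriteLn(CountCharacters(Sample));
end.
```

A combining mark at the very start of a string has no base codepoint and is not counted by this sketch. That is one of the edge cases a real implementation has to decide on.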

If I don't get any well justified objections, I will change the iterator to use TUnicodeCharacterEnumerator by default.
« Last Edit: November 03, 2017, 05:31:57 pm by JuhaManninen »

tomitomy

« Reply #38 on: November 01, 2017, 12:45:24 pm »
Quote
I am confused. I do not know the difference between String, WideString and UnicodeString, or what "{$CODEPAGE UTF8}", "uses LazUTF8" and "uses LazUnicode" change. Why is it so complicated?
It is so complicated because of decisions made in 3 projects: Delphi, FPC and Lazarus.
Please read this carefully and you will understand the decisions made for Lazarus better:
 http://wiki.freepascal.org/Unicode_Support_in_Lazarus

Thank you JuhaManninen, I'm going to read it.

JuhaManninen

« Reply #39 on: November 03, 2017, 05:42:43 pm »
Quote
If I don't get any well justified objections, I will change the iterator to use TUnicodeCharacterEnumerator by default.
I changed it in r56260.
I also moved the console demo to the examples directory and added a new GUI demo. The GUI project is for Lazarus but it can easily be ported to Delphi. It should work by just creating a Delphi project around the .lpr file. The main unit's form file already uses the .dfm extension and has no Lazarus-specific layout anchors.
Earlier in r56259 I optimized the strings returned by enumerators. I did not measure the speedup. If somebody wants to measure, please tell me the results.
Everybody please test.

P.S.
The new GUI demo adds lines to a Memo. With the Qt bindings that is horribly slow, taking many seconds for the shortish text. With GTK2 the text appears immediately. I must figure out later what is going on there.
« Last Edit: November 03, 2017, 05:50:15 pm by JuhaManninen »