
Author Topic: How do I parse UTF8 strings?  (Read 14405 times)

typo

  • Hero Member
  • *****
  • Posts: 3051
How do I parse UTF8 strings?
« on: March 28, 2010, 03:42:03 am »
For example, one can parse an ANSI string like this:

Code: [Select]
var
  i: Integer;
  s: string;
begin
  // Stop at Length(s) - 1 so that s[i + 1] stays inside the string.
  for i := 1 to Length(s) - 1 do
  begin
    if (s[i] = 'a') and (s[i + 1] = 'b') then go;
  end;
end;

But UTF8 strings cannot be parsed this way, because a character may occupy more than one byte. If I use pointers, how do I get access to the previous and next character?
« Last Edit: March 28, 2010, 03:45:56 am by typo »

dfeher

  • New Member
  • *
  • Posts: 19
Re: How do I parse UTF8 strings?
« Reply #1 on: March 28, 2010, 10:03:34 am »
I think you can parse a UTF8 string the same way, except that you must either typecast 'a' and 'b' to WideChar, or decode the UTF8 string with the UTF8Decode function and then compare.
« Last Edit: March 28, 2010, 10:05:30 am by dfeher »

Zoran

  • Hero Member
  • *****
  • Posts: 1988
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: How do I parse UTF8 strings?
« Reply #2 on: March 28, 2010, 10:40:57 am »
You can write this:

Code: [Select]
uses
  ..., LCLProc;

...

for I := 1 to UTF8Length(S) do begin
  if (UTF8Copy(S, I, 1) = 'a') and (UTF8Copy(S, I, 1) = 'b') then go;
end;


For manipulating strings, you should use the procedures from LCLProc.
First, notice that Length(S) is replaced with UTF8Length(S).
For this purpose the char S[I] cannot be used, because it is just one byte. UTF8Copy(S, I, 1), on the other hand, returns a UTF8String whose UTF8Length is 1, although it might contain more than one byte (which means that its Length might be more than 1).

See also the other analogous procedures and functions in the LCLProc unit, like UTF8Insert, UTF8Delete...
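
The difference between byte length and character count can be illustrated without LCLProc. The following is only a minimal sketch: CountCodePoints is a hypothetical helper written for this example (it is not an LCLProc routine), and it assumes the string holds valid UTF-8.

```pascal
program Utf8LengthDemo;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of LCLProc): counts code points by
// skipping UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
// Assumes S contains valid UTF-8.
function CountCodePoints(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: string;
begin
  // 'a', U+00E9 (two bytes: $C3 $A9), 'b'. The bytes are written out
  // explicitly so the result does not depend on the source file encoding.
  S := 'a'#$C3#$A9'b';
  WriteLn('Bytes:       ', Length(S));          // 4 bytes
  WriteLn('Code points: ', CountCodePoints(S)); // 3 characters
end.
```

Here Length reports 4 while the character count is 3, which is why S[I] and Length(S) alone are not enough for a UTF8 string.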
« Last Edit: March 28, 2010, 10:45:55 am by Zoran »
Swan, ZX Spectrum emulator https://github.com/zoran-vucenovic/swan

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1934
Re: How do I parse UTF8 strings?
« Reply #3 on: March 28, 2010, 11:26:57 am »
You could use the utf8scanner:
http://wiki.lazarus.freepascal.org/Theodp

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: How do I parse UTF8 strings?
« Reply #4 on: March 28, 2010, 12:28:27 pm »
Quote
Code: [Select]
if (UTF8Copy(S, I, 1) = 'a') and (UTF8Copy(S, I, 1) = 'b') then go;

But if I don't know the length of 'a', I cannot know where to look for 'b'.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1934
Re: How do I parse UTF8 strings?
« Reply #5 on: March 28, 2010, 12:51:26 pm »
Quote
Code: [Select]
if (UTF8Copy(S, I, 1) = 'a') and (UTF8Copy(S, I, 1) = 'b') then go;

But if I don't know the length of 'a', I cannot know where to look for 'b'.

It should work. "I" is the char index (code point) not the byte index.
But utf8scanner is easier to work with and faster.

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: How do I parse UTF8 strings?
« Reply #6 on: March 28, 2010, 01:10:07 pm »
I installed it, but Lazarus does not show me the corresponding set of components.

Zoran

  • Hero Member
  • *****
  • Posts: 1988
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: How do I parse UTF8 strings?
« Reply #7 on: March 28, 2010, 01:38:57 pm »
Quote
Code: [Select]
if (UTF8Copy(S, I, 1) = 'a') and (UTF8Copy(S, I, 1) = 'b') then go;

But if I don't know the length of 'a', I cannot know where to look for 'b'.

Oh, I see that I wrote "I" twice instead of "I + 1" in the second term, sorry.
Of course, it should be UTF8Copy(S, I + 1, 1) = 'b'.

That solution will work for you.
I believe that using the package Theo recommends is faster. I haven't tried it yet, but I will.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1934
Re: How do I parse UTF8 strings?
« Reply #8 on: March 28, 2010, 02:39:23 pm »
Quote
I installed it, but Lazarus does not show me the corresponding set of components.

It's not a visual component. Add the dependency "utf8tools" to the project (in project inspector) and add "uses utf8scanner" to your unit.

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: How do I parse UTF8 strings?
« Reply #9 on: March 29, 2010, 10:26:54 pm »
Quote
"I" is the char index (code point) not the byte index.

If so, "I" should be incremented by UTF8Length('a'), which does not happen. I do not understand. To me, it should be:

Code: [Select]
if (UTF8Copy(s, i, 1) = 'a')
              and (UTF8Copy(s, i + Length('a'), 1) = 'b') then go;   

See the example in the Wiki, using PChar:

Code: [Select]
uses LCLProc;
...
procedure IterateUTF8Characters(const AnUTF8String: string);
var
  p: PChar;
  unicode: Cardinal;
  CharLen: integer;
begin
  p:=PChar(AnUTF8String);
  repeat
    unicode:=UTF8CharacterToUnicode(p,CharLen);
    writeln('Unicode=',unicode);
    inc(p,CharLen);
  until (CharLen=0) or (unicode=0);
end;


I also think a for loop is inappropriate in this case.
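
As for reaching the previous and next character through a pointer, one possible approach is to step forward by the size of the current character and to step backward by skipping continuation bytes. The sketch below is only an illustration: CharSize, NextChar, PrevChar and CharIs are hypothetical helpers invented for this example (they are not LCLProc routines), and the code assumes valid UTF-8 with no embedded #0 characters.

```pascal
program Utf8Neighbours;
{$mode objfpc}{$H+}

// All helpers below are hypothetical, written for this sketch only.

// Byte length of the UTF-8 character starting at P (assumes valid UTF-8).
function CharSize(P: PChar): Integer;
begin
  case Ord(P^) of
    $00..$7F: Result := 1;
    $C0..$DF: Result := 2;
    $E0..$EF: Result := 3;
    $F0..$F7: Result := 4;
  else
    Result := 1; // stray continuation or invalid byte: step over one byte
  end;
end;

// Pointer to the character that follows the one at P.
function NextChar(P: PChar): PChar;
begin
  Result := P + CharSize(P);
end;

// Pointer to the character before P. Returns Start if P is already there.
function PrevChar(Start, P: PChar): PChar;
begin
  Result := P;
  if Result > Start then
    repeat
      Dec(Result);
    until (Result = Start) or ((Ord(Result^) and $C0) <> $80);
end;

// True if the character at P equals the single UTF-8 character in Ch.
// Ch must not be empty.
function CharIs(P: PChar; const Ch: string): Boolean;
begin
  Result := (CharSize(P) = Length(Ch))
    and (CompareByte(P^, Ch[1], Length(Ch)) = 0);
end;

var
  S: string;
  Start, P: PChar;
begin
  S := 'x'#$C3#$A9'ab';   // 'x', U+00E9 (two bytes), 'a', 'b'
  Start := PChar(S);
  P := Start;
  while P^ <> #0 do
  begin
    if CharIs(P, 'a') and CharIs(NextChar(P), 'b') then
      WriteLn('found "ab" at byte offset ', P - Start,
        ', previous character is ',
        CharSize(PrevChar(Start, P)), ' byte(s) long');
    P := NextChar(P);
  end;
end.
```

With this layout the loop never needs a character index: the pointer itself is the position, and the neighbours are found relative to it.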
« Last Edit: March 30, 2010, 12:58:36 am by typo »

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: How do I parse UTF8 strings?
« Reply #10 on: March 30, 2010, 01:00:35 am »
Well, I tested this code and it works:

         
Code: [Select]
repeat
  unicode := UTF8CharacterToUnicode(p, charlen);

  if  (utf8copy(s, i, 1) = 'c')
  and (utf8copy(s, i+length('c'), 1) = 'd')              // character to the right
  and (utf8copy(s, i+length('c')+length('d'), 1) = 'e')  // second character to the right
  and (utf8copy(s, i-length('b'), 1) = 'b')              // character to the left
  and (utf8copy(s, i-length('b')-length('a'), 1) = 'a')  // second character to the left
  then
    showmessage('abcde');

  inc(p, charlen);
  inc(i, charlen);
until (charlen = 0) or (unicode = 0); 
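
Note that the offsets in the loop above come from Length and charlen, which count bytes, while UTF8Copy expects character positions, as pointed out earlier in the thread. The two agree only while every character involved is a single byte. One way to avoid mixing the two kinds of index is to decode the string once into an array of code points and then look left and right by plain array index. The following is a rough sketch of that idea, not a tested replacement: DecodeUtf8 and TCodePoints are hypothetical names made up for this example, and invalid input is handled only crudely.

```pascal
program Utf8DecodeOnce;
{$mode objfpc}{$H+}

type
  TCodePoints = array of Cardinal;

// Hypothetical helper: decodes UTF-8 into an array of code points.
// Invalid lead bytes are replaced with U+FFFD.
function DecodeUtf8(const S: string): TCodePoints;
var
  I, N, Extra: Integer;
  B: Byte;
  CP: Cardinal;
begin
  SetLength(Result, Length(S)); // upper bound: one code point per byte
  N := 0;
  I := 1;
  while I <= Length(S) do
  begin
    B := Ord(S[I]);
    if B < $80 then begin CP := B; Extra := 0; end
    else if (B and $E0) = $C0 then begin CP := B and $1F; Extra := 1; end
    else if (B and $F0) = $E0 then begin CP := B and $0F; Extra := 2; end
    else if (B and $F8) = $F0 then begin CP := B and $07; Extra := 3; end
    else begin CP := $FFFD; Extra := 0; end;
    Inc(I);
    // Fold in the continuation bytes (10xxxxxx), six bits each.
    while (Extra > 0) and (I <= Length(S))
      and ((Ord(S[I]) and $C0) = $80) do
    begin
      CP := (CP shl 6) or (Ord(S[I]) and $3F);
      Inc(I);
      Dec(Extra);
    end;
    Result[N] := CP;
    Inc(N);
  end;
  SetLength(Result, N);
end;

var
  CPs: TCodePoints;
  I: Integer;
begin
  // a, b, U+00E7 (two bytes: $C3 $A7), d, e
  CPs := DecodeUtf8('ab'#$C3#$A7'de');
  // The array is zero-based; keep two characters of margin on each side.
  for I := 2 to High(CPs) - 2 do
    if (CPs[I - 2] = Ord('a')) and (CPs[I - 1] = Ord('b'))
      and (CPs[I] = $E7)
      and (CPs[I + 1] = Ord('d')) and (CPs[I + 2] = Ord('e')) then
      WriteLn('match at character index ', I);
end.
```

The trade-off is extra memory for the decoded array, in exchange for neighbour access that does not rescan the string on every comparison.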
« Last Edit: March 30, 2010, 01:38:18 am by typo »

 
