How do I parse UTF8 strings?

typo:
For example, one can parse an ANSI string like this:


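--- Code: ---var
  i: Integer;
  s: string;
begin
  // Stop at Length(s) - 1 so that s[i + 1] never reads past the end of the string.
  for i := 1 to Length(s) - 1 do
  begin
    if (s[i] = 'a') and (s[i + 1] = 'b') then go; // go: placeholder for the action to take
  end;
end;
--- End code ---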

But UTF8 strings cannot be analyzed this way. If you use pointers, how do you access the previous and next character?

dfeher:
I think you can parse a UTF8 string the same way, except that you must either typecast 'a' and 'b' to WideChar, or decode the UTF8 string with the UTF8Decode function and then compare.
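
A minimal sketch of the decode-then-compare idea, assuming Free Pascal with {$H+} and a string that already holds UTF-8 bytes. HasPairDecoded is a hypothetical helper name, not an RTL function; UTF8Decode is the System unit routine that returns a UnicodeString. The test strings are spelled as raw bytes (#$C3#$A9 is "é") so the result does not depend on the source file encoding.

--- Code: ---program DecodeDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: True if A is directly followed by B in the UTF-8 string S.
function HasPairDecoded(const S: string; A, B: WideChar): Boolean;
var
  W: UnicodeString;
  I: Integer;
begin
  Result := False;
  W := UTF8Decode(S); // UTF-8 bytes -> UTF-16 code units
  // After decoding, W[I] and W[I + 1] are whole code units, so plain indexing works.
  for I := 1 to Length(W) - 1 do
    if (W[I] = A) and (W[I + 1] = B) then
      Exit(True);
end;

begin
  WriteLn(HasPairDecoded(#$C3#$A9 + 'ab', WideChar('a'), WideChar('b')));
  WriteLn(HasPairDecoded(#$C3#$A9 + 'b', WideChar($00E9), WideChar('b')));
end.
--- End code ---

One caveat: characters outside the Basic Multilingual Plane take two UTF-16 code units after decoding, so W[I] is not always a complete character either.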

Zoran:
You can write this:


--- Code: ---uses
  ..., LCLProc;

...

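// Stop one character early so that position I + 1 always exists.
for I := 1 to UTF8Length(S) - 1 do begin
  // UTF8Copy indexes by character, not by byte, so I + 1 is the next character.
  if (UTF8Copy(S, I, 1) = 'a') and (UTF8Copy(S, I + 1, 1) = 'b') then go;
end;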
--- End code ---

For manipulating UTF8 strings, you should use the procedures from LCLProc.
First, notice that Length(S) is replaced with UTF8Length(S).
For this purpose, the char S[I] cannot be used, because it is just one byte. UTF8Copy(S, I, 1), however, returns a UTF8String whose UTF8Length is 1, although it might contain more than one byte (which means that its Length might be more than 1).

See also the other analogous procedures and functions in the LCLProc unit, such as UTF8Insert and UTF8Delete.
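
To illustrate the difference between byte length and character count without depending on the LCL, here is a small sketch in plain Free Pascal. CountCodepoints is a hypothetical helper written for this example (it is not the LCLProc implementation of UTF8Length); it assumes well-formed UTF-8 and simply skips continuation bytes, which always have the bit pattern 10xxxxxx.

--- Code: ---program ByteVsCodepoint;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points by counting every byte
// that is not a UTF-8 continuation byte (10xxxxxx).
function CountCodepoints(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: string;
begin
  S := #$C3#$A9 + 'ab';        // "éab" spelled as raw UTF-8 bytes
  WriteLn(Length(S));          // 4: number of bytes
  WriteLn(CountCodepoints(S)); // 3: number of code points
end.
--- End code ---

Here S[1] is only the first half of "é", which is why indexing by byte does not give a character.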

theo:
You could use the utf8scanner:
http://wiki.lazarus.freepascal.org/Theodp

typo:

--- Quote ---
--- Code: ---if (UTF8Copy(S, I, 1) = 'a') and (UTF8Copy(S, I + 1, 1) = 'b') then go;
--- End code ---

--- End quote ---

But if I don't know the length of 'a', I cannot know where to search for 'b'.
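
For what it is worth, the byte length of a character does not have to be known in advance: in UTF-8 the first byte of each sequence encodes how many bytes the sequence has, so the start of the next character can be computed while walking the string. The following is only a sketch under that assumption, in plain Free Pascal. SeqLen and HasPair are hypothetical helper names, and malformed input is handled crudely by stepping a single byte.

--- Code: ---program Utf8Walk;
{$mode objfpc}{$H+}

// Hypothetical helper: byte length of the UTF-8 sequence that starts with lead byte B.
function SeqLen(B: Byte): Integer;
begin
  if B < $80 then Result := 1                  // 0xxxxxxx: ASCII
  else if (B and $E0) = $C0 then Result := 2   // 110xxxxx
  else if (B and $F0) = $E0 then Result := 3   // 1110xxxx
  else if (B and $F8) = $F0 then Result := 4   // 11110xxx
  else Result := 1; // continuation or invalid byte: step one byte to resynchronize
end;

// Hypothetical helper: True if character A is immediately followed by character B in S.
// S, A and B are all expected to hold UTF-8 bytes.
function HasPair(const S, A, B: string): Boolean;
var
  P, LenCur, LenNext: Integer;
begin
  Result := False;
  P := 1;
  while P <= Length(S) do
  begin
    LenCur := SeqLen(Ord(S[P]));
    if P + LenCur <= Length(S) then
    begin
      // The next character starts right after the current sequence.
      LenNext := SeqLen(Ord(S[P + LenCur]));
      if (Copy(S, P, LenCur) = A) and (Copy(S, P + LenCur, LenNext) = B) then
        Exit(True);
    end;
    Inc(P, LenCur);
  end;
end;

begin
  WriteLn(HasPair(#$C3#$A9 + 'ab', 'a', 'b'));      // TRUE:  "éab" contains "ab"
  WriteLn(HasPair(#$C3#$A9 + 'b', #$C3#$A9, 'b'));  // TRUE:  "éb" contains "é" then "b"
  WriteLn(HasPair('a' + #$C3#$A9 + 'b', 'a', 'b')); // FALSE: "aéb" has "é" in between
end.
--- End code ---

Because each step advances by the length of the current sequence, the whole scan is a single pass over the bytes. When both characters being searched for are plain ASCII, such as 'a' and 'b', a byte-wise comparison also works, since bytes below $80 never occur inside a multi-byte UTF-8 sequence.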
