Lazarus

Programming => Operating Systems => macOS / Mac OS X => Topic started by: JazzMan on September 28, 2010, 04:38:53 pm

Title: Why does tXMLDocument not read UTF-8 characters correct?
Post by: JazzMan on September 28, 2010, 04:38:53 pm
When I use tXMLDocument to parse a UTF-8 XML document, it does not read the extended characters correctly. Here is a simple example that illustrates my problem:

txt := tStringStream.Create('<?xml version="1.0" '+
  'encoding="UTF-8" ?>'#13#10'<example>aáäeéë</example>');
showmessage(Txt.DataString);
ReadXMLfile(XMLDocument,txt);
showmessage(XMLDocument.DocumentElement.TextContent);


The first ShowMessage shows aáäeéë correctly. The second ShowMessage shows: a??e??

I tried to add the BOM but it didn't change anything. Is this a bug or am I doing something wrong?

Best regards,
Hans
Title: Re: Why does tXMLDocument not read UTF-8 characters correct?
Post by: typo on September 28, 2010, 05:08:02 pm
Maybe you need to convert the string with UTF8ToAnsi, UTF8ToSys, etc.

See the XMLReader example in the examples directory.
Title: Re: Why does tXMLDocument not read UTF-8 characters correct?
Post by: theo on September 28, 2010, 05:47:42 pm
showmessage(UTF8Encode(XMLDocument.DocumentElement.TextContent));

Title: Re: Why does tXMLDocument not read UTF-8 characters correct?
Post by: eny on September 28, 2010, 06:59:39 pm
Quote
Re: Why does tXMLDocument not read UTF-8 characters correct?

Luckily it does read UTF-8 characters correctly.

Quote
The first ShowMessage shows aáäeéë correctly.

No it does not!

Quote
Is this a bug or am I doing something wrong?

The latter. Maybe this background info (http://www.joelonsoftware.com/articles/Unicode.html) is useful.

As for your code snippet: you're creating an invalidly encoded text stream.
You declare UTF-8 encoding, but you put non-UTF-8 characters in the string.

The correct way to do this:

Code: [Select]
txt := tStringStream.Create('<?xml version="1.0" encoding="UTF-8" ?>'#13#10'<example>' +
    AnsiToUtf8('aáäeéë') +
    '</example>');

Note the mandatory AnsiToUtf8(...) function call that encodes the accented characters into a valid UTF-8 byte stream!

Both ShowMessage calls now work as expected: the first one shows the UTF-8 encoded byte stream with the funny capital A's and stuff. The second shows the accented characters as expected.
Title: Re: Why does tXMLDocument not read UTF-8 characters correct?
Post by: theo on September 28, 2010, 07:13:02 pm
Quote
Note the mandatory AnsiToUtf8(...) function call that encodes the accented characters into a valid UTF-8 byte stream!

No, that's wrong Eny.
The text in the source editor is already UTF-8 by default.
So there is no need to convert here.

But XMLDocument.DocumentElement.TextContent returns a WideString (DOMString).
So we have to convert this to UTF-8, as in my example above (UTF8Encode).
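
For reference, the conversion described here can be put into a small standalone program. This is only a sketch under a few assumptions: the program name and variable names are made up for illustration, it prints to the console instead of calling ShowMessage so that it does not need the LCL, and the accented character is written as explicit UTF-8 byte escapes so that the result does not depend on how the source file is saved.

```pascal
program xmlutf8demo;  // hypothetical name, for illustration only

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, XMLRead;

var
  Src: TStringStream;
  Doc: TXMLDocument;
  Utf8Text: UTF8String;
begin
  // #$C3#$A1 is the UTF-8 byte sequence for 'á'. Using escapes keeps the
  // stream content independent of the encoding of this .pas file.
  Src := TStringStream.Create(
    '<?xml version="1.0" encoding="UTF-8"?>' +
    '<example>a'#$C3#$A1'</example>');
  try
    ReadXMLFile(Doc, Src);
    try
      // TextContent is a DOMString (UTF-16); convert it back to UTF-8
      // before handing it to code that expects UTF-8 strings.
      Utf8Text := UTF8Encode(Doc.DocumentElement.TextContent);
      WriteLn('UTF-16 code units: ', Length(Doc.DocumentElement.TextContent));
      WriteLn('UTF-8 bytes:       ', Length(Utf8Text));
    finally
      Doc.Free;
    end;
  finally
    Src.Free;
  end;
end.
```

In a GUI application, Utf8Text would be the value to pass to ShowMessage, which is what the one-line UTF8Encode example earlier in the thread does.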
Title: Re: Why does tXMLDocument not read UTF-8 characters correct?
Post by: eny on September 28, 2010, 07:57:07 pm
Quote
No, that's wrong Eny.

I stand corrected  :-[

I've done quite a lot of ANSI-to-XML text processing.
When reading external non-UTF-8 data, the ANSI-to-UTF-8 conversion is essential.
The IDE is even more sophisticated than I expected.
Title: Re: Why does tXMLDocument not read UTF-8 characters correct?
Post by: JazzMan on September 29, 2010, 10:36:42 am
Thanks, the UTF8Encode solution solved the problem. Still, it feels stupid that tXMLDocument converts all strings to UTF-16 on the Mac: the input string is UTF-8, the IDE uses UTF-8 and the OS uses UTF-8, so the expected output is also UTF-8.

The first ShowMessage does show aáäeéë correctly, but it depends on how the .pas file is saved...
If the .pas file is saved WITH BOM then the shown message is "a??e??"
but if it is saved WITHOUT BOM it shows the correct characters.

Is this intended behavior or is it a bug?
(The annoying thing is that the BOM is automatically added once in a while. I don't know when it happens, and I have to remove it manually to make it work.)
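
One way to find out which state a given .pas file is in is to look at its first three bytes: a UTF-8 BOM is the byte sequence EF BB BF. The following is a minimal sketch of such a check. The program name and the HasUtf8Bom helper are hypothetical and not part of Lazarus or FPC. It only reports whether the BOM is present and does not explain why the IDE adds it.

```pascal
program checkbom;  // hypothetical name, for illustration only

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Returns True if the file starts with the UTF-8 BOM (EF BB BF).
function HasUtf8Bom(const FileName: string): Boolean;
var
  F: TFileStream;
  Buf: array[0..2] of Byte;
begin
  Result := False;
  F := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
  try
    // Files shorter than three bytes cannot contain the BOM.
    if F.Read(Buf, SizeOf(Buf)) = SizeOf(Buf) then
      Result := (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF);
  finally
    F.Free;
  end;
end;

begin
  if ParamCount < 1 then
  begin
    WriteLn('usage: checkbom <file.pas>');
    Halt(1);
  end;
  if HasUtf8Bom(ParamStr(1)) then
    WriteLn('UTF-8 BOM present')
  else
    WriteLn('no UTF-8 BOM');
end.
```

Running it on the unit before and after the IDE has touched the file would show whether the "a??e??" output coincides with the BOM being present.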

BTW, I am on OS X 10.5.8 and the Lazarus revision is 27239M

Best regards,
Hans