Author Topic: Why does tXMLDocument not read UTF-8 characters correct?  (Read 14421 times)

JazzMan

  • New Member
  • *
  • Posts: 31
    • http://www.earmaster.com
Why does tXMLDocument not read UTF-8 characters correct?
« on: September 28, 2010, 04:38:53 pm »
When I use TXMLDocument to parse a UTF-8 XML document, it does not read the extended characters correctly. Here is a simple example that illustrates my problem:

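// Build an in-memory XML document that declares UTF-8 encoding.
txt := TStringStream.Create('<?xml version="1.0" '+
  'encoding="UTF-8" ?>'#13#10'<example>aáäeéë</example>');
ShowMessage(txt.DataString);
// Parse the stream, then show the root element's text.
ReadXMLFile(XMLDocument, txt);
ShowMessage(XMLDocument.DocumentElement.TextContent);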


The first ShowMessage shows aáäeéë correctly. The second ShowMessage shows: a??e??

I tried to add the BOM but it didn't change anything. Is this a bug or am I doing something wrong?

Best regards,
Hans

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: Why does tXMLDocument not read UTF-8 characters correct?
« Reply #1 on: September 28, 2010, 05:08:02 pm »
Maybe you need to convert the string with UTF8ToAnsi, UTF8ToSys, etc.

See the XMLReader example in the examples directory.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1890
Re: Why does tXMLDocument not read UTF-8 characters correct?
« Reply #2 on: September 28, 2010, 05:47:42 pm »
showmessage(UTF8Encode(XMLDocument.DocumentElement.TextContent));


eny

  • Hero Member
  • *****
  • Posts: 1587
Re: Why does tXMLDocument not read UTF-8 characters correct?
« Reply #3 on: September 28, 2010, 06:59:39 pm »
Quote
Re: Why does tXMLDocument not read UTF-8 characters correct?

Luckily it does read UTF-8 characters correctly.

Quote
The first ShowMessage shows aáäeéë correctly.

No it does not!

Quote
Is this a bug or am I doing something wrong?

The latter. Maybe this background info is useful.

As for your code snippet: you're creating an invalidly encoded text stream.
You say it has UTF-8 encoding but you put non-UTF-8 characters in the string.

The correct way to do this:

Code:
txt := tStringStream.Create('<?xml version="1.0" encoding="UTF-8" ?>'#13#10'<example>' +
    AnsiToUtf8('aáäeéë') +
    '</example>');      

Note the mandatory AnsiToUtf8(...) function call that encodes the accented characters into a valid UTF-8 byte stream!

Both ShowMessage calls now work as expected: the first one shows the UTF-8 encoded byte stream, with the funny capital A's and so on. The second shows the accented characters as expected.
« Last Edit: September 28, 2010, 07:04:25 pm by eny »
All posts based on: Win10 (Win64); Lazarus 1.8.0 'stable' (#56594 win64) unless specified otherwise...

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1890
Re: Why does tXMLDocument not read UTF-8 characters correct?
« Reply #4 on: September 28, 2010, 07:13:02 pm »
Quote
Note the mandatory AnsiToUtf8(...) function call that encodes the accented characters into a valid UTF-8 byte stream!

No, that's wrong Eny.
The text in the source editor is already UTF-8 by default.
So there is no need to convert here.

But XMLDocument.DocumentElement.TextContent returns a WideString (DOMString).
So we have to convert it to UTF-8 with UTF8Encode, as in my example above.
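
For reference, the whole round trip might look like the sketch below. It assumes Free Pascal with the DOM and XMLRead units from fcl-xml. The names Source, Doc and Utf8Text are only illustrative, and the accented characters are written as explicit UTF-8 byte escapes (an assumption made here so that the result does not depend on how the source file is saved).

```pascal
program XmlUtf8Demo;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, XMLRead;

var
  Source: TStringStream;   // illustrative name
  Doc: TXMLDocument;       // illustrative name
  Utf8Text: UTF8String;    // illustrative name
begin
  // The accented characters are given as raw UTF-8 bytes (assumption:
  // this avoids any dependency on the encoding of the .pas file).
  Source := TStringStream.Create(
    '<?xml version="1.0" encoding="UTF-8"?>'#13#10 +
    '<example>a'#$C3#$A1#$C3#$A4'e'#$C3#$A9#$C3#$AB'</example>');
  try
    ReadXMLFile(Doc, Source);
    try
      // TextContent is a DOMString (UTF-16); UTF8Encode turns it into
      // the UTF-8 string that LCL routines such as ShowMessage expect.
      Utf8Text := UTF8Encode(Doc.DocumentElement.TextContent);
      WriteLn(Utf8Text);
    finally
      Doc.Free;
    end;
  finally
    Source.Free;
  end;
end.
```

In a GUI application the WriteLn would be replaced by ShowMessage(Utf8Text); whether a console shows the accents correctly depends on the terminal's own encoding.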

eny

  • Hero Member
  • *****
  • Posts: 1587
Re: Why does tXMLDocument not read UTF-8 characters correct?
« Reply #5 on: September 28, 2010, 07:57:07 pm »
Quote
No, that's wrong Eny.

I stand corrected  :-[

I've done quite a lot of ANSI-to-XML text processing.
And when reading external non-UTF-8 data, the ANSI-to-UTF-8 conversion is essential.
The IDE is even more sophisticated than I expected.

JazzMan

  • New Member
  • *
  • Posts: 31
    • http://www.earmaster.com
Re: Why does tXMLDocument not read UTF-8 characters correct?
« Reply #6 on: September 29, 2010, 10:36:42 am »
Thanks, the UTF8Encode solution solved the problem, but it feels stupid that tXMLDocument converts all strings to UTF-16 on the Mac, because the input string is UTF-8, the IDE uses UTF-8 and the OS uses UTF-8 (and therefore the expected output is also UTF-8).

The first ShowMessage does show aáäeéë correctly, but it depends on how the .pas file is saved.
If the .pas file is saved WITH a BOM, the message shown is "a??e??",
but if it is saved WITHOUT a BOM it shows the correct characters.

Is this intended behavior or is it a bug?
(The annoying thing is that the BOM is automatically added once in a while. I don't know when it happens, and I have to remove it manually to make it work.)
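
One way to find out whether a given .pas file currently starts with a BOM is to look at its first three bytes. The following is only a minimal sketch: HasUtf8Bom is a hypothetical helper written for this post, not part of Lazarus or the FPC libraries, and it checks for the UTF-8 byte order mark EF BB BF only.

```pascal
program CheckBom;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: returns True when the file starts with the
// UTF-8 byte order mark (bytes EF BB BF).
function HasUtf8Bom(const FileName: string): Boolean;
var
  Stream: TFileStream;
  Header: array[0..2] of Byte;
begin
  Result := False;
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // Files shorter than three bytes cannot contain the BOM.
    if Stream.Read(Header, SizeOf(Header)) = SizeOf(Header) then
      Result := (Header[0] = $EF) and (Header[1] = $BB) and (Header[2] = $BF);
  finally
    Stream.Free;
  end;
end;

begin
  if ParamCount < 1 then
  begin
    WriteLn('usage: checkbom <file>');
    Halt(1);
  end;
  if HasUtf8Bom(ParamStr(1)) then
    WriteLn('UTF-8 BOM present')
  else
    WriteLn('no UTF-8 BOM');
end.
```

Running it on the unit before and after saving should show when the BOM reappears.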

BTW, I am on OS X 10.5.8 and the Lazarus revision is 27239M

Best regards,
Hans