Topic: [SOLVED] Can't display Unicode on Windows with strings read from Indy 10

MementoMojito

Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #15 on: May 27, 2016, 02:29:38 am »
Hi Remy,

Thank you very much for the in-depth explanations. I think I am starting to get it now.
Correct me if I am wrong, but basically Indy will convert any string it reads to the encoding defined with

Code: Pascal
  IOHandler.DefStringEncoding

and then convert it to UTF-16 no matter what. Then, if no encoding has been defined, the output string returned by ReadLn() will be OSDefault, or, if one has been defined via

Code: Pascal
  IOHandler.DefAnsiEncoding

then the given encoding?

So, to make it short: DefStringEncoding is the encoding I am expecting to read from the server, DefAnsiEncoding is the encoding returned by ReadLn(), and in between it will always be UTF-16 (as I am guessing that would be the only encoding that allows no loss)?

Quote
Do your UI controls expect UTF-8 encoded strings?  I am not familiar with how Lazarus works, but from your descriptions, it sounds like things are working fine when you force the socket output to be UTF-8.

To be fair, my UI doesn't expect anything, as I started coding it on Linux and everything has always been fine (including kanjis) without changing the default charset for any of the components. I guess that is because the default charset on Linux supports it, so it has been a tricky problem only on Windows. My code on Linux doesn't define either DefStringEncoding or DefAnsiEncoding (which I have just discovered tonight); it just works.
Now, regarding my kanji issue on Windows, I will definitely try using a different type for my variable.
« Last Edit: May 27, 2016, 02:34:11 am by MementoMojito »

Remy Lebeau

« Reply #16 on: May 27, 2016, 03:03:55 am »
Quote
Can you correct me if I am wrong but basically Indy will convert any strings it reads to the encoding defined with IOHandler.DefStringEncoding and then convert it to UTF-16 no matter what. Then if no encoding has been defined the output string passed by ReadLn() will be osDefault or if defined via IOHandler.DefAnsiEncoding then the given encoding?

When reading an incoming string, the bytes are decoded straight to UTF-16 using DefStringEncoding (unless overridden by the reading method's optional AByteEncoding parameter).  By default, DefStringEncoding is ASCII.  If no encoding is specified, the IdGlobal.GIdDefaultTextEncoding encoding is used (also ASCII by default).

If the string being returned is an AnsiString, the UTF-16 is converted to Ansi using DefAnsiEncoding (unless overridden by the reading method's optional ADestEncoding parameter).  By default, DefAnsiEncoding is OSDefault (user's locale on Windows, UTF-8 on other systems).  If no encoding is specified, the IdGlobal.GIdDefaultTextEncoding encoding is used.

The reverse is true with sending strings:

When sending an outgoing string, the characters are encoded from UTF-16 to bytes using DefStringEncoding (unless overridden by the sending method's optional AByteEncoding parameter).  If no encoding is specified, the IdGlobal.GIdDefaultTextEncoding encoding is used.

If the string being sent is an AnsiString, the Ansi data is converted to UTF-16 using DefAnsiEncoding (unless overridden by the sending method's optional ASrcEncoding parameter).  If no encoding is specified, the IdGlobal.GIdDefaultTextEncoding encoding is used.
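
As a rough illustration of where the defaults and the per-call overrides described above would go, here is a minimal, untested sketch. It assumes an Indy 10 build in which String is AnsiString (the usual case with FPC), because the ADestEncoding/ASrcEncoding parameters only exist there. The host, port and line terminator are placeholders, not values from this thread.

```pascal
program EncodingOverrides;
{$mode objfpc}{$H+}

uses
  IdGlobal, IdTCPClient;

var
  Client: TIdTCPClient;
  Line: string;
begin
  Client := TIdTCPClient.Create(nil);
  try
    Client.Host := 'example.invalid'; // placeholder host (assumption)
    Client.Port := 12345;             // placeholder port (assumption)
    Client.Connect;

    // Connection-wide defaults. The IOHandler is only assigned once
    // Connect has been called, so set them after connecting.
    Client.IOHandler.DefStringEncoding := IndyTextEncoding_UTF8; // bytes on the wire
    Client.IOHandler.DefAnsiEncoding := IndyTextEncoding_UTF8;   // AnsiString contents

    // Reads one line using the two defaults set above.
    Line := Client.IOHandler.ReadLn(#10);

    // Per-call override when reading: AByteEncoding, then ADestEncoding.
    Line := Client.IOHandler.ReadLn(#10, IndyTextEncoding_UTF8, IndyTextEncoding_UTF8);

    // Sending mirrors reading: AByteEncoding, then ASrcEncoding.
    Client.IOHandler.WriteLn(Line, IndyTextEncoding_UTF8, IndyTextEncoding_UTF8);
  finally
    Client.Free;
  end;
end.
```

In a build where String is UnicodeString, the third argument of the ReadLn and WriteLn calls would have to be dropped, since those parameters are not declared there.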

Quote
So to make it short: DefStringEncoding is the encoding I am expecting from the server, DefAnsiEncoding the encoding returned by ReadLn(), and in between it will always be UTF-16 (as I am guessing that would be the only encoding that would allow no loss)?

Basically, yes.

Quote
To be fair my UI doesn't expect anything as I started to code it on Linux and everything has always been fine (including kanjis) without changing the default charset for any of the components. I guess it would be because the default charset in Linux supports it. So it has been a bit of a tricky problem only on Windows.

Welcome to the wonderful world of legacy Ansi handling :'( That is exactly why the majority of the world has moved to UTFs.  If you recompile Indy with {$MODE DelphiUnicode} or {$MODESWITCH UnicodeStrings} enabled in IdCompilerDefines.inc, the String type becomes UnicodeString, and DefAnsiEncoding and the ASrcEncoding/ADestEncoding parameters disappear.  So things will probably start working OK (I haven't tested it, though).

Quote
My code on Linux doesn't define either DefStringEncoding or DefAnsiEncoding (that I have just discovered tonight), it just works.

Even though DefAnsiEncoding defaults to UTF-8 on Linux, I would expect data loss if you don't set DefStringEncoding to UTF-8 to match, since DefStringEncoding is always US-ASCII by default, so kanjis and such would get lost before being converted to Ansi.
« Last Edit: May 27, 2016, 03:06:35 am by Remy Lebeau »
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

JuhaManninen

« Reply #17 on: May 27, 2016, 01:24:45 pm »
Quote
Even though DefAnsiEncoding defaults to UTF-8 on Linux, I would expect data loss if you don't set DefStringEncoding to UTF-8 to match, since DefStringEncoding is always US-ASCII by default, so kanjis and such would get lost before being converted to Ansi.

Yes, use Unicode everywhere if possible. It sets us free from the horrors of system codepage encodings and the infamous question marks.
When Indy is configured to use UTF-8 as default encoding, your code should work without any explicit conversions. I am surprised there were so many problems.
All conversions between Unicode encodings (UTF-8 <> UTF-16) are lossless and thus safe.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

MementoMojito

« Reply #18 on: May 27, 2016, 02:42:10 pm »
Ok so good news from my side, it's finally solved.
I have finally got a chance to replace the following code:

Code: Pascal
  thCl.IOHandler.DefStringEncoding := IndyTextEncoding_UTF8;
  TrackName := thCl.IOHandler.ReadLn(#10);
  TrackName := ConvertEncoding(TrackName, GuessEncoding(TrackName), EncodingUTF8);

with the one suggested by Remy:

Code: Pascal
  thCl.IOHandler.DefStringEncoding := IndyTextEncoding_UTF8;
  thCl.IOHandler.DefAnsiEncoding := IndyTextEncoding_UTF8;
  TrackName := thCl.IOHandler.ReadLn(#10);

and my kanjis are back, so it was UTF-8 after all, I suppose. I guess the UTF-8 > UTF-16 > CP1252 > UTF-8 conversion was causing data loss, hence the question marks for the kanjis with the previous method.
Thank you so much to everyone for the help and all the detailed explanations. That was a bit painful, but it was still interesting to take a more in-depth look at how Indy handles encodings on different platforms.
And I will definitely keep the AnsiString/UTF-16 encoding problem in mind if I ever need to manipulate UTF-16.

N.B.: Changed the topic title to reflect the problem more accurately.
« Last Edit: May 27, 2016, 02:45:01 pm by MementoMojito »

Remy Lebeau

« Reply #19 on: May 27, 2016, 07:25:50 pm »
Quote
Now I guess the conversion UTF-8 > UTF-16 > CP1252 > UTF-8 was inducing data loss hence the question marks for the kanjis with the previous method.

Yes, because cp1252 only supports a small subset of Unicode characters, and that does not include kanjis (see the table here).
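
The lossy step can be reproduced without Indy at all. The sketch below is only a model of the conversion chain discussed here, using the FPC 3 RTL; the sample character (U+66F2) and the TCp1252String type name are choices made for this illustration.

```pascal
program LossyRoundTrip;
{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}cwstring,{$ENDIF} // code page conversion support on Unix
  SysUtils;

type
  // AnsiString statically tagged as Windows-1252 (name chosen for this demo).
  TCp1252String = type AnsiString(1252);

// Hex dump, so the result does not depend on the console's own encoding.
function HexOf(const S: RawByteString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
    Result := Result + IntToHex(Ord(S[I]), 2) + ' ';
end;

const
  // U+66F2, a kanji, as UTF-8 bytes.
  KanjiUtf8: array[0..2] of Byte = ($E6, $9B, $B2);
var
  Wire, Back: RawByteString;
  Wide: UnicodeString;
  Ansi: TCp1252String;
begin
  // Build the "received" bytes and tag them as UTF-8 without converting.
  SetLength(Wire, Length(KanjiUtf8));
  Move(KanjiUtf8[0], Wire[1], Length(KanjiUtf8));
  SetCodePage(Wire, CP_UTF8, False);

  Wide := UTF8Decode(Wire);                // UTF-8 -> UTF-16
  Ansi := TCp1252String(Wide);             // UTF-16 -> CP1252 (no kanji in CP1252)
  Back := UTF8Encode(UnicodeString(Ansi)); // CP1252 -> UTF-16 -> UTF-8

  WriteLn('received: ', HexOf(Wire));
  WriteLn('returned: ', HexOf(Back));
end.
```

If the conversion behaves as described in this thread, the second line should no longer contain the original three bytes but a substitute character, which is the question mark seen in the UI. Converting back to UTF-8 afterwards cannot restore what the CP1252 step dropped.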

Remy Lebeau

« Reply #20 on: May 27, 2016, 07:32:56 pm »
Quote
When Indy is configured to use UTF-8 as default encoding, your code should work without any explicit conversions. I am surprised there were so many problems.

There were problems because a Unicode->Ansi conversion was involved.

Quote
All conversions between Unicode encodings (UTF-8 <> UTF-16) are lossless and thus safe.

Yes, and in fact, if both DefStringEncoding and DefAnsiEncoding are set to the same encoding, there won't even be a UTF-16 conversion performed.  Since the received bytes are to be interpreted in the same encoding that the returned AnsiString is to be encoded as, Indy will simply copy the bytes as-is directly into the AnsiString.
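
To see why that shortcut is safe, the following sketch (plain FPC RTL, not Indy's actual code) compares a straight byte copy with a round trip through UTF-16 for a valid UTF-8 input. The sample bytes are made up for the illustration.

```pascal
program SameEncodingCopy;
{$mode objfpc}{$H+}

const
  // U+00E9 (C3 A9) followed by U+66F2 (E6 9B B2), as UTF-8 bytes.
  WireBytes: array[0..4] of Byte = ($C3, $A9, $E6, $9B, $B2);
var
  Copied, ViaUtf16: RawByteString;
begin
  // Shortcut: copy the received bytes and tag them as UTF-8, no conversion.
  SetLength(Copied, Length(WireBytes));
  Move(WireBytes[0], Copied[1], Length(WireBytes));
  SetCodePage(Copied, CP_UTF8, False);

  // Long way: decode to UTF-16, then encode back to UTF-8.
  ViaUtf16 := UTF8Encode(UTF8Decode(Copied));

  // Compare the raw bytes rather than relying on string comparison rules.
  if (Length(Copied) = Length(ViaUtf16)) and
     (CompareByte(Copied[1], ViaUtf16[1], Length(Copied)) = 0) then
    WriteLn('identical bytes: the UTF-16 step can be skipped')
  else
    WriteLn('bytes differ');
end.
```

This only holds when the incoming bytes really are valid data in the shared encoding; the copy does no validation, so malformed input would pass through unchanged.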