Author Topic: [SOLVED] Can't display Unicode on Windows with strings read from Indy 10

MementoMojito

  • Jr. Member
  • **
  • Posts: 63
Hi,

I'm having a really annoying issue when passing strings to a TLabel or TListBox: it displays a question mark rather than the accented character.
I have tried UTF8Encode(), AnsiToUTF8() and UTF8ToAnsi(), but nothing works.
I have also changed the Font Charset to Unicode just in case, but no joy.
It works perfectly fine on Linux Jessie with KDE; it's only on Windows.
I have attached a screenshot that shows the issue, in case anyone knows how to fix it. I am certain it's something stupid that I have missed.

« Last Edit: May 27, 2016, 02:44:06 pm by MementoMojito »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #1 on: May 20, 2016, 10:59:17 pm »
You don't need explicit string conversion function calls any more with Lazarus 1.6 / FPC 3.0.
You only need to take action if your data source (file, DB ...) has an encoding different from UTF-8, as described here:
 http://wiki.lazarus.freepascal.org/Better_Unicode_Support_in_Lazarus#Reading_.2F_writing_text_file_with_Windows_codepage

In other situations the dynamic encoding of strings takes care of conversions automatically.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

MementoMojito

Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #2 on: May 20, 2016, 11:20:16 pm »
It could be that the string is read from a socket with Indy 10:

TrackName :=thCl.IOHandler.Readln(#10);


TrackName being the var passed to my label caption.
However I would assume that the implicit conversion would still happen right?
Plus the source here (the server) is the socket interface of VLC running on Raspbian, I am fairly sure the strings sent are encoded in Unicode...

JuhaManninen

Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #3 on: May 20, 2016, 11:37:56 pm »
If the client code works on Linux but not on Windows, then the problem is in client code and not in server code.
Maybe Indy does something funny and converts data to system codepage.

balazsszekely

  • Guest
Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #4 on: May 21, 2016, 05:34:13 am »
Choose a default encoding on both sides:
Code: Pascal
thCl.IOHandler.DefStringEncoding := IndyTextEncoding_UTF8; // for example
TrackName := thCl.IOHandler.ReadLn(#10);

IndyTextEncoding_UTF8 is declared in unit IdGlobal.

MementoMojito

Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #5 on: May 21, 2016, 12:12:56 pm »
Thank you both for the advice.
It looks like it could come from Indy, but I still can't find a way to get it decoded properly.
By using IndyTextEncoding_UTF8 I get something different, but still not what I would expect (I have attached the screenshot).
I have tried _ASCII and _OSDEFAULT, but I get the same result as shown in my first post.
In case it helps, I have attached the hex dump from Wireshark showing a track name being sent. I have underlined the bytes of the only Unicode char in this track.

balazsszekely

Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #6 on: May 21, 2016, 05:38:37 pm »
@MementoMojito
I just tested this and it works fine:

1. Server (OnExecute)
Code: Pascal
AContext.Connection.IOHandler.WriteLn('Some text', IndyTextEncoding_UTF8);
// or
AContext.Connection.IOHandler.ReadLn(#10, IndyTextEncoding_UTF8);

2. Client
Code: Pascal
IdTCPClient1.IOHandler.WriteLn('Some text', IndyTextEncoding_UTF8);
// or
IdTCPClient1.IOHandler.ReadLn(#10, IndyTextEncoding_UTF8);

You can also set the default encoding. If it still does not work, please attach a small test case (client/server) so we can run a few tests.

MementoMojito

Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #7 on: May 21, 2016, 08:48:41 pm »
@GetMem
Thanks a lot for the suggestions.
Unfortunately I don't have the hand on the server as it's VCL socket interface. However it works just fine on Linux with the exact same code.
I have tried yesterday to use ReadLn(#10, IndyTextEncoding_UTF8) but it's exactly the same as defining an encoding for the IOHandler via DefStringEncoding :/
So at this stage I guess the two hypotheses we have are that Indy doesn't behave the same on Windows or Windows is a bit more fiddly to handle Unicode? Both can be true and related tho :)

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1312
    • Lebeau Software
Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #8 on: May 25, 2016, 07:48:02 pm »
Quote
Unfortunately I don't have the hand on the server as it's VCL socket interface. However it works just fine on Linux with the exact same code.

The screenshots provided are not of the same data, so it is difficult to see why the '?' characters are appearing.  Please make sure the screenshots actually match each other.

From what you have described, maybe the data is NOT actually UTF-8.  Since it is coming from a VCL-based server, it is likely that the data may actually be ANSI instead, using whatever the server's local charset is set to, such as Windows-1252.  The VCL's native socket components are not Unicode-enabled.  On the other hand, maybe the VCL server is using other socket components, like Indy, but is still not coded to handle Unicode string encodings correctly.  Hard to say without knowing anything about the server.  But either way, if that turns out to be the case, Indy does provide CharsetToEncoding(Charset), IndyTextEncoding(Charset), and IndyTextEncoding(Codepage) functions to retrieve IIdTextEncoding interfaces for various charsets.
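One way to check the ANSI-versus-UTF-8 question from a capture is to compare byte patterns. A minimal sketch in Python (the language used for the decoding snippet later in this thread), assuming an accented character such as 'é' (U+00E9) as the sample; `looks_like_utf8` is a made-up helper name, not part of any library:

```python
# Compare how one accented character is encoded in UTF-8 and in Windows-1252.
SAMPLE = "\u00e9"  # 'é', chosen only as an example

utf8_bytes = SAMPLE.encode("utf-8")      # two bytes: c3 a9
cp1252_bytes = SAMPLE.encode("cp1252")   # one byte: e9


def looks_like_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes.hex(), looks_like_utf8(utf8_bytes))
print(cp1252_bytes.hex(), looks_like_utf8(cp1252_bytes))
```

If the bytes underlined in a hex dump form a valid multi-byte UTF-8 sequence, the ANSI hypothesis becomes much less likely; a lone high byte points the other way.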

Quote
I have tried yesterday to use ReadLn(#10, IndyTextEncoding_UTF8) but it's exactly the same as defining an encoding for the IOHandler via DefStringEncoding :/

Yes, it is.

Quote
Indy doesn't behave the same on Windows

For the most part, it does.  What is different, though, is the underlying library used to handle byte<->char conversions.  On Windows, Microsoft's MultiByteToWideChar() and WideCharToMultiByte() functions are used.  On other platforms, the ICONV library is used instead.  Both support UTF-8, though.

Quote
or Windows is a bit more fiddly to handle Unicode?

Not in this case, since UTF-8 and UTF-16 are standardized encodings that are well supported on just about all platforms.  But that is assuming the data is actually UTF-8 to begin with, and I'm assuming it is NOT unless I see differently.
« Last Edit: May 25, 2016, 07:59:36 pm by Remy Lebeau »
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

MementoMojito

Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #9 on: May 26, 2016, 12:46:29 am »
Hi Remy,

Thanks a lot for your help and your work with Indy :)
I really do believe it's utf-8.
This time I have attached all the matching screencaps, sorry for posting the wrong ones initially.
They show what I get on Linux, what I get on Windows and as well the hexdump from Wireshark with the 3 bytes of the char that should be displayed. The char is '–'.
Please see below a really short snippet in Python that decodes these 3 bytes using utf-8 (other encodings fail, including ASCII, UTF-16 and UTF-7):

Code: Python
>>> s = b'\xe2\x80\x93'
>>> print(s.decode('utf-8'))
–

For the Windows one, you will notice that in one case it displays the diamond rather than the question mark; that's because I am using IndyTextEncoding_UTF8. However, if I pass the value directly to the label:

Code: Pascal
myLabel.caption := #$e2#$80#$93;

it can display it, so it's not a missing font.

Also I would like to say again that VLC is running on a Raspberry Pi on Raspbian Jessie, so it wouldn't use any Windows charset or the like.
« Last Edit: May 26, 2016, 12:53:08 am by MementoMojito »

Remy Lebeau

Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #10 on: May 26, 2016, 01:59:45 am »
Quote
I really do believe it's utf-8.
This time I have attached all the matching screencaps, sorry for posting the wrong ones initially.
They show what I get on Linux, what I get on Windows and as well the hexdump from Wireshark with the 3 bytes of the char that should be displayed. The char is '–'.

Yes, the character in question is encoded using UTF-8 on the socket transmission.  The fact that it takes three bytes to encode in UTF-8, and the Windows output shows three '?' characters, proves that the Windows app is NOT decoding the bytes as UTF-8, but rather is decoding them using some other single-byte encoding instead.  But that would NOT be the case if you are asking Indy to use IndyTextEncoding_UTF8 for reading from the socket.  So something else has to be happening.

What I find interesting is that your Windows screenshot shows two different representations of the same string.  The upper string shows three '?' characters, which would be consistent with the original UTF-8 bytes being decoded as a single-byte encoding instead of as UTF-8.  The lower string, however, only shows one '?' character, which implies that the original UTF-8 bytes were decoded as UTF-8 to produce a single Unicode character, and then the string was converted to another charset for display in the UI, where the charset does not support that Unicode character.
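The counting argument can be reproduced in a few lines. A minimal sketch in Python, using the three bytes from the Wireshark dump; the ASCII codec with error replacement is only a stand-in for "some other single-byte decoding" and for "a charset that lacks the character", not a claim about what the Windows app actually did:

```python
RAW = b"\xe2\x80\x93"  # the three bytes seen on the wire

# Case 1: the bytes are decoded with a single-byte codec that rejects them.
# Every byte becomes its own replacement character, so three marks appear.
as_single_byte = RAW.decode("ascii", errors="replace")

# Case 2: the bytes are decoded correctly as UTF-8 (one character), and the
# result is then squeezed into a charset that lacks that character.
as_utf8 = RAW.decode("utf-8")
squeezed = as_utf8.encode("ascii", errors="replace")

print(len(as_single_byte))  # number of characters in case 1
print(len(as_utf8))         # number of characters in case 2
print(squeezed)             # what survives the lossy conversion
```

Three marks therefore suggest a wrong decode of the socket bytes, while a single mark suggests a correct decode followed by a lossy or mismatched conversion later on.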

Quote
For the Windows one, you will notice that in one case it displays the diamond rather than the question mark; that's because I am using IndyTextEncoding_UTF8. However, if I pass the value directly to the label:

Code: Pascal
myLabel.caption := #$e2#$80#$93;

it can display it, so it's not a missing font.

The only way that would work in FreePascal is if the source file itself is in UTF-8 mode via {$codepage utf8} or -FcUTF8 (see Better Unicode Support in Lazarus).

In Delphi, that code would not work as-is, even when the source file is saved as UTF-8.  The compiler would parse the byte values as three separate Unicode characters (U+00E2, U+0080, U+0093), not as one 3-byte UTF-8 encoded character (U+2013).  You would need a type-cast and a runtime conversion, eg:

Code: Pascal
myLabel.caption := string(UTF8String(PAnsiChar(#$e2#$80#$93)));

Or:

Code: Pascal
var
  utf8: TBytes; // TEncoding.GetString() expects a TBytes (declared in SysUtils)
begin
  SetLength(utf8, 3);
  utf8[0] := $e2;
  utf8[1] := $80;
  utf8[2] := $93;
  myLabel.caption := TEncoding.UTF8.GetString(utf8);
  // or IndyTextEncoding_UTF8.GetString()...
end;

Or:

Code: Pascal
var
  utf8: UTF8String;
begin
  SetLength(utf8, 3);
  utf8[1] := AnsiChar($e2);
  utf8[2] := AnsiChar($80);
  utf8[3] := AnsiChar($93);
  myLabel.caption := string(utf8);
end;
« Last Edit: May 26, 2016, 02:07:28 am by Remy Lebeau »

MementoMojito

Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #11 on: May 26, 2016, 01:24:21 pm »
Quote
What I find interesting is that your Windows screenshot shows two different representations of the same string.  The upper string shows three '?' characters, which would be consistent with the original UTF-8 bytes being decoded as a single-byte encoding instead of as UTF-8.  The lower string, however, only shows one '?' character, which implies that the original UTF-8 bytes were decoded as UTF-8 to produce a single Unicode character, and then the string was converted to another charset for display in the UI, where the charset does not support that Unicode character.

It does indeed look like Indy converts the 3 bytes to a single char, but I am not sure the charset explanation totally makes sense here. If it was a charset problem, why am I able to display the char with the exact same label?
What if instead the Unicode code returned by IndyTextEncoding_UTF8 is wrong and that's why it can't find a corresponding char?

Quote
The only way that would work in FreePascal is if the source file itself is in UTF-8 mode via {$codepage utf8} or -FcUTF8 (see Better Unicode Support in Lazarus).

I suppose that's the default behavior with Laz 1.6/fpc 3, as I didn't include any directive in the source. Could that invalidate my conclusion that the label, with the charset it uses, is capable of displaying Unicode chars?

Again, thanks a lot for your interest and help. If you have any idea what I could test please let me know, I really need to find a way to get it working.

balazsszekely

Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #12 on: May 26, 2016, 01:43:41 pm »
You can try something like this:
Code: Pascal
uses LConvEncoding;
  //...
TrackName := thCl.IOHandler.ReadLn(#10);
ShowMessage(GuessEncoding(TrackName));

Do this with and without IndyTextEncoding_UTF8. What are the results?
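For readers wondering what such a guess can look like, here is a deliberately simplified sketch in Python. It is not how GuessEncoding() in LConvEncoding is implemented; the function name `guess_encoding` and the cp1252 fallback are assumptions made purely for illustration:

```python
def guess_encoding(raw: bytes, fallback: str = "cp1252") -> str:
    """Very rough guess at the encoding of raw (illustrative only)."""
    if all(b < 0x80 for b in raw):
        return "ascii"          # pure 7-bit data, which is also valid UTF-8
    try:
        raw.decode("utf-8")
        return "utf8"           # the high bytes form valid UTF-8 sequences
    except UnicodeDecodeError:
        return fallback         # assume the local single-byte codepage
```

A string that has already been reduced to '?' characters looks like plain ASCII to this kind of heuristic, so the answer mostly tells you what the bytes look like now, not what was sent.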

MementoMojito

Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #13 on: May 27, 2016, 12:55:39 am »
Thank you so much all, and especially GetMem for the last suggestion. I am finally getting somewhere.
I can now display the Unicode characters by doing as follows:

Code: Pascal
thCl.IOHandler.defStringEncoding := IndyTextEncoding_UTF8;
TrackName := thCl.IOHandler.readLn(#10);
TrackName := ConvertEncoding(TrackName, GuessEncoding(TrackName), EncodingUTF8);

But now for the weird things I really don't understand.

1) If I remove
Code: Pascal
thCl.IOHandler.defStringEncoding := IndyTextEncoding_UTF8;
the 3 question marks are back, so what exactly does ConvertEncoding() do, given that I explicitly ask it to convert to utf-8? I would expect a single char, even if not recognized.
Plus, without it GuessEncoding() returns utf-8?

2) With
Code: Pascal
thCl.IOHandler.defStringEncoding := IndyTextEncoding_UTF8;
back, GuessEncoding() returns cp1252 aka Latin 1. I suspect it's osDefault and that Indy might be at fault here? But why does it need to be decoded as Latin 1 first to then be decoded properly as utf-8 using ConvertEncoding()?

It's really confusing.

And now it looks like I might need to convert from UTF-8 to UTF-16 in some cases, as it still doesn't display kanjis. I get question marks after converting them to UTF-8 (one per kanji, instead of one question mark per byte before the conversion). I have tried UTF8ToUTF16(), but it tells me it can't find it, and I couldn't find which unit I should use.


« Last Edit: May 27, 2016, 01:04:28 am by MementoMojito »

Remy Lebeau

Re: Can't display Unicode on Win 8 (Laz 1.6/fpc 3)
« Reply #14 on: May 27, 2016, 01:21:22 am »
Quote
It does indeed look like Indy converts the 3 bytes to a single char

That would imply that Indy is indeed decoding the bytes as UTF-8.

Quote
but I am not sure the charset explanation totally makes sense here. If it was a charset problem, why am I able to display the char with the exact same label?

If you hard-code the char in your source code, it would make sense if the compiler is encoding the char in the same charset that the label is then expecting, but Indy is decoding the same char to a different encoding.

Do your UI controls expect UTF-8 encoded strings?  I am not familiar with how Lazarus works, but from your descriptions, it sounds like things are working fine when you force the socket output to be UTF-8.

Quote
What if instead the Unicode code returned by IndyTextEncoding_UTF8 is wrong and that's why it can't find a corresponding char?

It is highly unlikely that the decoded data is wrong, given that both Windows and ICONV have proper support for UTF-8 encoding/decoding.

On the other hand, IIdTextEncoding always returns UTF-16 when decoding bytes to characters, and UTF-16 is not FreePascal's native string encoding by default (it is Delphi's native string encoding since 2009).  Even though FreePascal does have a Delphi-like UnicodeString type available, its String type does not map to UnicodeString unless you are compiling with either {$MODE DelphiUnicode} or {$MODESWITCH UnicodeStrings}.  If Indy is not compiled with one of those modes enabled (and it does not enable either one yet, see the comments about that in IdCompilerDefines.inc), then the String type maps to AnsiString, and thus ReadLn() has to perform a data conversion when it is ready to return the decoded UTF-16 data as an AnsiString.

In that situation, the IOHandler has an additional DefAnsiEncoding property, and ReadLn() has an additional ADestEncoding parameter, to specify the charset that the AnsiString should be encoded as.  By default, on Windows that charset is the current OS locale (IndyTextEncoding_OSDefault), whatever the user happens to be using. On Linux, it is UTF-8 instead.

So, that could be accounting for some of the issues you are seeing on Windows but not on Linux.
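To make the two-step conversion concrete, here is a rough model in Python. It is not Indy code: `read_line_model` is a hypothetical stand-in for ReadLn() when String is AnsiString, with one parameter mirroring DefStringEncoding and the other mirroring DefAnsiEncoding, and cp1252 is assumed as the Windows locale:

```python
def read_line_model(wire: bytes, string_enc: str, ansi_enc: str) -> bytes:
    """Model of: socket bytes -> Unicode text -> bytes of the returned AnsiString."""
    text = wire.decode(string_enc, errors="replace")   # step 1: decode the socket bytes
    return text.encode(ansi_enc, errors="replace")     # step 2: re-encode for AnsiString


wire = "caf\u00e9".encode("utf-8")  # 'café' as sent by a UTF-8 server

linux_result = read_line_model(wire, "utf-8", "utf-8")     # Ansi default assumed: UTF-8
windows_result = read_line_model(wire, "utf-8", "cp1252")  # Ansi default assumed: OS locale

print(linux_result)    # still the original UTF-8 bytes
print(windows_result)  # cp1252 bytes, no longer valid UTF-8
```

In this model the same input and the same decoding setting give different byte strings on the two platforms, purely because of the second step.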

Quote
I can now display the Unicode characters by doing as follows:

Code: Pascal
thCl.IOHandler.defStringEncoding := IndyTextEncoding_UTF8;
TrackName := thCl.IOHandler.readLn(#10);
TrackName := ConvertEncoding(TrackName, GuessEncoding(TrackName), EncodingUTF8);

This is basically forcing the String to be UTF-8 encoded if it is not already, which only makes sense to do if the native String type is AnsiString and not UnicodeString.  In which case, you can account for that without resorting to ConvertEncoding():

Code: Pascal
thCl.IOHandler.DefStringEncoding := IndyTextEncoding_UTF8;
thCl.IOHandler.DefAnsiEncoding := IndyTextEncoding_UTF8; // <-- add this
TrackName := thCl.IOHandler.readLn(#10);

When String is AnsiString, this tells ReadLn() to decode the received bytes as UTF-8 and then return the decoded characters as UTF-8.

Quote
1) If I remove
Code: Pascal
thCl.IOHandler.defStringEncoding := IndyTextEncoding_UTF8;
the 3 question marks are back, so what exactly does ConvertEncoding() do, given that I explicitly ask it to convert to utf-8? I would expect a single char, even if not recognized.
Plus, without it GuessEncoding() returns utf-8?

The DefStringEncoding property is set to US-ASCII by default, which would account for the 3 '?' characters since any byte >= $80 will get decoded as Unicode codepoint U+FFFD, which would become '?' when converted to Ansi.  When such an AnsiString is passed to GuessEncoding(), it would only see ASCII characters, and thus would report ASCII or maybe UTF-8 (since ASCII is a subset of UTF-8).
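That default path can be modelled in a few lines of Python. This is only an illustration, not Indy code: Python's ASCII codec with replacement stands in for the US-ASCII default, cp1252 is assumed as the Ansi charset, and the sample title is made up:

```python
wire = b"A \xe2\x80\x93 B"  # a made-up title containing one UTF-8 en dash

# Decode as US-ASCII: each byte >= 0x80 becomes U+FFFD.
decoded = wire.decode("ascii", errors="replace")

# Convert to an Ansi charset: U+FFFD has no mapping there and becomes '?'.
ansi = decoded.encode("cp1252", errors="replace")

# The result holds only 7-bit bytes, so an encoding guesser has nothing
# to distinguish it from plain ASCII (which is also valid UTF-8).
only_ascii = all(b < 0x80 for b in ansi)

print(ansi, only_ascii)
```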

Quote
2) With
Code: Pascal
thCl.IOHandler.defStringEncoding := IndyTextEncoding_UTF8;
back, GuessEncoding() returns cp1252 aka Latin 1. I suspect it's osDefault and that Indy might be at fault here? But why does it need to be decoded as Latin 1 first to then be decoded properly as utf-8 using ConvertEncoding()?

That makes perfect sense when the String type is AnsiString.  On Windows, the DefAnsiEncoding property is the user's current locale by default (in this case, cp1252), so that would be used for the conversion from UTF-16 to Ansi when ReadLn() exits.  So basically, Indy is doing a UTF8 -> UTF16 -> cp1252 conversion, and then you are doing a cp1252 -> UTF8 conversion on top of that.  The only way that conversion would be loss-less is if the original transmitted string is using Unicode characters that cp1252 supports, otherwise you will end up with '?' characters.  U+2013 is one of the characters cp1252 does support (as the single byte $96), which is why your ConvertEncoding() call can recover it.  Without that call, the lone $96 byte is not valid UTF-8, which would account for the 1 '?' that you see if the label expects UTF-8, versus the 3 '?' when the bytes are decoded as ASCII instead of UTF-8.
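The loss-less condition can be checked with a short Python sketch. It only models the chain of conversions described here (UTF-8 -> Unicode -> cp1252 -> UTF-8); `round_trip` is a hypothetical helper and the sample strings are made up:

```python
def round_trip(wire: bytes, ansi_enc: str = "cp1252") -> bytes:
    """UTF-8 bytes -> text -> Ansi bytes -> text -> UTF-8 bytes."""
    text = wire.decode("utf-8")
    ansi = text.encode(ansi_enc, errors="replace")   # the potentially lossy step
    return ansi.decode(ansi_enc).encode("utf-8")


accented = "\u00e9t\u00e9".encode("utf-8")   # 'été': every character exists in cp1252
kanji = "\u6f22\u5b57".encode("utf-8")       # two kanji: neither exists in cp1252

print(round_trip(accented) == accented)  # does the text survive the detour?
print(round_trip(kanji))                 # what is left of the kanji
```

Anything outside the intermediate charset is already gone by the time the second conversion runs, so no later conversion can bring it back.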

Quote
And now it looks like I might need to convert from UTF-8 to UTF-16 in some cases, as it still doesn't display kanjis.

Don't try storing UTF-16 in an AnsiString.  Convert a UTF-8 string to a UTF16String, WideString, or UnicodeString instead.

If you needed an AnsiString with kanjis in it, the AnsiString would need to be encoded using an Ansi charset that supports kanjis, for instance Shift-JIS (cp932 in Windows).  In Indy, you can use CharsetToEncoding('shift-jis') or IndyTextEncoding(932) to obtain an IIdTextEncoding for that charset.
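As an illustration of the charset point, here is the same idea in Python rather than Indy, using Python's own codec names `shift_jis` and `cp1252`; the two sample kanji are arbitrary:

```python
kanji = "\u6f22\u5b57"  # two arbitrary kanji characters

sjis = kanji.encode("shift_jis")                  # two bytes per kanji
latin = kanji.encode("cp1252", errors="replace")  # cp1252 has no kanji at all

print(len(sjis), sjis.decode("shift_jis") == kanji)
print(latin)
```

A multi-byte Ansi charset can carry the kanji through intact, whereas a Western single-byte charset can only substitute a '?' for each one.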

Quote
I have tried UTF8ToUTF16(), but it tells me it can't find it, and I couldn't find which unit I should use.

http://lazarus-ccr.sourceforge.net/docs/lcl/lclproc/utf8toutf16.html
« Last Edit: May 27, 2016, 02:02:18 am by Remy Lebeau »

 
