It looks like Indy does indeed convert the 3 bytes into a single char
That would imply that Indy is indeed decoding the bytes as UTF-8.
but I am not sure the charset explanation totally makes sense here. If it were a charset problem, why am I able to display the char with the exact same label?
If you hard-code the char in your source code, it would make sense that it displays correctly if the compiler encodes the char in the same charset that the label expects, while Indy is decoding the same char to a different encoding.
Do your UI controls expect UTF-8 encoded strings? I am not familiar with how Lazarus works, but from your descriptions, it sounds like things are working fine when you force the socket output to be UTF-8.
What if, instead, the Unicode codepoint returned by IndyTextEncoding_UTF8 is wrong, and that's why it can't find a corresponding char?
It is highly unlikely that the decoded data is wrong, given that both Windows and ICONV have proper support for UTF-8 encoding/decoding.
On the other hand, IIdTextEncoding always returns UTF-16 when decoding bytes to characters, and UTF-16 is not FreePascal's native string encoding by default (it is Delphi's native string encoding since 2009). Even though FreePascal does have a Delphi-like UnicodeString type available, its String type does not map to UnicodeString unless you are compiling with either {$MODE DelphiUnicode} or {$MODESWITCH UnicodeStrings}. If Indy is not compiled with one of those modes enabled (and it does not enable either one yet, see the comments about that in IdCompilerDefines.inc), then the String type maps to AnsiString, and thus ReadLn() has to perform a data conversion when it is ready to return the decoded UTF-16 data as an AnsiString.
In that situation, the IOHandler has an additional DefAnsiEncoding property, and ReadLn() has an additional ADestEncoding parameter, to specify the charset that the AnsiString should be encoded as. By default, on Windows that charset is the current OS locale (IndyTextEncoding_OSDefault), whatever the user happens to be using. On Linux, it is UTF-8 instead.
So, that could account for some of the issues you are seeing on Windows but not on Linux.
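For illustration, here is a minimal sketch of the per-call form (assuming String maps to AnsiString, so Indy's STRING_IS_ANSI define is active and ReadLn()'s ADestEncoding parameter is available; thCl is the same client object used below):

// AByteEncoding = charset of the received bytes, ADestEncoding = charset of the returned AnsiString
TrackName := thCl.IOHandler.ReadLn(#10, IndyTextEncoding_UTF8, IndyTextEncoding_UTF8);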
I can now display the Unicode characters by doing the following:
thCl.IOHandler.defStringEncoding := IndyTextEncoding_UTF8;
TrackName := thCl.IOHandler.readLn(#10);
TrackName := ConvertEncoding(TrackName,GuessEncoding(TrackName),EncodingUTF8);
This is basically forcing the String to be UTF-8 encoded if it is not already, which only makes sense to do if the native String type is AnsiString and not UnicodeString, in which case you can account for that without resorting to ConvertEncoding():
thCl.IOHandler.DefStringEncoding := IndyTextEncoding_UTF8;
thCl.IOHandler.DefAnsiEncoding := IndyTextEncoding_UTF8; // <-- add this
TrackName := thCl.IOHandler.readLn(#10);
When String is AnsiString, this tells ReadLn() to decode the received bytes as UTF-8 and then return the decoded characters as UTF-8.
1) If I remove thCl.IOHandler.defStringEncoding := IndyTextEncoding_UTF8;
the 3 question marks are back, so what exactly does ConvertEncoding() do, since I explicitly ask it to convert to UTF-8? I would expect a single char, even if not recognized.
Plus, without it, GuessEncoding() returns UTF-8?
The DefStringEncoding property is set to US-ASCII by default, which would account for the 3 '?' characters since any byte >= $80 will get decoded as Unicode codepoint U+FFFD, which would become '?' when converted to Ansi. When such an AnsiString is passed to GuessEncoding(), it would only see ASCII characters, and thus would report ASCII or maybe UTF-8 (since ASCII is a subset of UTF-8).
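To make that concrete, here is a rough sketch (using U+2013 as an example character, whose UTF-8 form is the 3 bytes $E2 $80 $93; IndyTextEncoding_ASCII, IndyTextEncoding_UTF8, and TIdBytes come from Indy's IdGlobal unit, and the exact placeholder produced for unmappable bytes may vary by platform):

var
  LBytes: TIdBytes;
begin
  SetLength(LBytes, 3);
  LBytes[0] := $E2; LBytes[1] := $80; LBytes[2] := $93; // UTF-8 bytes for U+2013
  // decoded as US-ASCII: each byte >= $80 is unmappable, producing 3 placeholder chars
  ShowMessage(IndyTextEncoding_ASCII.GetString(LBytes));
  // decoded as UTF-8: the same 3 bytes collapse into a single character
  ShowMessage(IndyTextEncoding_UTF8.GetString(LBytes));
end;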
2) With thCl.IOHandler.defStringEncoding := IndyTextEncoding_UTF8;
back, GuessEncoding() returns cp1252, aka Latin-1. I suspect it's OSDefault and that Indy might be at fault here? But why does it need to be decoded as Latin-1 first in order to then be decoded properly as UTF-8 using ConvertEncoding()?
That makes perfect sense when the String type is AnsiString. On Windows, the DefAnsiEncoding property is the user's current locale by default (in this case, cp1252), so that would be used for the conversion from UTF-16 to Ansi when ReadLn() exits. So basically, Indy is doing a UTF8 -> UTF16 -> cp1252 conversion, and then you are doing a cp1252 -> UTF8 conversion on top of that. The only way that conversion chain can be lossless is if the original transmitted string uses only Unicode characters that cp1252 supports; otherwise you will end up with '?' characters. This would account for the 1 '?' that you see (since cp1252 does not support U+2013), versus the 3 '?' when the bytes are decoded as ASCII instead of UTF-8.
And now it looks like I might need to convert from UTF-8 to UTF-16 in some cases, as it still doesn't display kanjis.
Don't try storing UTF-16 in an AnsiString. Convert the UTF-8 string to a UTF-16 string type, such as WideString or UnicodeString, instead.
If you needed an AnsiString with kanjis in it, the AnsiString would need to be encoded using an Ansi charset that supports kanjis, for instance Shift-JIS (cp932 on Windows). In Indy, you can use CharsetToEncoding('shift-jis') or IndyTextEncoding(932) to obtain an IIdTextEncoding for that charset.
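A rough sketch of that approach (using the same thCl object as above; CharsetToEncoding() is in Indy's IdGlobalProtocols unit and IndyTextEncoding() is in IdGlobal; only worthwhile if you genuinely need an Ansi charset rather than UTF-8 or UTF-16):

thCl.IOHandler.DefStringEncoding := IndyTextEncoding_UTF8; // the bytes on the wire are UTF-8
thCl.IOHandler.DefAnsiEncoding := CharsetToEncoding('shift-jis'); // or IndyTextEncoding(932)
TrackName := thCl.IOHandler.ReadLn(#10); // TrackName is now a Shift-JIS encoded AnsiString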
I have tried UTF8ToUTF16() but it tells me it can't find it, and I couldn't find which unit I should use.
It is declared in the LCL's LCLProc unit: http://lazarus-ccr.sourceforge.net/docs/lcl/lclproc/utf8toutf16.html
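For example, a minimal sketch (assuming TrackName holds UTF-8 encoded data, as set up earlier):

uses
  LCLProc; // declares UTF8ToUTF16()
var
  WideName: UnicodeString;
begin
  WideName := UTF8ToUTF16(TrackName); // UTF-8 AnsiString -> UTF-16 UnicodeString
end;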