Author Topic: A simple sane question to end insanity! TStringList in unicode mode  (Read 24118 times)

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Hi guys,

I'm trying to get rid of this pesky warning when using TStringList: Warning: Implicit string type conversion from "AnsiString" to "UnicodeString".

I've read numerous articles and followed forum discussions, but I'm still unclear about what I have to do to fix this.

I see three options:
1. There's a switch I need to add to recompile classesh.inc and the rest of the RTL in Unicode mode.
2. The RTL is not yet Unicode compatible, so these warnings are correct and I need to handle them case by case, or stop using TStringList.
3. Go to the barber, get a three-year-long haircut, and hope the problem goes away in the meantime...

Any straightforward suggestion of the type "Do XYZ and the problem goes away" or "If you want Unicode support, don't use TStringList" would be welcome!

Thanks!

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11458
  • FPC developer.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #1 on: August 05, 2017, 10:51:58 pm »
Well, 2 is true, but the core point is that a warning is not an error.

So if it does not apply to your situation (which may be the case here, since the safety of this assignment can't be determined at compile time: the code page of the 1-byte AnsiString, and with it the possibility of loss, is only determined at run time), just ignore it, or go with option 3.
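
To make the point about the warning concrete, here is a minimal sketch. It is not from the thread: the program and variable names are made up for illustration, and it assumes a plain FPC 3.x program without the Lazarus UTF-8 setup. The implicit assignment is the kind the compiler warns about; the explicit typecast performs the same run-time conversion but marks it as intended, which should silence the warning for that line.

```pascal
program ExplicitCast;
{$mode objfpc}{$H+}

var
  A: AnsiString;     // 1-byte string; its code page is only known at run time
  U: UnicodeString;  // UTF-16 string
begin
  A := 'plain ASCII text';

  // Implicit conversion: this is the kind of assignment that produces
  // the "Implicit string type conversion" warning.
  U := A;

  // Explicit conversion: same conversion at run time, but the typecast
  // tells the compiler that it is intended.
  U := UnicodeString(A);

  WriteLn(Length(U));
end.
```

The cast does not make the conversion any safer: whether data can be lost still depends on the code page the AnsiString carries at run time.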

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #2 on: August 05, 2017, 11:06:09 pm »
Thanks Marcov. It was as I feared.

In my case, it can very well be applicable. I'm writing a parser for a unicode language and so that matters.
Okeydokey...

I'll leave TStringList alone and rely on other structures then.

Cheers,

Egan

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4474
  • I like bugs.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #3 on: August 05, 2017, 11:57:16 pm »
TStringList works very well with the Lazarus UTF-8 system.
This page may have been improved since you last saw it:
  http://wiki.freepascal.org/Unicode_Support_in_Lazarus

Quote
I'm writing a parser for a unicode language and so that matters.

It works when you always use type String for a Unicode "character". This holds true for both UTF-8 and UTF-16 encodings.
From earlier posts I saw you had problems with Windows console. That is nasty indeed because it does not support Unicode.
If your data is Unicode from the start to finish then everything is easy.

See also the encoding-agnostic code in unit LazUnicode.
I planned to rewrite "UTF8_strings_and_characters" to use it but haven't done it yet.
  http://wiki.freepascal.org/Unicode_Support_in_Lazarus#Unicode_characters_and_codepoints_in_code
Show me a (maybe buggy) piece of Unicode-related code. I believe we can turn it into a version that works with both encodings and with both FPC and Delphi, and that supports all code points correctly.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Thaddy

  • Hero Member
  • *****
  • Posts: 14382
  • Sensorship about opinions does not belong here.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #4 on: August 06, 2017, 07:56:12 am »
Quote
From earlier posts I saw you had problems with Windows console. That is nasty indeed because it does not support Unicode.
If your data is Unicode from the start to finish then everything is easy.

Wrong. It does support Unicode, but the default font does not.
- Change the font to a fixed-pitch Unicode TTF one (e.g. Lucida Console).
- It also requires cmd.exe /U for Unicode I/O and pipes to work (e.g. clipboard, read, write).
- Furthermore, a Unicode default code page needs to be set (e.g. chcp 65001).

You can persist all of this in the registry.
« Last Edit: August 06, 2017, 08:21:07 am by Thaddy »
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

AlexTP

  • Hero Member
  • *****
  • Posts: 2406
    • UVviewsoft
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #5 on: August 06, 2017, 05:45:20 pm »
EganSolo,
I always call UTF8Encode (or UTF8Decode, I don't remember which) when I copy an item from a TStringList into a UnicodeString variable, and the reverse on the way back.
Could you do the same?
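
For reference, a small sketch of which function goes in which direction. This is an illustration only, with made-up variable names, and it makes no claim that these calls are required under the Lazarus UTF-8 system. UTF8Encode turns a UnicodeString into UTF-8 bytes, and UTF8Decode turns UTF-8 bytes back into a UnicodeString:

```pascal
program Utf8RoundTrip;
{$mode objfpc}{$H+}

var
  U, Back: UnicodeString;
  S8: UTF8String;
begin
  // 'a', U+00E4 and U+20AC, built from code units so that the encoding
  // of the source file plays no role.
  U := 'a';
  U := U + WideChar($00E4) + WideChar($20AC);

  S8 := UTF8Encode(U);     // UTF-16 -> UTF-8
  Back := UTF8Decode(S8);  // UTF-8 -> UTF-16

  WriteLn(Length(U), ' UTF-16 code units, ', Length(S8), ' UTF-8 bytes');
  WriteLn('round trip intact: ', Back = U);
end.
```

So reading a UTF-8 item into a UnicodeString variable would be the UTF8Decode direction, and writing it back would be UTF8Encode.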

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4474
  • I like bugs.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #6 on: August 06, 2017, 07:48:00 pm »
Quote
I always call UTF8Encode (or UTF8Decode, I don't remember which) when I copy an item from a TStringList into a UnicodeString variable.

You should not use UnicodeString except for WinAPI calls.
Using "String" makes the code compatible with Delphi at source level despite their different encodings, as amazing as it sounds.
Also, you don't need the UTF8-specific conversion functions any more because FPC converts string data automatically.
Please don't confuse people!
« Last Edit: August 06, 2017, 07:52:28 pm by JuhaManninen »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11458
  • FPC developer.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #7 on: August 06, 2017, 10:30:07 pm »
Quote
In my case, it can very well be applicable. I'm writing a parser for a unicode language and so that matters.
Okeydokey...

As Juha says, the default Lazarus mode is different but still Unicode compatible. In that case the warning is wrong, and the conversion is lossless.

Quote
I'll leave TStringList alone and rely on other structures then.

If it matters, make sure you use the proper Lazarus mode. You don't give much info, but I can't imagine any problem that warrants such drastic steps.
 

carl_caulkett

  • Sr. Member
  • ****
  • Posts: 306
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #8 on: August 06, 2017, 11:31:39 pm »
So are you saying that the compiler warnings I see when I use, for example, the TJSONConfig class with standard strings, can just be ignored?
"It builds... ship it!"

Mac Mini M1
macOS 13.6 Ventura
Lazarus 2.2.6 (release version)
FPC 3.2.2 (release version)

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4474
  • I like bugs.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #9 on: August 07, 2017, 10:11:20 am »
Quote
So are you saying that the compiler warnings I see when I use, for example, the TJSONConfig class with standard strings, can just be ignored?

Short answer: YES!

If I understand right, the string type used is:
Code: Pascal
  TJSONStringType = UTF8String;
For UTF8String <-> String assignments the compiler inserts code to test encodings and to convert data when needed.
With the Lazarus UTF-8 system no conversion is needed and the warnings can be ignored, yes.

Some JSON method parameters also use:
Code: Pascal
  TJSONUnicodeStringType = UnicodeString;
The compiler then converts automatically between UTF-16 and UTF-8. Again the warnings can be ignored.

When the Lazarus UTF-8 system is not used, you should get Unicode support by other means, such as {$ModeSwitch UnicodeStrings}.
The default AnsiString with a Windows code page leads to lossy conversions. IMO nobody should use it any more. It can be seen as a historical remnant, kept for backwards compatibility.
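
As a rough illustration of what the mode switch changes (a sketch with a made-up program name, not code from the thread): with {$ModeSwitch UnicodeStrings} active, String maps to UnicodeString and Char to WideChar, so a Char takes two bytes.

```pascal
program ModeSwitchDemo;
{$mode objfpc}{$H+}
{$modeswitch unicodestrings}

var
  S: String;  // with the mode switch active this is a UnicodeString
begin
  S := 'abc';
  // Char maps to WideChar here, so its size is that of a UTF-16 code unit.
  WriteLn('SizeOf(Char) = ', SizeOf(Char));
  WriteLn('Length(S)    = ', Length(S));
end.
```

Removing the {$modeswitch} line and recompiling shows the difference: Char goes back to a single byte.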
« Last Edit: August 07, 2017, 10:23:33 am by JuhaManninen »

carl_caulkett

  • Sr. Member
  • ****
  • Posts: 306
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #10 on: August 07, 2017, 03:59:08 pm »
In fact, the TJSONConfig class is absolutely riddled with UnicodeString declarations. Does the same advice still apply?

Thaddy

  • Hero Member
  • *****
  • Posts: 14382
  • Sensorship about opinions does not belong here.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #11 on: August 07, 2017, 04:22:04 pm »
Quote
In fact, the TJSONConfig class is absolutely riddled with UnicodeString declarations. Does the same advice still apply?

In principle, yes, I think.
Depending on the application, however, there is also:
Code: Pascal
  program unicodelist;
  {$mode delphiunicode}{$warn 5063 off}
  uses sysutils, generics.collections;
  var
    L: TList<string>; // UnicodeString in mode delphiunicode! You can also use TFPGList<string> from fgl.
  begin
    L := TList<string>.Create;
    try
      L.Add('This is a unicode string, charsize = ');
      writeln(L[0], SizeOf(Char));
    finally
      L.Free;
    end;
  end.
« Last Edit: August 07, 2017, 04:45:05 pm by Thaddy »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4474
  • I like bugs.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #12 on: August 08, 2017, 10:48:24 am »
Quote
In fact, the TJSONConfig class is absolutely riddled with UnicodeString declarations. Does the same advice still apply?

Ok, now I found TJSONConfig. I was looking at the unit fpjson earlier.
If your code uses {$ModeSwitch UnicodeStrings} then no conversions are needed.
With the Lazarus UTF-8 system the data is converted, but it works correctly.
With some DEFINEs and IFDEFs the TJSONConfig code could use different string types. It tests for "Pos('/',S)", which works with any encoding.
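
A minimal sketch of why a search like Pos('/',S) is safe on UTF-8 data (the program and its sample string are made up for illustration): in UTF-8, every byte of a multi-byte sequence is $80 or above, so an ASCII byte such as '/' can only ever be a real '/'.

```pascal
program PosInUtf8;
{$mode objfpc}{$H+}

var
  S: AnsiString;
  I: Integer;
begin
  // U+00E4 (2 bytes), then '/', then U+20AC (3 bytes), written as raw
  // UTF-8 bytes so that the encoding of the source file plays no role.
  S := #$C3#$A4'/'#$E2#$82#$AC;

  // Pos works on bytes here; the slash is the third byte of S.
  WriteLn('Pos = ', Pos('/', S));

  // No other byte of the string falls into the ASCII range.
  for I := 1 to Length(S) do
    if (I <> 3) and (Ord(S[I]) < $80) then
      WriteLn('unexpected ASCII byte at index ', I);
end.
```

Note that the index returned is a byte index, not a code point index, which is fine as long as it is only used with other byte-based routines such as Copy.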

The UTF-16 <-> UTF-8 conversion is faster than one would expect, so it is not a big problem.
It would be interesting to measure the performance penalty.
« Last Edit: August 08, 2017, 02:30:57 pm by JuhaManninen »

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #13 on: August 15, 2017, 04:31:38 am »
Apologies for the long silence. I've had to dig deeper into this issue and I admit it is confusing.

First and foremost, here are the settings I'm using across all my packages. Maybe these are wrong but that's what I was able to get from reading about unicode...

Code: Pascal
  {$modeswitch UnicodeStrings}
  {$codepage utf-8}

Now, concerning TStrings and TStringList, they use code which is specific to ansistring.

For instance, TStrings.DoCompareText, found on line 878 of stringl.inc, relies on the non-Unicode function CompareText. That function in turn is found on line 176 of sysstr.inc. As far as I can tell (and I may be wrong), it relies on the following critical comparison:
Code: Pascal
  Chr1 := byte(p1^);
  Chr2 := byte(p2^);

I fail to see how this would work in a unicode setting.

Furthermore, that same method is overridden in TStringList as follows:

Code: Pascal
  function TStringList.DoCompareText(const s1, s2: string): PtrInt;
  begin
    if FCaseSensitive then
      result:=AnsiCompareStr(s1,s2)
    else
      result:=AnsiCompareText(s1,s2);
  end;

Again, I fail to see how this code would work with unicode strings.

Lastly, I just found out that on Windows, AnsiCompareStr is actually broken. You can see that from another post of mine: http://forum.lazarus.freepascal.org/index.php/topic,37926.msg256327.html#msg256327

Due to all of these issues, I've written a drop-in replacement for TStringList that supports Unicode from the ground up.

If anyone is interested, I'd be happy to share this code with you.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4474
  • I like bugs.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #14 on: August 15, 2017, 08:48:16 am »
Quote
First and foremost, here are the settings I'm using across all my packages. Maybe these are wrong but that's what I was able to get from reading about unicode...

Apparently you did not read the page about Unicode support in Lazarus which I gave earlier.
  http://wiki.freepascal.org/Unicode_Support_in_Lazarus
After reading it you should know that the modeswitch and codepage directives are not needed.

Quote
Now, concerning TStrings and TStringList, they use code which is specific to ansistring.
...
Code: Pascal
  function TStringList.DoCompareText(const s1, s2: string): PtrInt;
  begin
    if FCaseSensitive then
      result:=AnsiCompareStr(s1,s2)
    else
      result:=AnsiCompareText(s1,s2);
  end;

Again, I fail to see how this code would work with unicode strings.

Again, it is explained in the same wiki page:
  http://wiki.freepascal.org/Unicode_Support_in_Lazarus#Technical_implementation
"Under Windows the UTF8...() functions in LazUTF8 (LazUtils) are set as backends for RTL's Ansi...() string functions. Thus those functions work in a Delphi compatible way."
It is a rather brilliant system, isn't it?

Quote
Lastly, I just found out that on Windows, AnsiCompareStr is actually broken.

That is a different issue and irrelevant here because the Windows AnsiCompareStr is not used. It is replaced by UTF8CompareStr.

Quote
Due to all of these issues, I've written a drop-in replacement for TStringList that supports Unicode from the ground up.

C'mon, TStringList supports Unicode 100% perfectly when used with the Lazarus Unicode system. Why do you refuse to understand that?
« Last Edit: August 15, 2017, 08:52:10 am by JuhaManninen »

 
