
Topic: new AnsiString question

Zoran

  • Hero Member
  • *****
  • Posts: 1974
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: new AnsiString question
« Reply #15 on: March 17, 2016, 11:02:13 am »

Quote
It means this problem is artificial. Just use "String" and everything works like magic. It is also amazingly Delphi compatible at source level, because typical code does not inspect code points beyond ASCII. For example, the ~2,000,000 LOC Lazarus project itself does not do so. We could remove all calls to the special UTF8...() functions from the IDE's code and it would still work. For now we keep them for compatibility with FPC 2.6.4.

The people who claim this UTF-8 system is "really bad" clearly have not tested it. Such claims were understandable when it was under construction and there was nothing to test. Now it can be tested!

I totally agree with you. As I said, the path Lazarus took is perfect for me. I just use the "String" type and treat it as UTF-8. I never use UTF8String and I have no problems.

... the different behaviour between a true constant and a typed constant, when the compiler says that they are both raw byte strings. In that case the assignment should not make any conversion in either case.

OK, I don't have an explanation for that one. You must ask the FPC developers.

Quote
The workaround he is referring to is: use a typed const instead of a true const.

Is that a workaround? In any case it is not needed, because you should use a "String" variable when assigning from a constant.
You can use UTF8String when assigning between variables, but even then it makes no sense. It generates useless encoding checks in your compiled code and brings no benefit.

If you use the UTF8String type (which you don't need, okay), then this is a workaround...
I just wanted to understand this behaviour of constant strings... I still suspect that this may be a bug in FPC.

Swan, ZX Spectrum emulator https://github.com/zoran-vucenovic/swan

Bart

  • Hero Member
  • *****
  • Posts: 5609
    • Bart en Mariska's Webstek
Re: new AnsiString question
« Reply #16 on: March 17, 2016, 01:17:30 pm »
Quote
The special UTF-8 related functions are now empty dummies, kept just for backwards compatibility.
The only exception I know of is some RTL / FCL code that still calls the old "A" versions of Windows API functions. They don't support Unicode (regardless of encoding).

Well, the file-related UTF8 functions are not empty.
Amazingly, functions like FileExists/FileCreate work out of the box with Unicode strings outside my codepage.
Wow!
Even better than I anticipated.
Hats off to the FPC devels.

Bart

BeniBela

  • Hero Member
  • *****
  • Posts: 947
    • homepage
Re: new AnsiString question
« Reply #17 on: March 17, 2016, 01:31:02 pm »

Quote
The only exception I know of is some RTL / FCL code that still calls the old "A" versions of Windows API functions.

And of course I call such functions somewhere in my code, too.

I see no reason why they insist on changing the ACP.

string = ansistring = ansistring(CP_ACP) seems to be an invariant in FPC, but there is no reason for that.

They could just have set string = ansistring = ansistring(CP_UTF8) = UTF8String.

That would probably be faster, too, since you would not need to check what the ACP is.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4631
  • I like bugs.
Re: new AnsiString question
« Reply #18 on: March 17, 2016, 01:56:24 pm »
Quote
Well, the file-related UTF8 functions are not empty.

They are dummy wrappers for the RTL functions. For example:
Code: Pascal
function FileExistsUTF8(const Filename: string): boolean;
begin
  Result:=SysUtils.FileExists(UTF8ToSys(Filename));
end;
This uses UTF8ToSys(), which is now essentially a no-op:
Code: Pascal
function UTF8ToSys(const s: string): string;
begin
  Result:=s;
end;

This is even documented:
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#Compatibility_with_LCL_in_Lazarus_1.x
The old behavior is retained when "DisableUTF8RTL" is defined, thus everybody should be happy.

Quote
Amazingly, functions like FileExists/FileCreate work out of the box with Unicode strings outside my codepage.
Wow!
Even better than I anticipated.

Yes, this system works better than I could have imagined, too. Just 2 years ago I was completely puzzled. I thought using TStringList would inevitably lead to encoding conversions etc.
The potential problems got solved as if by a miracle.

Quote
Hats off to the FPC devels.

... and to Lazarus devels, too.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

JuhaManninen

Re: new AnsiString question
« Reply #19 on: March 17, 2016, 02:10:04 pm »
Quote
And of course I call such functions somewhere in my code, too.

You must update your code to use the "W" versions of the Windows API in any case.
This is also required by the future Delphi-compatible system using UTF-16 encoding.
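
As a rough illustration of what moving from an "A" call to a "W" call can look like, here is a minimal sketch. It assumes FPC 3.0 or later and a String that holds UTF-8 bytes (as in an LCL application). FileExistsWide is a hypothetical helper name, not an RTL or LCL function, and on non-Windows targets the sketch simply falls back to SysUtils.FileExists.

```pascal
program WideApiSketch;
{$mode objfpc}{$H+}

uses
  {$IFDEF WINDOWS}Windows,{$ENDIF}
  SysUtils;

// Hypothetical helper, not part of the RTL or LCL.
// Assumption: Utf8Name holds UTF-8 bytes, as String does in an LCL application.
function FileExistsWide(const Utf8Name: string): Boolean;
{$IFDEF WINDOWS}
var
  WideName: UnicodeString;
  Attr: DWORD;
begin
  // Convert UTF-8 to UTF-16 explicitly, then call the "W" API.
  WideName := UTF8Decode(Utf8Name);
  Attr := GetFileAttributesW(PWideChar(WideName));
  // $FFFFFFFF is the value returned when the attributes cannot be read.
  Result := (Attr <> DWORD($FFFFFFFF)) and
            ((Attr and FILE_ATTRIBUTE_DIRECTORY) = 0);
end;
{$ELSE}
begin
  // There is no "A"/"W" split outside Windows, so defer to the RTL.
  Result := FileExists(Utf8Name);
end;
{$ENDIF}

begin
  WriteLn(FileExistsWide('example.txt'));
end.
```

The explicit UTF8Decode step is the point of the sketch: the "W" function receives UTF-16 regardless of what the active ANSI code page is.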

malcome

  • Jr. Member
  • **
  • Posts: 81
Re: new AnsiString question
« Reply #20 on: March 18, 2016, 02:35:34 am »
Thank you guys.

  • The constant string is CP_ACP (code page 0). See lines 8-9 in my code, or try "ShowMessage(IntToStr(StringCodePage('abc')));".
  • CP_ACP represents the currently set DefaultSystemCodePage. See http://wiki.lazarus.freepascal.org/FPC_Unicode_support#Code_page_identifiers
  • DefaultSystemCodePage is CP_UTF8 (code page 65001). Try "ShowMessage(IntToStr(DefaultSystemCodePage));".
  • UTF8String is CP_UTF8. See line 429 in "systemh.inc".
Are these correct?
I still cannot understand the strange result. %)
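
One way to check the points above in a single run is a small console program. This is only a sketch, assuming FPC 3.0 or later. CodePageName is a hypothetical helper, not an RTL function. The value of DefaultSystemCodePage depends on the platform and on whether something has set it, so a plain console program may print different values than the 65001 reported above from inside a Lazarus application.

```pascal
program CodePageCheck;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper (not part of the RTL): names the two code pages
// discussed in this thread and prints any other value as a number.
function CodePageName(cp: TSystemCodePage): string;
begin
  case cp of
    CP_ACP:  Result := 'CP_ACP';
    CP_UTF8: Result := 'CP_UTF8';
  else
    Result := IntToStr(cp);
  end;
end;

var
  s: string;
  u: UTF8String;
begin
  s := 'abc';
  u := 'abc';
  // What CP_ACP currently stands for at run time.
  WriteLn('DefaultSystemCodePage: ', CodePageName(DefaultSystemCodePage));
  // Code page stored in the header of each string.
  WriteLn('string variable:       ', CodePageName(StringCodePage(s)));
  WriteLn('UTF8String variable:   ', CodePageName(StringCodePage(u)));
end.
```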
« Last Edit: March 18, 2016, 02:43:18 am by malcome »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: new AnsiString question
« Reply #21 on: March 18, 2016, 06:02:34 pm »
Add some unique text to the beginning of the constant, so instead of
Code: Pascal
cs = 'abcあいうえお123';
make it, for instance:
Code: Pascal
cs = 'engkinabcあいうえお123';

Now compile your application without debug information to reduce its size. Check the executable with your preferred hex editor and you'll see two constants that start with "engkinabc" and end with "123". One is the correctly UTF-8 encoded constant used for cs, while the other (the longer one?) is not UTF-8 and is used for s. Notice that both were generated at compile time.

You can see both if you add -al to the compiler options and check the generated assembly file.
« Last Edit: March 19, 2016, 04:27:58 am by engkin »

malcome

Re: new AnsiString question
« Reply #22 on: March 22, 2016, 06:34:04 am »
Engkin, your explanation makes sense to me.
The new compiler forces a conversion at compile time...
That is hard to accept, however.
I want to tell the new compiler to mind its own business.
It feels like having bought a stupid self-driving car.
I will report this to Japanese users...
Thank you guys! :)
« Last Edit: March 22, 2016, 07:12:44 am by malcome »

Zoran

Re: new AnsiString question
« Reply #23 on: March 22, 2016, 08:29:45 am »
I think I finally understand the difference between a true const and a typed const.
With a true constant, the assignment further down in the code is resolved at compile time, whereas with a typed constant the compiler does no conversion at compile time.
The compile-time conversion assumes the ANSI code page for the constant string and converts it when assigning to UTF8String. At run time no conversion happens, because the define added by the LCL is active, which makes String variables (and typed consts) be treated as UTF-8 strings.
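
The two cases can be put side by side in a small test program. This is a sketch only: the constant names and the DumpBytes helper are hypothetical, and what it prints depends on the FPC version, the encoding the source file is saved in, any -Fc option or {$codepage} directive, and the run-time value of DefaultSystemCodePage, so no output is claimed here.

```pascal
program ConstKinds;
{$mode objfpc}{$H+}

const
  TrueConst = 'abcあいうえお123';           // true (untyped) constant
  TypedConst: string = 'abcあいうえお123';  // typed constant

// Hypothetical helper: prints the code page, byte length and raw bytes of S.
// RawByteString is used so that passing S triggers no conversion.
procedure DumpBytes(const Caption: string; const S: RawByteString);
var
  i: Integer;
begin
  Write(Caption, ' (code page ', StringCodePage(S), ', ', Length(S), ' bytes):');
  for i := 1 to Length(S) do
    Write(' ', HexStr(Ord(S[i]), 2));
  WriteLn;
end;

var
  FromTrue, FromTyped: UTF8String;
begin
  FromTrue := TrueConst;    // value is known to the compiler at compile time
  FromTyped := TypedConst;  // value is copied from initialized data at run time
  DumpBytes('from true const ', FromTrue);
  DumpBytes('from typed const', FromTyped);
end.
```

Comparing the two byte dumps shows whether the compiler re-encoded one of the constants, which is the same check as looking for the two constants in the executable or in the -al assembly output.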

malcome

Re: new AnsiString question
« Reply #24 on: March 22, 2016, 10:24:09 am »
The new compiler may be shy. :P
It does not give us even a hint about its commendable new work. :(

malcome

Re: new AnsiString question
« Reply #25 on: March 24, 2016, 12:10:13 am »
I will report that you do not have to use UTF8String in ordinary Lazarus programming.
Is my conclusion correct?

Bart

Re: new AnsiString question
« Reply #26 on: March 24, 2016, 12:22:39 am »
I have found no use case for the UTF8String type.

Bart

JuhaManninen

Re: new AnsiString question
« Reply #27 on: March 24, 2016, 12:30:59 am »
Quote
I will report that you do not have to use UTF8String in ordinary Lazarus programming.
Is my conclusion correct?

It is even documented:
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#Why_not_use_UTF8String_in_Lazarus.3F

malcome

Re: new AnsiString question
« Reply #28 on: March 24, 2016, 02:15:54 am »
This has been a very useful discussion for me.
Thank you so much.

malcome

Re: new AnsiString question
« Reply #29 on: March 24, 2016, 03:52:53 am »
I have new questions.

Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
var
  s: string;
  ws: UnicodeString;
begin
  ws:= 'abcあいうえお123';
  ShowMessage(ws); // fail

  ws:= UnicodeString('abcあいうえお123');
  ShowMessage(ws); // fail

  s:= 'abcあいうえお123';
  ws:= UnicodeString(s);
  ShowMessage(ws); // good

  ws:= LazUTF8.UTF8ToUTF16('abcあいうえお123');
  ShowMessage(ws); // good
end;

Q1: Does Lazarus (FPC) have a notation for string constants like C++'s L"ABC" or u"ABC"?
Q2: Which do you recommend, UTF8ToUTF16(v) or UnicodeString(v)?

 
