
Author Topic: new AnsiString question  (Read 42347 times)

malcome

  • Jr. Member
  • **
  • Posts: 81
new AnsiString question
« on: March 14, 2016, 03:13:10 am »
I am using the new Lazarus 1.6.0 (win32) and trying out the new AnsiString. My code is below.
Code: Pascal
  procedure TForm1.Button1Click(Sender: TObject);
  const
    cs = 'abcあいうえお123';
  var
    s: string;
    s8: UTF8String;
  begin
    ShowMessage(IntToStr(StringCodePage(cs))); // good
    ShowMessage(cs); // good

    s := cs;
    ShowMessage(IntToStr(StringCodePage(s))); // good
    ShowMessage(s); // good

    SetCodePage(RawByteString(s), CP_UTF8);
    ShowMessage(IntToStr(StringCodePage(s))); // good
    ShowMessage(s); // good

    s8 := cs;
    ShowMessage(IntToStr(StringCodePage(s8))); // good
    ShowMessage(s8); // corrupted, what???
  end;

I cannot understand the result of the last ShowMessage call.
Can anybody explain it to me?
« Last Edit: March 14, 2016, 03:46:20 am by malcome »

Zoran

  • Hero Member
  • *****
  • Posts: 1911
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: new AnsiString question
« Reply #1 on: March 14, 2016, 08:03:05 am »
I can confirm this (FPC 3.0, Lazarus 1.6), and I noticed:
  - if you change the constant declaration to "cs: String = 'abcあいうえお123';", the last message will be "good"!
  - OR if you replace the line "s8 := cs" with "s8 := s", then the last message will also be "good"!

I meant to explain that const cs is by default a system code page string (not RawByteString), but then your first message tells us that it actually is a RawByteString (code page 0). So I am also confused...
« Last Edit: March 14, 2016, 08:06:06 am by Zoran »
Swan, ZX Spectrum emulator https://github.com/zoran-vucenovic/swan

Bart

  • Hero Member
  • *****
  • Posts: 5538
    • Bart en Mariska's Webstek
Re: new AnsiString question
« Reply #2 on: March 14, 2016, 11:19:10 am »
Does it help if you insert {$codepage utf8}?

Bart

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4565
  • I like bugs.
Re: new AnsiString question
« Reply #3 on: March 14, 2016, 08:07:59 pm »
Quote
I meant to explain that const cs is by default a system code page string (not RawByteString), but then your first message tells us that it actually is a RawByteString (code page 0). So I am also confused...

What first message? The constant string is not a RawByteString; it is a UTF-8 string because the source file is encoded as UTF-8.
However, the compiler interprets it as having the system code page and then does a wrong conversion from the system code page to UTF-8.
{$codepage utf8} or -FcUTF8 overrides that: the compiler then treats constant strings as UTF-8 and the last assignment goes right.
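
To make the directive concrete, here is a minimal console sketch of the same assignment with {$codepage utf8} in place. The program name is made up for illustration, the file is assumed to be saved as UTF-8, and the cwstring line is an assumption for Unix targets where a widestring manager may be needed. It is not taken from the original poster's project.

```pascal
program CodepageDirectiveDemo;
{$mode objfpc}{$H+}
{$codepage utf8}  // declares that this source file, and its literals, are UTF-8

{$IFDEF UNIX}
uses
  cwstring;       // assumption: widestring manager for conversions on Unix
{$ENDIF}

const
  cs = 'abcあいうえお123';

var
  s8: UTF8String;
begin
  s8 := cs;  // the compiler now knows the real encoding of the literal
  WriteLn('code page:   ', StringCodePage(s8));
  WriteLn('byte length: ', Length(s8));
end.
```

Passing -FcUTF8 on the command line instead of using the directive should have the same effect for the whole compilation, as mentioned above.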

The new UTF-8 support in Lazarus works without {$codepage utf8} or -FcUTF8 most of the time. Why? It is rather counter-intuitive.
The reason is that the default String encoding is switched to UTF-8 at run-time, yet the constants are evaluated at compile-time.
So the compiler (wrongly) thinks the constant String is encoded with system code page. Then it sees a String variable with default encoding (which will be changed to UTF-8 at run-time but the compiler does not know it). Thus, same default encodings, no conversion needed, the compiler happily copies the characters and everything goes right, while actually it was fooled twice during the process.
Here are some details about the issue:
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#String_Literals
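
The run-time switch described above can be observed in a small console sketch. It assumes FPC 3.0 or later, where the System unit provides DefaultSystemCodePage and SetMultiByteConversionCodePage. The explicit call only stands in for what the Lazarus UTF-8 units are understood to do during start-up, so treat it as an illustration rather than the actual LCL code.

```pascal
program RuntimeDefaultDemo;
{$mode objfpc}{$H+}

begin
  // Value chosen by the RTL at program start (platform dependent).
  WriteLn('default code page at startup: ', DefaultSystemCodePage);

  // Assumption: roughly the switch Lazarus performs for the new UTF-8 mode.
  SetMultiByteConversionCodePage(CP_UTF8);

  // From here on, String variables with the default encoding are treated
  // as UTF-8 at run-time. Constants were already laid out at compile-time.
  WriteLn('default code page after switch: ', DefaultSystemCodePage);
end.
```

The compiler never sees the second value, which is why it cannot take the switch into account when it decides whether an assignment from a constant needs a conversion.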

This was another example of why not to use UTF8String. In any case it leads to useless encoding checks in the generated code, and in the worst case it leads to a wrong conversion.

People are asking what the future-proof String type is. It is plain String. Just pretend it is Delphi compatible and it works like magic most of the time.
The 2 exceptions are:
1. Input/output string data has system encoding. Then it must be explicitly converted.
2. Dealing with individual codepoints beyond the ASCII range. Fortunately that is not often needed in typical programming.
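
As an illustration of the second exception, the sketch below shows why byte length and codepoint count differ for UTF-8 data. CountCodepoints is a hypothetical helper written for this example. In a Lazarus project the LazUTF8 unit offers ready-made functions such as UTF8Length for this purpose. The test string is written as explicit bytes so the result does not depend on how the source file is encoded.

```pascal
program CodepointCount;
{$mode objfpc}{$H+}

// Counts UTF-8 codepoints by skipping continuation bytes (binary 10xxxxxx).
function CountCodepoints(const S: RawByteString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  s: string;
begin
  // 'abc' + U+3042 (UTF-8 bytes E3 81 82) + '123'
  s := 'abc'#$E3#$81#$82'123';
  WriteLn('bytes:      ', Length(s));          // counts bytes, not characters
  WriteLn('codepoints: ', CountCodepoints(s));
end.
```

For this string the program should report 9 bytes and 7 codepoints. Code that only copies, concatenates or searches for ASCII delimiters never notices the difference, which is why the exception rarely matters.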
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Zoran

  • Hero Member
  • *****
  • Posts: 1911
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: new AnsiString question
« Reply #4 on: March 14, 2016, 08:33:32 pm »
Quote
I meant to explain that const cs is by default a system code page string (not RawByteString), but then your first message tells us that it actually is a RawByteString (code page 0). So I am also confused...

What first message? The constant string is not RawByteString, it is an UTF-8 string because the source file is encoded as UTF-8.

Juha, I think that you misunderstood me. I wanted to say that I had meant exactly what you described, but then the first message in the procedure disproved me.

I meant that it is seen by the compiler as a system code page string, not RawByteString. However, by "the first message" in the program, I mean the line
Code: Pascal
  ShowMessage(IntToStr(StringCodePage(cs)));
and it shows "0", which is RawByteString. So it seems that the compiler does not see it as a system code page string, but as RawByteString after all?! That is what confuses me.
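
A note on the constants involved: to the best of my knowledge, the FPC System unit defines CP_ACP as 0 and CP_NONE (the marker used for RawByteString) as $FFFF. If that holds for your compiler version, a StringCodePage result of 0 points to CP_ACP, the default system code page, rather than to RawByteString. The following minimal console sketch just prints the constants so this can be checked locally.

```pascal
program CodePageConstants;
{$mode objfpc}{$H+}

begin
  // All three constants come from the System unit.
  WriteLn('CP_ACP  = ', CP_ACP);   // default (system) code page marker
  WriteLn('CP_NONE = ', CP_NONE);  // "no code page", used by RawByteString
  WriteLn('CP_UTF8 = ', CP_UTF8);
end.
```

If CP_ACP prints as 0, then the "0" reported for cs is consistent with the compiler treating the constant as a system code page string.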
Swan, ZX Spectrum emulator https://github.com/zoran-vucenovic/swan

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4565
  • I like bugs.
Re: new AnsiString question
« Reply #5 on: March 14, 2016, 08:40:18 pm »
Quote
Code: Pascal
  ShowMessage(IntToStr(StringCodePage(cs)));
and it shows "0", which is RawByteString. So it seems that the compiler does not see it as a system code page string, but as RawByteString after all?! That is what confuses me.

Ok, true.
Somehow it just works when using String. Maybe there is some magic involved. :)
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

malcome

  • Jr. Member
  • **
  • Posts: 81
Re: new AnsiString question
« Reply #6 on: March 17, 2016, 02:59:31 am »
Thank you guys. :)
But I want to know the reason why the strange result occurred, not a workaround.
Because I will report on the new AnsiString to Japanese users. ;)

Zoran

  • Hero Member
  • *****
  • Posts: 1911
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: new AnsiString question
« Reply #7 on: March 17, 2016, 08:43:37 am »
Quote
Thank you guys. :)
But I want to know the reason why the strange result occurred, not a workaround.
Because I will report on the new AnsiString to Japanese users. ;)

Yes, that is what I am also interested in.
Swan, ZX Spectrum emulator https://github.com/zoran-vucenovic/swan

loopbreaker

  • New Member
  • *
  • Posts: 32
Re: new AnsiString question
« Reply #8 on: March 17, 2016, 09:09:34 am »
Looks like a mess (caused by unsafe CP_ACP ansistring).

BTW, I wish the lead developers would decide to provide two RTL APIs:
UTF8String and UnicodeString. This would finally solve all Unicode problems.
The current hack via ACP is really bad.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4565
  • I like bugs.
Re: new AnsiString question
« Reply #9 on: March 17, 2016, 09:44:34 am »
Quote
But I want to know the reason why the strange result occurred, not a workaround.

I explained the reason in my earlier post. Please read it again.
What workaround are you referring to?

Quote
Looks like a mess (caused by unsafe CP_ACP ansistring).

What does "unsafe CP_ACP" mean?

Quote
BTW, I wish the lead developers would decide to provide two RTL APIs:
UTF8String and UnicodeString. This would finally solve all Unicode problems.
The current hack via ACP is really bad.

Do you mean all developers should change type "String" into either "UTF8String" or "UnicodeString" in all their programs?

Quote
The current hack via ACP is really bad.

No, it is not. The only problem arises if your code depends heavily on Windows system codepage data, and then you can define "DisableUTF8RTL".
Did you actually try this system?
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Zoran

  • Hero Member
  • *****
  • Posts: 1911
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: new AnsiString question
« Reply #10 on: March 17, 2016, 09:54:19 am »
Quote
Looks like a mess (caused by unsafe CP_ACP ansistring).

BTW, I wish the lead developers would decide to provide two RTL APIs:
UTF8String and UnicodeString. This would finally solve all Unicode problems.
The current hack via ACP is really bad.

Hm, I really like the way Lazarus took - use raw strings and treat them as utf8. My language uses two writing systems - I mix these two all the time and utf8 is perfect for me.

But this looks like a possible FPC bug:
If cs is declared as a true constant: "const cs = 'abcあいうえお123';", then the assignment to UTF8String makes some (unwanted) conversion. But if cs is declared as a typed constant: "const cs: String = 'abcあいうえお123';", then the assignment to UTF8String doesn't try to do any conversion.

The explanation would be that the compiler sees the true constant as ansi encoded, and the typed constant as a raw byte string. But the message shows "0" in both cases! That tells us that cs is seen by the compiler as RawByteString in both cases, so the conversion should not take place.  :-\

That is what I do not understand. Is it an FPC bug? Should it be reported in the bug tracker?
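
For anyone who wants to reproduce the comparison outside a form, a minimal console sketch follows. The names csTrue, csTyped and Dump are made up for this example. The file is assumed to be saved as UTF-8 without BOM and without {$codepage utf8}, to mirror the setup in this thread. The output will depend on the platform and on whether the default code page has been switched to UTF-8, as Lazarus does, so no particular result is claimed here.

```pascal
program ConstKinds;
{$mode objfpc}{$H+}

const
  csTrue = 'abcあいうえお123';           // true (untyped) constant
  csTyped: String = 'abcあいうえお123';  // typed constant

// Prints the dynamic code page and the raw bytes of S.
// A RawByteString parameter accepts any AnsiString without conversion.
procedure Dump(const ALabel: string; const S: RawByteString);
var
  i: Integer;
begin
  Write(ALabel, ' cp=', StringCodePage(S), ' bytes=');
  for i := 1 to Length(S) do
    Write(HexStr(Ord(S[i]), 2), ' ');
  WriteLn;
end;

var
  s8: UTF8String;
begin
  s8 := csTrue;
  Dump('from true const: ', s8);

  s8 := csTyped;
  Dump('from typed const:', s8);
end.
```

Comparing the two byte dumps shows directly whether one of the assignments went through a conversion. That is more reliable than judging by the rendered text in a message box.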
« Last Edit: March 17, 2016, 10:11:04 am by Zoran »
Swan, ZX Spectrum emulator https://github.com/zoran-vucenovic/swan

Zoran

  • Hero Member
  • *****
  • Posts: 1911
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: new AnsiString question
« Reply #11 on: March 17, 2016, 10:04:09 am »
Quote
But I want to know the reason why the strange result occurred, not a workaround.

I explained the reason in my earlier post. Please read it again.
What workaround are you referring to?

I do not see that you explained the reason for what I said in my previous post - the different behaviour between a true constant and a typed constant, when the compiler says that they are both raw byte strings. The assignment should then not make any conversion in either case.

The workaround he is referring to is - use a typed const instead of a true const.

I have no problem - I just use String all the time and treat it as utf8, never use the UTF8String type, and have no problems. Still, I would like to know the explanation of the different behaviour between a true string const and a typed string const.
« Last Edit: March 17, 2016, 10:09:50 am by Zoran »
Swan, ZX Spectrum emulator https://github.com/zoran-vucenovic/swan

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4565
  • I like bugs.
Re: new AnsiString question
« Reply #12 on: March 17, 2016, 10:34:40 am »
Quote
... the different behaviour between true constant and typed constant, when compiler says that they are both raw byte strings. The assignment should not make any conversions then in both cases.

Ok, I don't have an explanation for that one. You must ask the FPC developers.

Quote
The workaround he is referring to is - use typed const instead of true const.

Is that a workaround? In any case it is not needed, because you should use a "String" variable when assigning from a constant.
You can use UTF8String when assigning between variables, but even then it makes no sense. It generates useless encoding checks in your compiled code and does not bring any benefits.

It means this problem is artificial. Just use "String" and everything works like magic. It is also amazingly Delphi compatible at source level because typical code does not inspect codepoints beyond ASCII. For example, the ~2000000 LOC Lazarus project itself does not do it. We could remove all calls to the special UTF8...() functions from the IDE's code and it would work. For now we keep them for compatibility with FPC 2.6.4.

The people who claim this UTF-8 system is "really bad", clearly have not tested it. Such claims were understandable when this was under construction and there was nothing to test. Now it can be tested!
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Bart

  • Hero Member
  • *****
  • Posts: 5538
    • Bart en Mariska's Webstek
Re: new AnsiString question
« Reply #13 on: March 17, 2016, 10:42:06 am »
Quote
.... For example the ~2000000 LOC Lazarus project itself does not do it. We could remove all calls to the special UTF8...() functions from the IDE's code and it would work. Now we keep them for compatibility with FPC 2.6.4.

Are you sure?
Would that even work with file handling routines with filenames outside the current codepage (e.g. Chinese characters on Western European Windows)?

Bart

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4565
  • I like bugs.
Re: new AnsiString question
« Reply #14 on: March 17, 2016, 10:55:34 am »
Quote
Are you sure?
Would that even work with file handling routines with filenames outside the current codepage (e.g. Chinese characters on Western European Windows)?

Yes. Why not? All textual input/output data used by the Lazarus IDE is UTF-8. The special UTF-8 related functions are now empty dummies, kept just for backwards compatibility.
The only exception I know of is some RTL / FCL code that still calls the old "A" versions of Windows API functions. They don't support Unicode (regardless of encoding).
If there are other exceptions, then I have not understood everything about Unicode. Wouldn't be the first time ...
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

 
