Topic: testing "Set UTF8 in RTL"

malcome

  • Jr. Member
  • **
  • Posts: 81
testing "Set UTF8 in RTL"
« on: October 16, 2015, 08:31:49 am »
Hi,
I am testing "Set UTF8 in RTL" in Lazarus trunk with FPC 3.0.0 RC1.

Code: Pascal
  1.   Label1.Caption:= 'TEST: ' + IntToStr(1);
  2.   Label2.Caption:= 'テスト: ' + IntToStr(2);
  3.   Label3.Caption:= String('テスト: ') + IntToStr(3);
  4.   Label4.Caption:= AnsiString('テスト: ') + IntToStr(4);
  5.   Label5.Caption:= UTF8String('テスト: ') + IntToStr(5);


Of course, warnings are raised on line 2:
Quote
Warning: Implicit string type conversion from "AnsiString" to "UnicodeString"
Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "TTranslateString"

If this stays as it is, many parts of old code may have to be changed to adapt
(from the form in line 2 to that of line 3, 4 or 5).

Can you do anything for us?

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4695
  • I like bugs.
Re: testing "Set UTF8 in RTL"
« Reply #1 on: October 16, 2015, 01:54:33 pm »
I am also puzzled by this issue. I asked Mattias and he was able to explain it, more or less. I copied his answer below.
The fundamental reason is that FPC is not designed for UTF-8 as the default string encoding. I recommend always assigning a literal constant to a string variable on its own, with nothing else in the expression. That compiles without warnings. Then you can use the variable freely.
For example:
  ChinaText := 'テスト: ';

Ignoring the warnings is OK, too. The conversions are quite fast and seldom occur inside performance critical loops.
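
A minimal console sketch of the "assign the literal to a variable first" workaround, assuming the source file is saved as UTF-8. The program and variable names are hypothetical, and WriteLn stands in for the Label2.Caption assignment, which would need an LCL form. Whether the warnings actually disappear depends on the compiler options discussed in this thread.

```pascal
program LiteralFirstSketch;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  Prefix, Line: String; // hypothetical names, not from the original code
begin
  // Step 1: assign the literal on its own, with no other operands.
  Prefix := 'テスト: ';
  // Step 2: concatenate; both operands are now plain String values.
  Line := Prefix + IntToStr(2);
  WriteLn(Line);
end.
```

In a form unit the last step would be an assignment such as Label2.Caption := Line instead of WriteLn.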

--- Answer from Mattias ---

The compiler can't know whether the source encoding and the runtime encoding are the same.
It compiles for the worst case, which means a non-Unicode codepage for AnsiString, and only UTF-16 conversion functions are available.

You could hide these warnings.

Or do not add -FcUTF8. Then the compiler assumes source encoding = system encoding. And use {$codepage UTF8} for those source files with widestring literals.

Or encode the string:
Caption:= #$e3#$83#$86#$e3#$82#$b9#$e3#$83#$88{テスト} +':'+ IntToStr(2);
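
The byte escapes in that last variant can be cross-checked with a short console sketch. It does not depend on the source file encoding, because the three characters of テスト are written as UTF-16 code units (U+30C6, U+30B9, U+30C8) and converted with UTF8Encode from the System unit. The program and variable names are made up for illustration.

```pascal
program Utf8EscapesSketch;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  U: UnicodeString;
  S: UTF8String;
  i: Integer;
begin
  // 'テスト' as UTF-16 code units, independent of the source encoding.
  U := #$30C6#$30B9#$30C8;
  // Convert to UTF-8; each of these characters takes three bytes.
  S := UTF8Encode(U);
  // Print every byte in the #$xx escape form used above.
  for i := 1 to Length(S) do
    Write('#$', LowerCase(IntToHex(Ord(S[i]), 2)));
  WriteLn;
end.
```

The printed sequence should match the escapes in the Caption example: #$e3#$83#$86#$e3#$82#$b9#$e3#$83#$88.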
« Last Edit: October 16, 2015, 02:32:03 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

malcome

Re: testing "Set UTF8 in RTL"
« Reply #2 on: October 16, 2015, 02:29:07 pm »
Hi JuhaManninen, Thanks for your reply.

If that is true, the new RTL is worse than garbage in East Asia.
We would like to see this improved.
 
« Last Edit: October 16, 2015, 02:32:38 pm by malcome »

malcome

Re: testing "Set UTF8 in RTL"
« Reply #3 on: October 17, 2015, 01:45:45 am »
Next, a similar issue:

Code: Pascal
  1.   Application.MessageBox(PChar('Test' + IntToStr(1)), nil);
  2.   Application.MessageBox(PChar('テスト' + IntToStr(2)), nil);
  3.   Application.MessageBox(PChar(String('テスト') + IntToStr(3)), nil);
  4.   Application.MessageBox(PChar(AnsiString('テスト') + IntToStr(4)), nil);
  5.   Application.MessageBox(PChar(UTF8String('テスト') + IntToStr(5)), nil);

A warning is raised on line 2:
Quote
Warning: Implicit string type conversion from "AnsiString" to "UnicodeString"

Of course, line 2 has no problem with the ANSI RTL, but the displayed text is garbled with the new RTL.
Lines 3, 4 and 5, on the other hand, have no problem with the new RTL.

JuhaManninen

Re: testing "Set UTF8 in RTL"
« Reply #4 on: October 19, 2015, 01:50:32 am »
Quote
Code: Pascal
  Application.MessageBox(PChar('テスト' + IntToStr(2)), nil);

Of course, line 2 has no problem with the ANSI RTL, but the displayed text is garbled with the new RTL.
Lines 3, 4 and 5, on the other hand, have no problem with the new RTL.

Again, assign the string constant to a variable alone first. Then everything works as expected.

BTW, "New RTL" is not accurate.
"UTF-8 default string encoding in RTL" is closer to what actually happens.
« Last Edit: October 19, 2015, 11:06:34 am by JuhaManninen »

malcome

Re: testing "Set UTF8 in RTL"
« Reply #5 on: October 19, 2015, 02:49:17 am »
Does that mean that this is by design?
It's a bad design for East Asia. We'll be in string-constant-variable hell (or type-casting hell).
« Last Edit: October 19, 2015, 05:24:08 am by malcome »

skalogryz

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2770
    • havefunsoft.com
Re: testing "Set UTF8 in RTL"
« Reply #6 on: October 19, 2015, 04:55:37 am »
Is "Set UTF8 in RTL" FPC 3.0 friendly?
I mean, FPC 3.0 should be code-page aware, so shouldn't it work without any extra workaround?

JuhaManninen

Re: testing "Set UTF8 in RTL"
« Reply #7 on: October 19, 2015, 11:00:41 am »
Quote
Does that mean that this is by design?
It's a bad design for East Asia. We'll be in string-constant-variable hell (or type-casting hell).

Not really by design. It is a shortcoming of our new "UTF-8 hack" and of FPC, which is not exactly designed for it.
Another choice is to use FPC 3.0 without this new UTF-8 system. That leads to a swamp of hard-to-predict problems. As an example, see this issue and its related issues:
  http://bugs.freepascal.org/view.php?id=28406
A third choice is to continue using FPC 2.6.4 and Lazarus with its UTF-8 specific functions. We will make sure Lazarus can be compiled and used with FPC 2.6.4 for a long time to come.

How many constructs like "'テスト' + IntToStr(2)" do you have? There cannot be that many. To me it looks like an easy job to assign the constant to a variable first.
Besides, Application.MessageBox with its PChar is a poor example because LCL provides an equivalent with String parameter for this and for many other functions.

Quote
Is "Set UTF8 in RTL" FPC 3.0 friendly?
I mean, FPC 3.0 should be code-page aware, so shouldn't it work without any extra workaround?

Yes, this whole thing is made with FPC 3.0 in mind. Later FPC versions may have a proper UnicodeString RTL and other libraries, and then things can be done differently. That, however, is still many years away, while our UTF-8 system works already now.
« Last Edit: October 19, 2015, 11:19:15 am by JuhaManninen »

mattias

  • Administrator
  • Full Member
  • *
  • Posts: 207
    • http://www.lazarus.freepascal.org
Re: testing "Set UTF8 in RTL"
« Reply #8 on: October 19, 2015, 07:32:00 pm »
FPC 3.0+ is an improvement.
It's pretty simple:
Just use plain String (= AnsiString) and -dEnableUTF8RTL.

Don't use UTF8String and don't use -FcUTF8.
You can use Widestring/UnicodeString, but when assigning a literal to a Wide/Unicode/UTF8String you have to use UTF8Decode.
For example:
Code: Pascal
  1. Label1.Caption:= 'TEST: ' + IntToStr(1);
  2. Label2.Caption:= 'テスト: ' + IntToStr(2);
  3. Label3.Caption:= String('テスト: ') + IntToStr(3);
  4. Label4.Caption:= AnsiString('テスト: ') + IntToStr(4);
  5. Label5.Caption:= UTF8String(UTF8Decode('テスト: ')) + IntToStr(5);

The first 4 lines even work on FPC 2.6.4.

The flag -FcUTF8 and {$codepage UTF8} are only useful if you have a lot of Wide/Unicode/UTF8String literals.
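
A small console sketch of the UTF8Decode step mentioned above, with hypothetical program and variable names. To stay independent of the source file encoding, it builds the UTF-8 bytes of テスト from the escape form quoted earlier in the thread instead of from a literal. In real code the argument would simply be the UTF-8 literal.

```pascal
program Utf8DecodeSketch;
{$mode objfpc}{$H+}
var
  S: String;        // plain String holding UTF-8 bytes
  W: UnicodeString; // UTF-16 result of the explicit conversion
begin
  // UTF-8 bytes of 'テスト', three bytes per character.
  S := #$e3#$83#$86#$e3#$82#$b9#$e3#$83#$88;
  // Explicit UTF-8 to UTF-16 conversion, as recommended for
  // Wide/UnicodeString targets.
  W := UTF8Decode(S);
  WriteLn('UTF-8 bytes:       ', Length(S)); // 9
  WriteLn('UTF-16 code units: ', Length(W)); // 3
end.
```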

JuhaManninen

Re: testing "Set UTF8 in RTL"
« Reply #9 on: October 19, 2015, 09:22:55 pm »
Quote
FPC 3.0+ is an improvement.
It's pretty simple:
Just use plain String (= AnsiString) and -dEnableUTF8RTL.

Don't use UTF8String and don't use -FcUTF8.
You can use Widestring/UnicodeString, but when assigning a literal to a Wide/Unicode/UTF8String you have to use UTF8Decode.

The flag -FcUTF8 and {$codepage UTF8} are only useful if you have a lot of Wide/Unicode/UTF8String literals.

In fact -FcUTF8 or {$codepage UTF8} makes no difference!
(Edit: the statement above is not correct. It was caused by a mistake; see a later post.)

The 2nd line
Code: Pascal
  Label2.Caption:= 'テスト: ' + IntToStr(2);
gives warnings in any case:
 Implicit string type conversion from "AnsiString" to "UnicodeString"
 Implicit string type conversion with potential data loss from "UnicodeString" to "TTranslateString"

The Application.MessageBox with PChar cast does not work. Again no difference.
This is not very intuitive and I don't fully understand it but no problem. There is an easy workaround which can be documented.

Now my plan is:
1. Replace the conditional define EnableUTF8RTL with its inverse, DisableUTF8RTL, so the new UTF-8 system is used automatically for projects compiled with FPC 3.0 unless explicitly disabled by DisableUTF8RTL.
2. -FcUTF8 can be added with a button on the Custom Options page, although it is not required for the UTF-8 system to work.
3. This system is documented in the wiki.
4. The problems that arise when using the define DisableUTF8RTL with FPC 3.0, and their possible solutions, are also documented in the wiki.

How does that sound?
« Last Edit: March 30, 2016, 10:02:09 am by JuhaManninen »

mattias

Re: testing "Set UTF8 in RTL"
« Reply #10 on: October 19, 2015, 10:06:27 pm »
Compiling with or without -FcUTF8 makes a difference in the original example:

Code: Pascal
  Application.MessageBox(PChar('テスト' + IntToStr(2)), nil);

With -FcUTF8 the constant becomes a UnicodeString. Typecasting it to a PChar does not convert it; the cast directly accesses the bytes of the UTF-16 string.
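
To illustrate what "the bytes of the UTF-16 string" look like, here is a hedged console sketch with hypothetical names. It does not reproduce the MessageBox call. It only dumps the memory of a UnicodeString holding テスト, which is what a byte-oriented reader such as a PChar consumer would see.

```pascal
program Utf16BytesSketch;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  U: UnicodeString;
  P: PByte;
  i: Integer;
begin
  // 'テスト' as UTF-16 code units, independent of the source encoding.
  U := #$30C6#$30B9#$30C8;
  // View the same memory byte by byte, without any conversion.
  P := PByte(PWideChar(U));
  for i := 0 to Length(U) * SizeOf(WideChar) - 1 do
    Write(IntToHex(P[i], 2), ' ');
  WriteLn;
end.
```

On a little-endian machine this should list C6 30 B9 30 C8 30, which is not the UTF-8 sequence E3 83 86 E3 82 B9 E3 83 88 that a UTF-8 aware widget expects.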

JuhaManninen

Re: testing "Set UTF8 in RTL"
« Reply #11 on: October 20, 2015, 07:39:25 pm »
Quote
Compiling with or without -FcUTF8 makes a difference in the original example:

Right, I had -FcUTF8 in CompilerOptions as my local change. Now I can see the difference, yes.
Things work better without -FcUTF8 and malcome's original problem is solved, too.

I switched to use the UTF-8 system by default in r50129.
-FcUTF8 is not set by default but there is a button in "Custom Options" page to set it.
There is also a button in "Additions and Overrides" page to set -dDisableUTF8RTL for people who want to use system encoding for strings.

Everybody, please test. I will start updating the wiki page tomorrow.

malcome

Re: testing "Set UTF8 in RTL"
« Reply #12 on: October 28, 2015, 09:37:44 am »
Quote
How many constructs like "'テスト' + IntToStr(2)" do you have? There cannot be that many. To me it looks like an easy job to assign the constant to a variable first.
Besides, Application.MessageBox with its PChar is a poor example because LCL provides an equivalent with String parameter for this and for many other functions.

An easy job and a poor example?
You do not understand the problem in East Asia.
I have given up. Suit yourself. >:D

JuhaManninen

Re: testing "Set UTF8 in RTL"
« Reply #13 on: October 28, 2015, 11:15:02 am »
Quote
An easy job and a poor example?
You do not understand the problem in East Asia.
I have given up. Suit yourself. >:D

It is true that I don't fully understand the encoding problems in East Asia. However, it looks like leaving out -FcUTF8 solves the problem!
See my previous post. It says "... and malcome's original problem is solved, too."
Could you please test it?

I have changed Lazarus IDE so that this UTF-8 system is used automatically with FPC 3.0 and -FcUTF8 is not added automatically. I have still not updated the wiki, sorry about that. :(

Your feedback has been valuable so far. Other feedback has been mostly from people who don't like this new UTF-8 system at all, but they are not working on any alternative solution either.
So, don't give up yet!
« Last Edit: October 28, 2015, 01:04:45 pm by JuhaManninen »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12717
  • FPC developer.
Re: testing "Set UTF8 in RTL"
« Reply #14 on: October 28, 2015, 11:26:19 am »
I'm not entirely clear on what the problem is. Of course there will be some warnings, because with the UTF-8 hack there are two UTF-8 types (ansistring(0) and ansistring(65001)).

But the conversions will be very cheap; there is only minor call overhead (a jump to the conversion routine, which realizes that the codepages are the same and just moves the data).

So what is the actual problem, or am I misunderstanding something?

 
