Author Topic: What is UTF-8 Application  (Read 25521 times)

Deepaak

  • Sr. Member
  • ****
  • Posts: 454
What is UTF-8 Application
« on: February 03, 2015, 05:58:20 am »
Today I upgraded my Lazarus to SVN revision 47587 and found a change:
under Project > New Project there is a new entry

Quote
UTF-8 Application

What is the difference between an Application project and a UTF-8 Application project?
Holiday season is online now. :-)

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: What is UTF-8 Application
« Reply #1 on: February 03, 2015, 06:43:48 am »
It seems from the changes to the source code that it defines EnableUTF8RTL and it passes -FcUTF8 to the compiler. More info from the wiki:
Quote
Usually the RTL uses the system codepage for strings (e.g. FileExists and TStringList.LoadFromFile). On Windows this is a non-Unicode encoding, so you can only use characters from your language group. The LCL works with UTF-8 encoding, which covers the full Unicode range. On Linux and Mac OS X, UTF-8 is typically the system codepage, so there the RTL uses CP_UTF8 by default.

Since FPC 2.7.1 the default system codepage of the RTL can be changed to UTF-8 (CP_UTF8). So Windows users can now use UTF-8 strings in the RTL.
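
The limitation described in the wiki quote can be illustrated outside of FPC. What follows is a minimal Python sketch, not FPC code. It assumes Windows-1252 as a stand-in for the Windows system codepage, and the helper name `to_system_codepage` is hypothetical. It shows text from the local language group surviving the legacy codepage, text from another group being lost, and UTF-8 round-tripping both.

```python
# Sketch: why a non-Unicode system codepage limits you to one language group.
# Assumption: cp1252 (Western European) stands in for the Windows system codepage.

def to_system_codepage(text: str, codepage: str = "cp1252") -> bytes:
    """Encode text the way a codepage-bound API would; unmappable chars become '?'."""
    return text.encode(codepage, errors="replace")

western = "café"
cyrillic = "Привет"

# Characters inside the codepage survive a round trip.
assert to_system_codepage(western).decode("cp1252") == western

# Characters outside the codepage are replaced and cannot be recovered.
lossy = to_system_codepage(cyrillic).decode("cp1252")
print(lossy)  # prints ??????

# UTF-8 covers the full Unicode range, so both strings round-trip.
for s in (western, cyrillic):
    assert s.encode("utf-8").decode("utf-8") == s
```

On a system with a Cyrillic codepage the roles would be reversed; the point is only that any single legacy codepage covers one group of characters.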

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4660
  • I like bugs.
Re: What is UTF-8 Application
« Reply #2 on: February 03, 2015, 08:40:30 am »
Quote
What is the difference between an Application project and a UTF-8 Application project?

Wow, you people follow the commit history diligently!
The idea is to document the new improved UTF-8 support in the Better_LCL_Unicode_Support wiki page and to make it easier to test for anybody.
Right now -dEnableUTF8RTL in the "UTF8 Application" project is set in the wrong place and does not work. It should go into "Additions and Overrides" so that it is passed to all dependent libraries, including the LCL.
I will fix it later today.
If you want to test it right now, you must build Lazarus with -dEnableUTF8RTL. For example, add it to the defines in the Configure Build Lazarus window.
That is enough for simple tests because lazutils.lpk and lcl.lpk have $(IDEBuildOptions). I am testing like that, too.

Eventually, in Lazarus 2.0, this improved UTF-8 support will be the default. The define and project description are only temporary.

The feature itself has not changed much since Mattias added it last November and asked for help testing it.
  http://lists.lazarus.freepascal.org/pipermail/lazarus/2014-November/089394.html
Now I have tested it and found no problems. It is almost too good to be true! I thought it would require much more effort.
In fact, the effort went into Lazarus UTF-8 support earlier; now it was easy to port it to the new FPC features.
As a result, application code will be almost compatible with Unicode Delphi code, but faster and more robust, meaning fewer Unicode-related bugs.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12599
  • FPC developer.
Re: What is UTF-8 Application
« Reply #3 on: February 03, 2015, 10:03:17 am »
IMHO a disaster. Changing global settings makes it hard to keep packages from various corners compatible with each other.

JuhaManninen

Re: What is UTF-8 Application
« Reply #4 on: February 03, 2015, 10:16:02 am »
Quote
IMHO a disaster. Changing global settings makes it hard to keep packages from various corners compatible with each other.

Maybe you misunderstood. The purpose is to set a define for a project and all its dependent packages.
The "Additions and Overrides" can do that.
After that all the packages will have String encoded as UTF-8 and the RTL works with UTF-8 specific functions. Yes, it is quite amazing!
There will not be packages in "various corners". They will all have the same settings.

marcov

Re: What is UTF-8 Application
« Reply #5 on: February 03, 2015, 10:25:15 am »

Quote
There will not be packages in "various corners". They will all have the same settings.

That is the EXACT problem. It will probably break packages like fcl-registry and every other winapi package that calls winapi with pchar(s[1])

And all code of non-Lazarus origin needs to be modified. This is stupid. If the LCL wants to use a UTF-8 string, it should use utf8string.

The FPC RTL modifications were done to make that work fine.
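
The concern about string bytes being handed straight to a WinAPI call can be illustrated without FPC. This is a minimal Python sketch, not FPC code. It assumes the callee interprets the bytes as Windows-1252, as an ANSI API on a Western European system would, and the helper name `as_seen_by_ansi_api` is hypothetical.

```python
# Sketch: UTF-8 bytes handed unconverted to an API that expects the ANSI codepage.
# Assumption: the callee decodes bytes as cp1252; real systems may use another codepage.

def as_seen_by_ansi_api(raw: bytes, codepage: str = "cp1252") -> str:
    """Decode raw bytes the way a codepage-bound callee would read them."""
    return raw.decode(codepage)

name = "café.txt"
utf8_bytes = name.encode("utf-8")  # what a UTF-8 encoded string holds in memory

# Passing the raw bytes through: each non-ASCII character turns into two wrong ones.
garbled = as_seen_by_ansi_api(utf8_bytes)
print(garbled)  # prints cafÃ©.txt

# Converting to the expected codepage before the call keeps the name intact.
converted = utf8_bytes.decode("utf-8").encode("cp1252")
assert as_seen_by_ansi_api(converted) == name
```

Pure ASCII names are unaffected either way, which is why this kind of breakage tends to show up only with non-English data.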

JuhaManninen

Re: What is UTF-8 Application
« Reply #6 on: February 03, 2015, 11:10:04 am »
Quote
That is the EXACT problem. It will probably break packages like fcl-registry and every other winapi package that calls winapi with pchar(s[1])

Hmmm, "Additions and Overrides" affects only Lazarus packages (.lpk). I think fcl-registry and others will continue to work with FPC default settings, although I am not sure.
Right, the whole RTL is set for UTF-8 with
  SetMultiByteConversionCodePage(CP_UTF8);
I think WinAPI packages with pchar(s[1]) etc. should be modified so they can work with default CP_UTF8 Strings. There is a limited amount of such code; it is manageable.
Using the default CP_UTF8 for LCL is a near-perfect plan, it is a pity if those corner-case WinAPI calls make it void.
Mattias must explain his view and ideas here.

Unicode is difficult.
Richard Feynman remarked, "I think I can safely say that nobody understands quantum theory".
I would like to remark now, "I think I can safely say that nobody understands Unicode".

Quote
And all code of non-Lazarus origin needs to be modified. This is stupid. If the LCL wants to use a UTF-8 string, it should use utf8string.

No, using String requires the minimal amount of changes. Using Utf8String or LclString as an alias for Utf8String would require MUCH more changes.
Now the code is almost compatible with Delphi. Ansi...() functions and everything work.
Only when dealing with individual codepoints the code must be different, but that is inevitable because the encodings are different by definition.
« Last Edit: February 03, 2015, 11:48:20 am by JuhaManninen »

JuhaManninen

Re: What is UTF-8 Application
« Reply #7 on: February 03, 2015, 12:14:43 pm »
Quote
And all code of non-Lazarus origin needs to be modified. This is stupid. If the LCL wants to use a UTF-8 string, it should use utf8string.

The FPC RTL modifications were done to make that work fine.

Another POV: why does FPC allow setting the default String codepage / encoding if it cannot be used?
If it is a supported feature then it should work. If it does not work then you have a bug.
The SetMultiByteConversionCodePage() function should be either supported or then removed completely.
« Last Edit: February 03, 2015, 12:24:01 pm by JuhaManninen »

marcov

Re: What is UTF-8 Application
« Reply #8 on: February 03, 2015, 12:26:20 pm »
Quote
Hmmm, "Additions and Overrides" affects only Lazarus packages (.lpk). I think fcl-registry and others will continue to work with FPC default settings, although I am not sure.
Right, the whole RTL is set for UTF-8 with
  SetMultiByteConversionCodePage(CP_UTF8);

No. Only a few file related functions in system and sysutils. There was never any plan to support this throughout.

Quote
I think WinAPI packages with pchar(s[1]) etc. should be modified so they can work with default CP_UTF8 Strings.

I think not. First, I think UTF-8 on Windows is not sane. When Mattias first came up with that idea, I agreed, but only as transitional functionality until the LCL can truly dual-compile between unicodestring and utf8string/string.

Quote
There is a limited amount of such code; it is manageable.
Using the default CP_UTF8 for LCL is a near-perfect plan,

As said it is not a sane solution long term. Windows does not do UTF8 on API level. Period. Use a native encoding, don't try to emulate linux.

Quote
it is a pity if those corner-case WinAPI calls make it void.

A better solution would be to make them unicodestring. I don't see any future for utf8 hacks in windows specific code. (And imho not even in general code)


JuhaManninen

Re: What is UTF-8 Application
« Reply #9 on: February 03, 2015, 01:02:22 pm »
Quote
I think not. First, I think UTF-8 on Windows is not sane. When Mattias first came up with that idea, I agreed, but only as transitional functionality until the LCL can truly dual-compile between unicodestring and utf8string/string.

I honestly don't understand why you are so against UTF-8 on Windows. This puzzles me because you know more about Unicode than I do. Is there again something new in Unicode that I don't understand?

Quote
As said it is not a sane solution long term. Windows does not do UTF8 on API level. Period. Use a native encoding, don't try to emulate linux.

Who cares about the API level? New FPC is able to convert strings automatically for the API calls, thanks to you and other FPC devels.
UTF-8 has many benefits regardless of operating system. I will not list them here because you already know them better than I do.
Besides UTF-8 is already used a lot on Windows. Most text editors use it as a default encoding for files. In practice all XML is UTF-8 etc.

Quote
A better solution would be to make them unicodestring. I don't see any future for utf8 hacks in windows specific code. (And imho not even in general code)

That was my idea, too. Use UnicodeString and it works always. Now it breaks with a valid documented supported (?) FPC configuration which is a bug and must be fixed.
UTF-8 hacks should not be needed anywhere.

marcov

Re: What is UTF-8 Application
« Reply #10 on: February 03, 2015, 01:37:40 pm »
Quote
I honestly don't understand why you are so against UTF-8 on Windows.

Because it is unnatural. As said, there is no utf8 usage in code on Windows except for ported Unix software. Worse, it is Delphi incompatible. 1-byte Delphi is getting rarer and rarer, and any change will need many years to take effect.

Even on Unix many codebases are UTF-16 (like Qt), as are many newer languages (C#, Java, Objective-C).
Quote
Who cares about the API level? New FPC is able to convert strings automatically for the API calls, thanks to you and other FPC devels.

Well, we are talking about having to adapt API calling routines aren't we? And all other external code, including (and most important) Delphi code.

Quote
UTF-8 has many benefits regardless of operating system.

Those arguments confuse plain text formats with usage as base string type in code. Most quoted advantages are for plain text format.

Quote
Besides UTF-8 is already used a lot on Windows. Most text editors use it as a default encoding for files. In practice all XML is UTF-8 etc.

I said in code.

Quote
Quote
A better solution would be to make them unicodestring. I don't see any future for utf8 hacks in windows specific code. (And imho not even in general code)

That was my idea, too. Use UnicodeString and it works always. Now it breaks with a valid documented supported (?) FPC configuration which is a bug and must be fixed.

That's the problem. Because the utf8 camp keeps whining, and the people needing Delphi compatibility (including me) will never cease, and because of this stand-off, no real decisions are made.

Now people start assuming that hacks made for a transition period are here to stay and are policy.

Quote
UTF-8 hacks should not be needed anywhere.

Setting the RTL encoding to UTF-8 is still a UTF-8 hack. Changing global state to a non-default value is always a hack. There is no good reason for it except legacy 1-byte code.

Sure, FPC is not ready for string=unicodestring, but we should start hacking in the direction of a proper solution (which means utf16 on Windows, and I lean toward utf16 on *nix too, but am less pronounced there), and not exchange one set of utf8 hacks for another.

I will not cooperate in adding string=utf8 hacks into the FPC codebases, and if it becomes some form of policy, I will resign as committer. I hope I made that clear in Croatia.

The minimal work now done for the RTL only works for procedural APIs, and can't be scaled up.
« Last Edit: February 03, 2015, 01:44:57 pm by marcov »

mattias

  • Administrator
  • Full Member
  • *
  • Posts: 206
    • http://www.lazarus.freepascal.org
Re: What is UTF-8 Application
« Reply #11 on: February 03, 2015, 02:07:04 pm »
Quote
This is stupid. If the LCL wants to use a UTF-8 string, it should use utf8string.

The LCL uses the Classes unit, which uses String.
Passing UTF8String to a String adds a codepage check. If DefaultSystemCodepage<>CP_ACP, then the UTF-8 will be converted, losing all characters beyond the Windows codepage.
That is worse than the old solution.

marcov

Re: What is UTF-8 Application
« Reply #12 on: February 03, 2015, 02:09:40 pm »
Quote
This is stupid. If the LCL wants to use a UTF-8 string, it should use utf8string.

The LCL uses the Classes unit, which uses String.
Passing UTF8String to a String adds a codepage check. If DefaultSystemCodepage<>CP_ACP, then the UTF-8 will be converted, losing all characters beyond the Windows codepage.
That is worse than the old solution.

Yes. But that is all temporary till classes is unicodestring. And we won't adapt FPC code to that scheme.

We never supported string containing utf8 in the old situation either, and that it happened to work then is abusing undocumented behaviour.


mattias

Re: What is UTF-8 Application
« Reply #13 on: February 03, 2015, 03:21:09 pm »
Quote
I lean toward utf16 on *nix too, but am less pronounced there

What should that mean?
Just write it: You won't support UTF-8, everything should become UTF-16 for Delphi compatibility.

JuhaManninen

Re: What is UTF-8 Application
« Reply #14 on: February 03, 2015, 03:27:01 pm »
Quote
Yes. But that is all temporary till classes is unicodestring. And we won't adapt FPC code to that scheme.

Ok, this plan may be the real reason to oppose UTF-8 so much. I did not realize there is such a strong confrontation between the "camps", especially as FPC provided the functions that help with the default encoding.

I have a personal interest in software that will need UTF-8 all over for various reasons. I have no interest in opposing other solutions. At first I thought UTF-8 must be done the old way with AnsiString + UTF...() functions. Then I learned the RTL could be mapped to UTF-8, and it worked better than I had hoped. As I wrote earlier here, "It is almost too good to be true!" and yes, it was too good to be true...

I promise to work towards the UTF-16 solution later, but first I need UTF-8. In the worst case it is the old AnsiString + UTF...() functions, but then I am a little disappointed. We were so close to getting this working:
  http://wiki.freepascal.org/Better_LCL_Unicode_Support

I also try to keep this as pragmatic as possible; there has been enough "camp" fighting during the past 5 years.
So, I am here to find a working UTF-8 solution, not to fight against other solutions.
Still, the functions for changing encoding should be removed if their usage is forbidden.

Marco, you know Unicode better than I but I have also learned something. These things are valid in my use-case, don't know if they are valid for others. So don't get angry.

1. The solution we made is amazingly Delphi compatible. The String Ansi...() functions work, and so do the ASCII functions. Even Pos() and Copy() are compatible in most cases. In Delphi they are used because people treat UTF-16 as a fixed-width encoding; with UTF-8 they work most often because of the special properties of this encoding.
When looking at some real Delphi code, there are very few things to change.

2. 100% Delphi compatibility is not always a blessing, it can be a curse. Typical Delphi code still assumes a character is fixed width 16 bits. Tutorials and examples feed that same wrong idea. For example an article from Nick Hodges :
  http://edn.embarcadero.com/article/38693
says:  "Copy will still work as before without change. So will Delete and all the SysUtils-based string manipulation routines."
I know codepoints with 2 UnicodeChar are rare in the West, but maybe the application is marketed to China some day and then the code breaks. Copy() will get half a codepoint.
UTF-8 code must be done right always when dealing with individual codepoints.

3. I have done cross-platform code that reads an XML file, parses it and does something with the data. This all using UTF-8 encoding of LCL.
The file is already encoded as UTF-8; not a single conversion is needed for it. I think the file-open WinAPI call is the only place where the filename encoding must be converted. The actual file-read block operation does not care about encodings (I think).
This code is not specific to Unix or any other operating system. So, I honestly don't understand your sentence:
  "no utf8 usage in code on Windows except for ported Unix software".

Anyway, I will take what is given from FPC team. I understand there are camps inside the team which complicates the issue.
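
Points 1 and 2 above can be illustrated with a small experiment. This is a minimal Python sketch that mimics code-unit based Copy() and Pos(); it is not FPC or Delphi code. The helper names are hypothetical and the indices are 1-based as in Pascal. It shows a Copy() counted in UTF-16 code units cutting a surrogate pair in half, and a byte-wise Pos() on UTF-8 always landing on a character boundary, although the offset it returns is counted in bytes.

```python
# Sketch of points 1 and 2: code-unit based Copy() and Pos() on UTF-16 vs UTF-8.
# Assumption: helper names are hypothetical; indices are 1-based as in Pascal.

def utf16_copy(text: str, start: int, count: int) -> bytes:
    """Copy `count` UTF-16 code units starting at 1-based `start`."""
    data = text.encode("utf-16-le")
    return data[(start - 1) * 2:(start - 1 + count) * 2]

def ends_in_half_pair(utf16: bytes) -> bool:
    """True if the last code unit is a high surrogate with no low surrogate after it."""
    if len(utf16) < 2:
        return False
    last = int.from_bytes(utf16[-2:], "little")
    return 0xD800 <= last <= 0xDBFF

text = "a\U00020BB7b"  # U+20BB7 needs two UTF-16 code units

assert not ends_in_half_pair(utf16_copy(text, 1, 1))  # just "a"
assert ends_in_half_pair(utf16_copy(text, 1, 2))      # "a" plus half a codepoint

def utf8_pos(needle: str, haystack: str) -> int:
    """Byte-wise Pos(): 1-based byte offset of needle in haystack, 0 if absent."""
    return haystack.encode("utf-8").find(needle.encode("utf-8")) + 1

hay = "naïve café"
pos = utf8_pos("é", hay)
# The match starts on a character boundary, so the bytes before it are valid UTF-8.
prefix = hay.encode("utf-8")[:pos - 1].decode("utf-8")
print(pos, prefix)  # prints 11 naïve caf
```

The returned offset is 11 although "é" is the 10th character, because "ï" occupies two bytes. Feeding the result of Pos() into Copy() still works because both count bytes; code that treats the offset as a character index does not.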