Please look closer at current (3.x based) Lazarus versions. They set ACP to UTF8.
I think this is case 1 (ACP AnsiString, with ACP = UTF8 set at runtime),
but I propose an internal "type String = Utf8String" (UTF-8 declared, and thus known, at compile time),
analogous to the internal "type String = UnicodeString" (UTF-16 known at compile time).
> Generally, "String" (Char, PChar compatible) is just an optional alias, defined (mostly implicitly) per unit.

No, it is the other way around: the UTF8 string type is the optional type.
It's an alias for convenience only, because most users prefer this short spelling over UnicodeString, Utf8String, AnsiString or others.
Thaddy, I posted about a realistic improvement for current units of Lazarus (UTF-8) users: just insert the directive {$modeswitch utf8strings} in your existing units; that's all. With this, you no longer need the hack of changing the ACP at runtime.
Marco prefers {$modeswitch unicodestrings}, because he prefers utf16 over utf8.
In such a unit, Lazarus users must port their old (String = acpUtf8) code; they have two options (UTF-16 or UTF-8):
a) keep "String", but ensure (possibly modify) the related code still works with utf16 instead of utf8
b) rename "String" to "Utf8String"
This is an improvement over the current status, but the question is how long the transition phase to (String = utf16) will take for Lazarus devs and users, and how many will participate.
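A minimal sketch of the two porting options, assuming FPC's existing {$modeswitch unicodestrings}; the unit and procedure names are illustrative only:

```pascal
{$mode objfpc}
{$modeswitch unicodestrings}
unit PortedUnit;  // hypothetical unit name

interface

procedure Demo;

implementation

procedure Demo;
var
  s: String;       // option (a): "String" is now an alias for UnicodeString (UTF-16)
  u8: Utf8String;  // option (b): rename "String" to "Utf8String" to keep UTF-8 data
begin
  s  := 'hello';
  u8 := s;               // FPC converts UTF-16 -> UTF-8 on assignment
  WriteLn(Length(s));    // counts UTF-16 code units, not bytes
end;

end.
```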
My proposal (extension) was for the many users who want to keep their UTF-8 decision with the type name "String" (internally Utf8String).
To summarize: in the long term, we have only three "String" options:
A) stay with old (String = AnsiString = acpUtf8) hack
B) use my (String = Utf8String) extension
C) use String (utf16) and Utf8String
If option B is used and Delphi compatibility is required at a later stage,
then the user can opt for a search-and-replace (replace String with Utf8String, or a custom name).
> Well, that is the core problem. The utf8 hack was presented as a transition horse but kills all motivation to make haste, or to even allow people to start major development if it is doubtful whether it will be accepted back.

You guys (the FPC team) keep going on about the "utf8 hack", but none of you ever propose an exact alternative. So in concrete terms, Marco, what is your alternative suggestion to the "utf8 hack" implemented by the LCL? Some developers prefer to use the UTF-8 encoding internally in their application to truly support all of Unicode, without the UTF-16 surrogate-pair mess, which most applications using UTF-16 don't actually support (so in turn they only really handle UCS-2).
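To illustrate the surrogate-pair point: a code point outside the BMP occupies two UTF-16 code units but four UTF-8 code units, and Pascal's Length counts code units, not code points. A hedged sketch, assuming FPC's code-page-aware string types:

```pascal
{$mode objfpc}{$H+}
program SurrogateDemo;
var
  w: UnicodeString;
  u: UTF8String;
begin
  // U+1D49C (MATHEMATICAL SCRIPT CAPITAL A) lies outside the BMP:
  // UTF-16 needs a surrogate pair, UTF-8 needs four bytes.
  w := WideChar($D835) + WideChar($DC9C);  // high + low surrogate
  u := w;                                  // FPC converts on assignment
  WriteLn(Length(w));  // 2 UTF-16 code units for one code point
  WriteLn(Length(u));  // 4 UTF-8 code units (bytes)
end.
```

Nothing in the compiler treats the two halves of the pair as one character; that is the manual bookkeeping being complained about above.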
> That is like saying that utf8, like the current LCL version, is not allowed because old code might then do double conversions.

That's basically what it boils down to...
> (Not to forget, Utf8String is Delphi-compatible as well.)

No, it is not. And there is no {$modeswitch utf8string}; UTF8 is not in the compiler.
> Duh, unicodestring as basetype.

Duh, and that means only using UTF-16, and again we are stuck with the surrogate-pair issue, which FPC doesn't actually help with at all. FPC doesn't know anything about surrogate pairs (as per the recent mailing list conversations).
> ...but it is counterproductive as default.

And using UnicodeString, which is UTF-16 only, is any better? There the developer now has to manually check everywhere for surrogate pairs, because FPC knows nothing about surrogate pairs. I'd rather use UTF-8, thanks, where the full Unicode range is supported without me having to jump through any hoops.
> You can't have both Delphi compatibility and utf-8.

Personally, I don't give a sh*t about "Delphi compatibility". I shelved Delphi over 10 years ago, and have used FPC exclusively ever since. I see no need for Delphi these days.
> I see no need for Delphi these days.

Java, C#, C and C++ developers see no need for Pascal these days either; what is your point?
Erm, I did not see a need for the surrogate-paired characters in UTF-16 at all; whatever FPC supports is more than enough for me. The LCL does not support UTF-8 characters correctly either: for example, try to set any character with an ordinal higher than 255 as the password character of a TEdit and see how that works for you.
You do realise I mentioned "developers want to use utf-8 in their applications",
and UnicodeString (as bad as that name choice was) is UTF-16 only. So no, that is not an option.
> ...and using UnicodeString (as bad as the name choice was) is only UTF-16. So no, that is not an option.

At least I can agree with that (the bad-name part, that is). But I, like most of us, come from Delphi; in my case even from way before Delphi. In that light, UTF-8 was a lesser choice.
> Also note that surrogate pairs in UTF16 are comparatively rare.

If you think like that, you have no business using Unicode. Stick to UCS-2 then, and make it clear that your applications only support UCS-2. Don't give people false hope like the commercial text editor I purchased a while back. Almost everything new being added to the Unicode standard is being added outside the BMP range, so your problem is just going to get worse. Your comment is also highly subjective, and it heavily depends on what your application is doing. I was recently working with mapping data and math formulas; both commonly use Unicode code points outside the BMP range.
> erm I did not see a need for the surrogate paired characters in utf16 at all

See my reply to Thaddy.
:) You have no say on what my application's feature list says or does not say, just as you have no say on what my customers consider Unicode-ready and what they do not.
> BMP to function properly a string comparing algorithm uses a far wider subset, but no one expects to sort math or GIS symbols in any specific order, so even that only uses a simple numeric comparison. What is it that is so "widely used" out there?

Quite simple... custom reports where we needed width and height calculation to accurately place text, and used a custom-written algorithm to reshuffle information so as to use the space on an A4 or A5 page as efficiently as possible.
So you are writing your own layout engine? Isn't this already solved on Linux and BSD by a third-party library? Even Mozilla used one the last time I checked. What is so special that is not covered by the underlying APIs? I'm assuming that better space usage covers this, but not having spent much time on text layout myself, I'll probably need a visual sample to understand. Never mind, I'll take your word for it.
We also had a literacy and memory (game) learning application that often used symbols outside the BMP range. Again, these had to be accurately placed and scaled on screen and print.
> freely assign UTF8String <-> UnicodeString without data loss when needed.
But use it in an expression, and it will be converted using the ACP type.
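The lossless round trip claimed above can be sketched as follows; a minimal example, assuming FPC's code-page-aware AnsiString types where assignment between differently encoded strings converts automatically:

```pascal
{$mode objfpc}{$H+}
program RoundTripDemo;
var
  u8:  UTF8String;
  u16: UnicodeString;
begin
  u16 := 'abc';
  u8  := u16;            // implicit UTF-16 -> UTF-8 conversion
  u16 := u8;             // and back, without data loss
  WriteLn(u16 = 'abc');  // TRUE
end.
```

The contested point is not the assignment itself but what happens when both types meet in one expression, where the compiler has to pick an intermediate encoding.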
> UTF-8 and UTF-16 both support the full Unicode range (all UTFs do).

I obviously know that, but it seems there are some people in this forum who don't. Many seem to think UTF-16 is BMP-only. %)
> If you want to support Unicode properly, you have to treat all UTFs (except for UTF-32) as variable-length, multi-codeunit encodings.

Exactly.
If you need string to be AnsiString and want to stick to the Lazarus world, simply use AnsiString and AnsiChar instead of string and char. These are 1 byte based.
You should do so anyway, because it is the only way to keep your code understandable in Lazarus (or in modern Delphi, for that matter).
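For instance, with explicit AnsiString/AnsiChar declarations the 1-byte-based intent is visible in the code itself; a minimal sketch:

```pascal
{$mode objfpc}{$H+}
program AnsiDemo;
var
  a: AnsiString;
  c: AnsiChar;
begin
  a := 'bytes';
  for c in a do            // iterates 1-byte code units, not code points
    Write(Ord(c), ' ');    // 98 121 116 101 115
  WriteLn;
end.
```

A reader seeing AnsiChar knows immediately that the loop walks bytes, which is exactly the documentation-by-declaration being argued for here.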
> But use it in an expression, and it will be converted using the ACP type.
And? As long as the conversion is correct (and converting UTF-8 <-> UTF-16 is trivial to implement), who cares how it is performed behind the scenes?
Are you saying that 1-byte-based strings and the good old 1970s/1990s char-by-char way of programming are the only way to make things understandable? Or are you saying that declaring AnsiString as a type, specifically when you are using a 1-byte-based, normal, old-style string, is the only way to make your code readable, so that people know you are using 1-byte-based strings and not Unicode or UTF-16 strings? Did you lose sight of the original question? >:D >:D