Please look closer at current (3.x based) Lazarus versions. They set ACP to UTF8.
I think this is case 1 (ACP AnsiString, with ACP = UTF8 set at runtime),
but I propose an internal "type String = Utf8String" (UTF-8 declared, and thus known, at compile time),
analogous to the internal "type String = UnicodeString" (UTF-16 known at compile time).
> Generally, "String" (Char, PChar compatible) is just an optional alias, defined (mostly implicitly) per unit.

No, it is the other way around: the UTF8 string type is the optional type.
It's an alias for convenience only, because most users prefer this short spelling over UnicodeString, Utf8String, AnsiString or others.
Thaddy, I posted about a realistic improvement for current units of Lazarus (UTF-8) users: just insert the directive {$modeswitch utf8strings} in your existing units; that's all. With this, you no longer need the hack of changing the ACP at runtime.
Marco prefers {$modeswitch unicodestrings}, because he prefers utf16 over utf8.
In such a unit, Lazarus users must port their old (String = acpUtf8) code; they have two options (UTF-16 or UTF-8):
a) keep "String", but ensure (possibly modify) the related code still works with utf16 instead of utf8
b) rename "String" to "Utf8String"
This is an improvement over the current status, but the question is how long the transition phase to (String = utf16) will take for Lazarus devs and users, and how many will participate.
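A minimal sketch of the two porting options, assuming FPC's existing {$modeswitch unicodestrings}; the unit and procedure names are illustrative only:

```pascal
{$mode objfpc}
{$modeswitch unicodestrings}
unit PortedUnit;  // hypothetical unit name

interface

procedure Demo;

implementation

procedure Demo;
var
  s: String;       // option (a): "String" is now an alias for UnicodeString (UTF-16)
  u8: Utf8String;  // option (b): rename "String" to "Utf8String" to keep UTF-8 data
begin
  s  := 'hello';
  u8 := s;               // FPC converts UTF-16 -> UTF-8 on assignment
  WriteLn(Length(s));    // counts UTF-16 code units, not bytes
end;

end.
```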
My proposal (extension) was for the many users who want to keep their UTF-8 decision with the type name "String" (internally Utf8String).
To summarize: in the long term, we have only three "String" options:
A) stay with old (String = AnsiString = acpUtf8) hack
B) use my (String = Utf8String) extension
C) use String (utf16) and Utf8String
If option B is used and Delphi compatibility is required at a later stage,
then the user can opt for a search-and-replace (replace String with Utf8String, or a custom name).
> Well, that is the core problem. The utf8 hack was presented as a transition horse but kills all motivation to make haste, or to even allow people to start major development if it is doubtful whether it will be accepted back.

You guys (the FPC team) keep going on about the "utf8 hack", but none of you ever propose an exact alternative. So in concrete terms, Marco, what is your alternative suggestion to the "utf8 hack" implemented by the LCL? Some developers prefer to use the UTF-8 encoding internally in their application to truly support all of Unicode, without the UTF-16 surrogate-pair mess, which most applications using UTF-16 don't actually support (so in turn they only really handle UCS-2).
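To illustrate the surrogate-pair point: a code point outside the BMP occupies two UTF-16 code units but four UTF-8 code units, and Pascal's Length counts code units, not code points. A hedged sketch, assuming FPC's code-page-aware string types:

```pascal
{$mode objfpc}{$H+}
program SurrogateDemo;
var
  w: UnicodeString;
  u: UTF8String;
begin
  // U+1D49C (MATHEMATICAL SCRIPT CAPITAL A) lies outside the BMP:
  // UTF-16 needs a surrogate pair, UTF-8 needs four bytes.
  w := WideChar($D835) + WideChar($DC9C);  // high + low surrogate
  u := w;                                  // FPC converts on assignment
  WriteLn(Length(w));  // 2 UTF-16 code units for one code point
  WriteLn(Length(u));  // 4 UTF-8 code units (bytes)
end.
```

Nothing in the compiler treats the two halves of the pair as one character; that is the manual bookkeeping being complained about above.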
> That is like saying that utf8, like the current LCL version, is not allowed because old code might then do double conversions.

That's basically what it boils down to...
> (Not to forget, Utf8String is Delphi-compatible as well.)

No, it is not. And there is no {$modeswitch utf8string}; UTF8 is not in the compiler.
> Duh, unicodestring as basetype.

Duh, and that means only using UTF-16, and again we are stuck with the surrogate-pair issue, which FPC doesn't actually help with at all. FPC doesn't know anything about surrogate pairs (as per the recent mailing list conversations).
> ...but it is counterproductive as default.

And using UnicodeString, which is UTF-16 only, is any better? There the developer now has to manually check everywhere for surrogate pairs, because FPC knows nothing about surrogate pairs. I'd rather use UTF-8, thanks, where the full Unicode range is supported without me having to jump through any hoops.
> You can't have both Delphi compatibility and utf-8.

Personally, I don't give a sh*t about "Delphi compatibility". I shelved Delphi over 10 years ago, and have used FPC exclusively ever since. I see no need for Delphi these days.
> I see no need for Delphi these days.

Java, C#, C and C++ developers see no need for Pascal these days either; what is your point?
Erm, I did not see a need for the surrogate-paired characters in UTF-16 at all; whatever FPC supports is more than enough for me. The LCL does not support UTF-8 characters correctly either: for example, try to set any character with an ordinal higher than 255 as the password character of a TEdit and see how that works for you.
You do realise I mentioned "developers want to use utf-8 in their applications",
and UnicodeString (as bad as that name choice was) is UTF-16 only. So no, that is not an option.
> ...and using UnicodeString (as bad as the name choice was) is only UTF-16. So no, that is not an option.

At least I can agree with that (the bad-name part, that is). But I, like most of us, come from Delphi; in my case even from way before Delphi. In that light, UTF-8 was a lesser choice.
> Also note that surrogate pairs in UTF16 are comparatively rare.

If you think like that, you have no business using Unicode. Stick to UCS-2 then, and make it clear that your applications only support UCS-2. Don't give people false hope like the commercial text editor I purchased a while back. Almost everything new being added to the Unicode standard is being added outside the BMP range, so your problem is just going to get worse. Your comment is also highly subjective, and it heavily depends on what your application is doing. I was recently working with mapping data and math formulas; both commonly use Unicode code points outside the BMP range.
> erm I did not see a need for the surrogate paired characters in utf16 at all

See my reply to Thaddy.
:) You have no say on what my application's feature list says or does not say, just as you have no say on what my customers consider Unicode-ready and what they do not.
> BMP to function properly a string comparing algorithm uses a far wider subset, but no one expects to sort math or GIS symbols in any specific order, so even that only uses a simple numeric comparison. What is it that is so "widely used" out there?

Quite simple... custom reports where we needed width and height calculation to accurately place text, and used a custom-written algorithm to reshuffle information so as to use the space on an A4 or A5 page as efficiently as possible.
So you are writing your own layout engine? Isn't this already solved on Linux and BSD by a third-party library? Even Mozilla used one the last time I checked. What is so special that is not covered by the underlying APIs? I'm assuming that better space usage covers this, but not having spent much time on text layout myself, I'll probably need a visual sample to understand. Never mind, I'll take your word for it.
We also had a literacy and memory (game) learning application that often used symbols outside the BMP range. Again, these had to be accurately placed and scaled on screen and print.
> freely assign UTF8String <-> UnicodeString without data loss when needed.
But use it in an expression, and it will be converted using the ACP type.
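The lossless round trip claimed above can be sketched as follows; a minimal example, assuming FPC's code-page-aware AnsiString types where assignment between differently encoded strings converts automatically:

```pascal
{$mode objfpc}{$H+}
program RoundTripDemo;
var
  u8:  UTF8String;
  u16: UnicodeString;
begin
  u16 := 'abc';
  u8  := u16;            // implicit UTF-16 -> UTF-8 conversion
  u16 := u8;             // and back, without data loss
  WriteLn(u16 = 'abc');  // TRUE
end.
```

The contested point is not the assignment itself but what happens when both types meet in one expression, where the compiler has to pick an intermediate encoding.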
> UTF-8 and UTF-16 both support the full Unicode range (all UTFs do).

I obviously know that, but it seems there are some people in this forum who don't. Many seem to think UTF-16 is BMP-only. %)
> If you want to support Unicode properly, you have to treat all UTFs (except for UTF-32) as variable-length, multi-codeunit encodings.

Exactly.
If you need string to be AnsiString and want to stick to the Lazarus world, simply use AnsiString and AnsiChar instead of string and char. These are 1 byte based.
You should do so anyway, because it is the only way to keep your code understandable in Lazarus (or in modern Delphi, for that matter).
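For instance, with explicit AnsiString/AnsiChar declarations the 1-byte-based intent is visible in the code itself; a minimal sketch:

```pascal
{$mode objfpc}{$H+}
program AnsiDemo;
var
  a: AnsiString;
  c: AnsiChar;
begin
  a := 'bytes';
  for c in a do            // iterates 1-byte code units, not code points
    Write(Ord(c), ' ');    // 98 121 116 101 115
  WriteLn;
end.
```

A reader seeing AnsiChar knows immediately that the loop walks bytes, which is exactly the documentation-by-declaration being argued for here.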
> But use it in an expression, and it will be converted using the ACP type.
And? As long as the conversion is correct (and converting UTF-8 <-> UTF-16 is trivial to implement), who cares how it is performed behind the scenes?
Are you saying that 1-byte-based strings and the good old 1970s/1990s char-by-char way of programming are the only way to make things understandable? Or are you saying that declaring AnsiString as a type, specifically when you are using a 1-byte-based, normal, old-style string, is the only way to make your code readable, so that people know you are using 1-byte-based strings and not Unicode or UTF-16 strings? Did you lose sight of the original question? >:D >:D