Author Topic: FPC: Unit-scope alias String for Utf8String  (Read 4043 times)

loopbreaker

  • New member
  • *
  • Posts: 32
FPC: Unit-scope alias String for Utf8String
« on: May 05, 2017, 11:40:11 am »
Proposal: User can define type alias: String = Utf8String per source unit.
This is already possible with UnicodeString (utf16), but Utf8String is missing.

Generally, "String" (Char, PChar compatible) is just an optional alias, defined (mostly implicitly) per unit.
It's an alias for convenience only, because most users prefer this short spelling over UnicodeString, Utf8String, AnsiString or others.

Unit-scope constraint means the type-alias has to be defined via compiler directive
(Alias by type declaration (any scope) remains disallowed).

String-Alias cases:

in Delphi World:
1) String = ACP AnsiString (up to D2007)
2) String = UnicodeString (utf16)   

in Lazarus/FPC World:
1) String = ACP AnsiString
2) String = UnicodeString (utf16), see {$modeswitch  unicodestrings}   
3) String = Utf8String (not possible yet)

Case 3 is a de facto need, because across all Lazarus apps there are billions of lines which use ACP Strings with the codepage adjusted to Utf8 at runtime, regardless of the operating system (default ACP).  Case 3 is essentially the proposal, to make billions of lines safer.

Utf8String is safer (its codepage is known at compile time) than
ACP String (which depends on an app-global codepage variable shared by all modules, including thirdparty ones)

Case 3 is closer to case 2 (UnicodeString) in terms of codepage-safety, compiletime optimization and stringliteral-resolution, i.e. it would close the current quality-gap between Utf8 and Utf16 Lazarus apps.
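
For illustration, here is a minimal sketch of case 2 as it exists today, assuming FPC 3.x. The program name and variable names are made up for this example. It only shows that the modeswitch changes what "String" means in this one source file, while Utf8String stays available as an explicit type.

```pascal
program StringAliasDemo;
{$mode objfpc}{$H+}
{$modeswitch unicodestrings}   // case 2: in this file, String = UnicodeString

var
  s: String;       // UnicodeString because of the modeswitch above
  u: Utf8String;   // explicit UTF-8 type, independent of the alias
begin
  s := 'abc';
  u := 'abc';
  // SizeOf of one element shows the code unit width of each type.
  WriteLn('bytes per element of String:     ', SizeOf(s[1]));
  WriteLn('bytes per element of Utf8String: ', SizeOf(u[1]));
  // The code page of a Utf8String is part of its declaration.
  WriteLn('code page of Utf8String value:   ', StringCodePage(u));
end.
```

Removing the modeswitch line gives case 1 for the same source. A directive for case 3 is what the proposal asks for and is not shown, because it does not exist.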

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 5741
Re: FPC: Unit-scope alias String for Utf8String
« Reply #1 on: May 05, 2017, 11:48:50 am »
Please look closer at current (3.x based) Lazarus versions. They set ACP to UTF8.

loopbreaker

  • New member
  • *
  • Posts: 32
Re: FPC: Unit-scope alias String for Utf8String
« Reply #2 on: May 05, 2017, 01:17:23 pm »
Quote
Please look closer at current (3.x based) Lazarus versions. They set ACP to UTF8.

I think this is case 1 (ACP AnsiString with (ACP = UTF8) set at runtime),
but I propose an internal: type String = Utf8String (utf8 declared (known) at compile time),
analogous to the internal type String = UnicodeString (utf16 known at compile time).

To reach this type-alias within unit scope, one needs a compiler directive
(analogous to the directive which defines UnicodeString within unit-scope)
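
The difference between the two approaches can be sketched as follows, assuming FPC 3.x. That the run-time switch is done through SetMultiByteConversionCodePage is my assumption about the mechanism marcov refers to. The program and variable names are illustrative only.

```pascal
program AcpVersusUtf8;
{$mode objfpc}{$H+}

var
  a: AnsiString;   // code page CP_ACP: resolved through DefaultSystemCodePage at run time
  u: Utf8String;   // code page CP_UTF8: part of the type, known to the compiler
begin
  WriteLn('DefaultSystemCodePage at startup:   ', DefaultSystemCodePage);

  // Run-time switch of the ACP to UTF-8 (case 1 with ACP = UTF8).
  SetMultiByteConversionCodePage(CP_UTF8);
  WriteLn('DefaultSystemCodePage after switch: ', DefaultSystemCodePage);

  a := 'abc';
  u := 'abc';
  WriteLn('StringCodePage(a) = ', StringCodePage(a));
  WriteLn('StringCodePage(u) = ', StringCodePage(u));
end.
```

The point of the sketch is only that the encoding of "a" depends on a global variable any module can change, while the encoding of "u" is fixed by its declaration.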

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 5741
Re: FPC: Unit-scope alias String for Utf8String
« Reply #3 on: May 05, 2017, 01:43:35 pm »
Quote
I think this is case 1 (ACP AnsiString with (ACP = UTF8) set at runtime),
but I propose an internal: type String = Utf8String (utf8 declared (known) at compile time),
analogous to the internal type String = UnicodeString (utf16 known at compile time).

I do understand what you propose, I just don't understand why you would need that. Your motivation why the current
solution doesn't work is thin.

The current utf8string never was a base type (the pre FPC3 utf8string is something different), so compatibility is not important. There is more needed than just an alias, since the new type would become the core string type that all intermediate 1-byte string results are evaluated in. RTL routines would also have to be changed and validated (they currently hardcode codepage "0").

I consider the whole utf8 business a hack anyway.
« Last Edit: May 05, 2017, 01:46:34 pm by marcov »

Thaddy

  • Hero Member
  • *****
  • Posts: 4521
Re: FPC: Unit-scope alias String for Utf8String
« Reply #4 on: May 05, 2017, 02:42:44 pm »
Quote
Generally, "String" (Char, PChar compatible) is just an optional alias, defined (mostly implicitly) per unit.
It's an alias for convenience only, because most users prefer this short spelling over UnicodeString, Utf8String, AnsiString or others.
No. It is the other way around. The UTF8 string type is the optional type.
It does not even belong to FPC, apart from UTF8 being a special case of codepage aware string.
For the compiler there is either AnsiString or UnicodeString. And that is Unicode 16.

I agree with Marco that this is not really... well...Oh, well.

So UTF8 is the optional alias, not string...(Char, PChar). That's ONLY the case for Lazarus libraries. UTF8 is not a built-in string type for the compiler.

If you need string to be AnsiString and want to stick to the Lazarus world, simply use AnsiString and AnsiChar instead of string and char. These are 1 byte based.
You should do so anyway, because it is the only way to keep your code understandable in Lazarus (or modern Delphi versions for that matter).
And on a per unit basis it is still possible to declare:
Code: Pascal
type string = type AnsiString;
Note that is a typed alias... Back to the original string type that the compiler understands...

What Marco means, and I concur, is that the Lazarus team makes things muddy by using their own alias to UTF8, which renders the default string types useless or cumbersome.
« Last Edit: May 05, 2017, 02:56:38 pm by Thaddy »
"Logically, no number of positive outcomes at the level of experimental testing can confirm a scientific theory, but a single counterexample is logically decisive."

loopbreaker

  • New member
  • *
  • Posts: 32
Re: FPC: Unit-scope alias String for Utf8String
« Reply #5 on: May 06, 2017, 12:07:48 pm »
Thaddy, I posted about a realistic improvement to current units of Lazarus (utf8) users: Just insert a directive {$modeswitch utf8strings} in your existing units, that's all. With this you no longer need the hack of changing the ACP at runtime.

Marco prefers {$modeswitch unicodestrings}, because he prefers utf16 over utf8. In this unit, Lazarus users must port their old (String = acpUtf8) code, having two options (utf16 or utf8):
a) keep "String", but ensure (possibly modify) the related code still works with utf16 instead of utf8
b) rename "String" to "Utf8String"
This is an improvement over the current status, but the question is, how long (for Lazarus devs and users) is the transition phase to (String = utf16) and how many will participate.

My proposal (extension) was for the many users which want to keep their Utf8 decision with typename "String" (here Utf8String internally).

To summarize, in the long term, we have only three "String" options:
A) stay with old (String = AnsiString = acpUtf8) hack
B) use my (String = Utf8String) extension
C) use String (utf16) and Utf8String

If option B is used and Delphi compatibility required at a later stage,
then the user can opt for a search-and-replace (String by Utf8String (or customname)).

From A to compiletime solution (B or C): A to B is less work.
Every developer has to make a decision: A, B or C.
All other discussions are fruitless, because unrealistic.

Thaddy

  • Hero Member
  • *****
  • Posts: 4521
Re: FPC: Unit-scope alias String for Utf8String
« Reply #6 on: May 06, 2017, 12:19:33 pm »
No. Marco considers UTF8 as default string type and the way it is implemented a hack. As do I.
The compiler/rtl has three default string types: shortstring, ansistring and unicodestring. All of these can be aliased to string depending on compiler settings:
{$H-/+} and {$modeswitch unicodestrings}. You can also alias UTF8string to string on a per unit basis.
It is not a realistic improvement to add even more confusion than there already is. UTF8string is NEVER an internal string type, btw. The three others mentioned ARE internal string types.
If you use Lazarus, stick to UTF8. And use AnsiString and AnsiChar if you need a one byte based string type. Although it is legal to mix all 4 types over different units already, provided they are aliased on a per unit basis. (Same goes for modes, btw, these are also on a per unit basis)
The "extension" you mention is already implemented. Hence Lazarus can alias its UTF8string to string....

So what was the improvement? >:D >:D
I can already use {$modeswitch ansistrings}, {$modeswitch unicodestrings} and {$H+/-} apart from type string = type UTF8string etc. all on a per unit basis.
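
A small sketch of how the existing switches change the meaning of "String", assuming FPC 3.x in objfpc mode. It only uses {$H-} and {$H+}. The type names TShortAlias and TLongAlias are made up for this example.

```pascal
program StringAliasSwitches;
{$mode objfpc}

{$H-}
type
  TShortAlias = String;   // under {$H-}, String means ShortString
{$H+}
type
  TLongAlias = String;    // under {$H+}, String means AnsiString

begin
  // ShortString is stored inline: one length byte plus 255 characters.
  WriteLn('SizeOf(TShortAlias) = ', SizeOf(TShortAlias));
  // AnsiString is a reference type, so this is the size of a pointer.
  WriteLn('SizeOf(TLongAlias)  = ', SizeOf(TLongAlias));
end.
```

Adding {$modeswitch unicodestrings} at the top of a file selects UnicodeString under {$H+} in the same way.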
« Last Edit: May 06, 2017, 12:31:09 pm by Thaddy »

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 5741
Re: FPC: Unit-scope alias String for Utf8String
« Reply #7 on: May 06, 2017, 02:09:13 pm »
Quote
Thaddy, I posted about a realistic improvement to current units of Lazarus (utf8) users: Just insert a directive {$modeswitch utf8strings} in your existing units, that's all. With this you no longer need the hack of changing the ACP at runtime.

You might still need it, since already compiled units (including FCL and things like TComponent and TStrings) are defined with the old string definition, since these units are compiled without that modeswitch.

Sad as the UTF8 ACS hack is, it at least fixes that.

Quote
Marco prefers {$modeswitch unicodestrings}, because he prefers utf16 over utf8.

For the long term I prefer delphi compatibility over a make-it-up-as-you-go adventure that only strains dual maintained codebases. Making features simply Delphi-compatible also kills a lot of discussion and embellishment.

Fixing a problem is simple, what-does-delphi-do, and the direction is clear the same day. It doesn't lead to maillist discussion with several hundred mails without conclusion.

Also I think there is a lot of ill-advised spin, where the advantages of UTF8 as a web and document format are hopelessly mixed up with having UTF8 as a base string type.  Most of the people writing about it don't even fundamentally understand both Windows encoding and the string type system of FPC and Delphi.

Quote
In this unit, Lazarus users must port their old (String = acpUtf8) code, having two options (utf16 or utf8):
a) keep "String", but ensure (possibly modify) the related code still works with utf16 instead of utf8
b) rename "String" to "Utf8String"

Old unclean code is toast whichever way you go. People had to make modifications from the old UTF8 hack to the new one. Really clean code is surprisingly encoding independent.

Quote
This is an improvement over the current status, but the question is, how long (for Lazarus devs and users) is the transition phase to (String = utf16) and how many will participate.

Well that is the core problem. The utf8 hack was presented as a transition horse but kills all motivation to make haste, or to even allow people to start major development if it is doubtful if it will be accepted back.

Quote
My proposal (extension) was for the many users which want to keep their Utf8 decision with typename "String" (here Utf8String internally).

And that is what the UTF8 hack also does, so the additional value is doubtful.

Quote
To summarize, in the long term, we have only three "String" options:
A) stay with old (String = AnsiString = acpUtf8) hack
B) use my (String = Utf8String) extension
C) use String (utf16) and Utf8String

The official course to my best knowledge is long term (C) temporary (A).

Quote
If option B is used and Delphi compatibility required at a later stage,
then the user can opt for a search-and-replace (String by Utf8String (or customname)).

It doesn't work that way. Any assignment now becomes a conversion and thus dependent on ACS. Also what you conveniently skipped to comment on is that the implementation of such feature is more than a simple option to alias the type.

IF the STRING is 1 byte, currently the mother type for conversion is the string(0) (aka ACS). Changing the definition  of string does not change that.
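
A minimal sketch of that conversion issue, assuming FPC 3.x. The procedure TakesAnsi is hypothetical and stands in for any already compiled routine declared with the old string definition.

```pascal
program AssignIsConversion;
{$mode objfpc}{$H+}

// Stand-in for an existing RTL/FCL routine compiled with String = AnsiString.
procedure TakesAnsi(const s: AnsiString);
begin
  WriteLn('callee sees code page ', StringCodePage(s));
end;

var
  u: Utf8String;
begin
  u := 'abc';
  // Utf8String -> AnsiString: the declared code pages differ (CP_UTF8 vs CP_ACP),
  // so the compiler inserts a conversion whose target code page is whatever
  // DefaultSystemCodePage holds at run time.
  TakesAnsi(u);
end.
```

If DefaultSystemCodePage is not UTF-8 at that moment, characters outside that code page may be lost in the call, which is why a renamed type alone does not remove the dependency on the ACS.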

Your proposal is not thoroughly researched. The best way to find out if something is doable is to start to implement it, discover problems, fix them, and then present your work.
« Last Edit: May 06, 2017, 02:10:59 pm by marcov »

Graeme

  • Hero Member
  • *****
  • Posts: 1410
    • Graeme on the web
Re: FPC: Unit-scope alias String for Utf8String
« Reply #8 on: May 08, 2017, 02:55:53 pm »
Quote
Well that is the core problem. The utf8 hack was presented as a transition horse but kills all motivation to make haste, or to even allow people to start major development if it is doubtful if it will be accepted back.
You guys (the FPC team) keep going on about the "utf8 hack", but none of you ever propose an exact alternative. So in concrete terms, what is your alternative suggestion, Marco, to the "utf8 hack" implemented by LCL? Some developers prefer to use the UTF-8 encoding internally in their application to truly support all of Unicode, without the UTF-16 surrogate-pair mess, which most applications using UTF-16 don't actually support (so in turn they only really handle UCS-2).
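
The surrogate-pair point can be sketched like this, assuming FPC 3.x. The code point U+1D11E is just an arbitrary example of a character outside the BMP, and the program name is made up.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

var
  u: UnicodeString;
  s: Utf8String;
begin
  // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so UTF-16 stores it
  // as a surrogate pair: high surrogate $D834, low surrogate $DD1E.
  SetLength(u, 2);
  u[1] := WideChar($D834);
  u[2] := WideChar($DD1E);

  // Convert the same single code point to UTF-8.
  s := UTF8Encode(u);

  // Length counts code units (2-byte units resp. bytes), not code points.
  WriteLn('UTF-16 code units: ', Length(u));
  WriteLn('UTF-8 bytes:       ', Length(s));
end.
```

Neither Length result equals the number of code points (one), so code that indexes by element has to be aware of multi-unit sequences in both encodings.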
« Last Edit: May 08, 2017, 02:57:40 pm by Graeme »
--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

mse

  • Full Member
  • ***
  • Posts: 231
Re: FPC: Unit-scope alias String for Utf8String
« Reply #9 on: May 08, 2017, 03:19:06 pm »
If you need compatibility with newer Delphi versions use UnicodeString everywhere. If you want to work with utf-8 use FPC 3.0+ Utf8String everywhere. You can't have both Delphi compatibility and utf-8.

« Last Edit: May 08, 2017, 03:21:11 pm by mse »

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 5741
Re: FPC: Unit-scope alias String for Utf8String
« Reply #10 on: May 08, 2017, 06:26:30 pm »
Quote
Well that is the core problem. The utf8 hack was presented as a transition horse but kills all motivation to make haste, or to even allow people to start major development if it is doubtful if it will be accepted back.
You guys (the FPC team) keep going on about the "utf8 hack", but none of you ever propose an exact alternative.

Duh, unicodestring as basetype.

Quote
So in concrete terms, what is your alternative suggestion, Marco, to the "utf8 hack" implemented by LCL? Some developers prefer to use the UTF-8 encoding internally in their application to truly support all of Unicode, without the UTF-16 surrogate-pair mess, which most applications using UTF-16 don't actually support (so in turn they only really handle UCS-2).

That is like saying that utf8 like the current LCL version are not allowed because old code might then do double conversions.

Thaddy

  • Hero Member
  • *****
  • Posts: 4521
Re: FPC: Unit-scope alias String for Utf8String
« Reply #11 on: May 08, 2017, 09:01:19 pm »
Quote
That is like saying that utf8 like the current LCL version are not allowed because old code might then do double conversions.
That's basically what it boils down to...

loopbreaker

  • New member
  • *
  • Posts: 32
Re: FPC: Unit-scope alias String for Utf8String
« Reply #12 on: May 08, 2017, 10:21:37 pm »
Mse is technically correct, the encoding-declared strings (UnicodeString, Utf8String) are there and can be used. But wide adoption of specific typenames seems unlikely. I use customnames (tsw, tsx), but such agreements are unlikely as well.

The question remains whether to use them with name "String". I'm still convinced, "String" (from language view, compiler as blackbox) is just an alias. Any other meaning would be a bug in the compiler.
(sidenote: the string with brackets (shortstring) can coexist, it does not interfere)

Example1:
unit with {$modeswitch unicodestrings}:
unit with UnicodeStrings is equivalent to this unit with Strings.
(ie., renaming back and forth should not change the behavior)

Example2:
unit with {$modeswitch utf8strings}:
unit with Utf8Strings is equivalent to this unit with Strings.
(Not to forget, Utf8String is Delphi-compatible as well)

Example3:
unit without modeswitch:
unit with AnsiStrings is equivalent to this unit with Strings.
(for codepage-safety, ACP should be fixed by the operating system, so all thirdparty modules in the app can rely on this agreement).

All three units above can coexist, because AnsiString, Utf8String and UnicodeString cooperate. The rules are simple and fully transparent.

I think we have no technical, but a political issue.
And it's highly inefficient for outsiders to invest time, just to make a change in a compiler.

Thaddy

  • Hero Member
  • *****
  • Posts: 4521
Re: FPC: Unit-scope alias String for Utf8String
« Reply #13 on: May 09, 2017, 06:58:29 am »
Quote
(Not to forget, Utf8String is Delphi-compatible as well)
No it is not. And there is no {$modeswitch utf8string}. UTF8 is not in the compiler.

"project1.lpr(2,2) Warning: Illegal compiler switch "UTF8STRING""

Listen to Marco.

Also note I already summed up what is really possible and how to do that.
We really do not need more conversions back and forth. Because that is the direct consequence.

FYI I am not against UTF8, I even use it and it is good for some platforms, but it is counterproductive as default.


« Last Edit: May 09, 2017, 07:17:20 am by Thaddy »

Graeme

  • Hero Member
  • *****
  • Posts: 1410
    • Graeme on the web
Re: FPC: Unit-scope alias String for Utf8String
« Reply #14 on: May 09, 2017, 01:34:33 pm »
Quote
Duh, unicodestring as basetype.
Duh, and that means only using UTF-16 and again we are stuck with the surrogate pair issue, which FPC doesn't actually help with at all. FPC doesn't know anything about surrogate pairs (as per the recent mailing list conversations).

You do also realise I mentioned "developers want to use utf-8 in their applications", and using UnicodeString (as bad as the name choice was), is only UTF-16. So no, that is not an option.
