Author Topic: FPC: Unit-scope alias String for Utf8String  (Read 3637 times)

Graeme

  • Hero Member
  • *****
  • Posts: 1394
    • Graeme on the web
Re: FPC: Unit-scope alias String for Utf8String
« Reply #15 on: May 09, 2017, 01:37:03 pm »
Quote
...but it is counterproductive as default.

And using UnicodeString, which is UTF-16 only, is any better? Where the developer now has to manually check everywhere for surrogate pairs, because FPC knows nothing about surrogate pairs. I'd rather use UTF-8, thanks, where it supports the full Unicode range without me having to jump through any hoops.
--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

Graeme

Re: FPC: Unit-scope alias String for Utf8String
« Reply #16 on: May 09, 2017, 01:39:35 pm »
Quote
You can't have both Delphi compatibility and utf-8.

Personally, I don't give a sh*t about "Delphi compatibility". I shelved Delphi over 10 years ago, and have used FPC exclusively ever since. I see no need for Delphi these days.

taazz

  • Hero Member
  • *****
  • Posts: 4262
Re: FPC: Unit-scope alias String for Utf8String
« Reply #17 on: May 09, 2017, 03:04:27 pm »
Quote
You can't have both Delphi compatibility and utf-8.
Personally, I don't give a sh*t about "Delphi compatibility". I shelved Delphi over 10 years ago, and have used FPC exclusively ever since. I see no need for Delphi these days.

Java, C#, C and C++ developers see no need for Pascal these days either. What is your point?

Quote
...but it is counterproductive as default.
And using UnicodeString, which is UTF-16 only, is any better? Where the developer now has to manually check everywhere for surrogate pairs, because FPC knows nothing about surrogate pairs. I'd rather use UTF-8, thanks, where it supports the full Unicode range without me having to jump through any hoops.

erm I did not see a need for the surrogate paired characters in utf16 at all; whatever FPC supports is more than enough for me. The LCL does not support UTF-8 characters correctly either. For example, try to use any UTF-8 character with an ordinal higher than 255 as the password character of a TEdit and see how that works for you.

Just because you are comfortable in your little universe does not mean that everyone else is comfortable with your choices, nor that the current implementation has no shortcomings; it only means that you have a blind spot. For example, any non-Latin text takes at least 2 bytes per character in UTF-8, which produces the problem of higher CPU usage for no memory or disk gains at all over UTF-16. Is it a problem for you? I guess not, but that does not make it a sound choice for an international tool like Lazarus.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 5644
Re: FPC: Unit-scope alias String for Utf8String
« Reply #18 on: May 09, 2017, 03:31:38 pm »
Quote
Duh, unicodestring as basetype.
Duh, and that means only using UTF-16 and again we are stuck with the surrogate pair issue, which FPC doesn't actually help with at all. FPC doesn't know anything about surrogate pairs (as per the recent mailing list conversations).

1. Show me bug reports for specific functionality. Unicode discussions on the mailing list are highly coloured and overly generalist, and usually not worth the trouble.
2. A lot of 1-byte string usage is not utf8 clean either and will need to be cleaned up going forward, but with the additional constraint that it must keep working with backward compatibility.

I understand that you want to find some stick to beat utf16 with. That is pointless. I don't choose utf16 because I think it is superior, but for the following reasons:

1. Primarily, Delphi of course. Whatever minor advantages one encoding might have over the other, it is not worth being hampered by incompatibility with an ever-increasing faction of both Delphi users and component builders.
2. The current situation is bad in the sense that with utf8hack there is no ACS type. With ACS, utf8 is very stilted. This is annoying, though I assume it could be remedied, albeit again in a Delphi-compatible way.
3. Yes, the third point is also Delphi related; having a simple test of compatibility or not saves a lot of discussions and decisions (that turn out to be bad later). The Delphi model is known, flaws and all. Bad choices in an own path only emerge over time, which doesn't invite a speedy migration. Moreover, because FPC implementations of very major features are often the work of differing people in differing periods, it avoids the problem that a second, later implementer doesn't know whether something was intended, a temporary shortcut or an honest mistake.

This is already increasingly a problem with FPC extensions. (See e.g. the encoding of case of string).

Quote
You do also realise I mentioned "developers want to use utf-8 in their applications",

Yeah, and I rejected it in an earlier post in this thread as the result of people being confused between API and document encodings and/or insensitivity to Windows encoding issues.

Quote
and using UnicodeString (as bad as the name choice was), is only UTF-16. So no, that is not an option.

You have had 9 years to get over that. Don't you think it is slowly time you stop mentioning that in every unicode post? It is getting old.
« Last Edit: May 09, 2017, 05:18:06 pm by marcov »

marcov

Re: FPC: Unit-scope alias String for Utf8String
« Reply #19 on: May 09, 2017, 03:33:04 pm »
Quote
You can't have both Delphi compatibility and utf-8.
Personally, I don't give a sh*t about "Delphi compatibility". I shelved Delphi over 10 years ago, and have used FPC exclusively ever since. I see no need for Delphi these days.

That is duly noted, but other people use FPC/Lazarus too.

Thaddy

  • Hero Member
  • *****
  • Posts: 4438
Re: FPC: Unit-scope alias String for Utf8String
« Reply #20 on: May 09, 2017, 04:49:39 pm »
Quote
, and using UnicodeString (as bad as the name choice was), is only UTF-16. So no, that is not an option.

At least I can agree with that (the bad name part, that is). But I, like most of us, come from Delphi, in my case even from way before Delphi. In that light UTF8 was a lesser choice.
Also note that surrogate pairs in UTF16 are comparatively rare compared to UTF8 multi-byte encodings. In the languages that I use on a daily basis (French, English, Dutch, German, Russian and Lithuanian) there are NONE.
Also note that loads of software is still just Ansi with a code page.
« Last Edit: May 09, 2017, 04:57:56 pm by Thaddy »
"Logically, no number of positive outcomes at the level of experimental testing can confirm a scientific theory, but a single counterexample is logically decisive."

Graeme

Re: FPC: Unit-scope alias String for Utf8String
« Reply #21 on: May 10, 2017, 12:51:41 pm »
Quote
Also note that surrogate pairs in UTF16 are comparatively rare

If you think like that, you have no business using Unicode. Stick to UCS-2 then, and make it clear that your applications only support UCS-2. Don't give people false hope like the commercial text editor I purchased a while back. Almost everything new being added to the Unicode standard is being added outside the BMP range, so your problem is just going to get worse. Your comment is also highly subjective, and it heavily depends on what your application is doing. I was recently working with mapping data and math formulas, both of which commonly use Unicode code points outside the BMP range.
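
To make the "outside the BMP" point concrete, here is a small language-neutral sketch in Python (not FPC code; the helper names and the sample string are made up for illustration). It lists the code points above U+FFFF in a string and shows that each one costs two UTF-16 code units, i.e. a surrogate pair:

```python
def non_bmp_code_points(text):
    """Return the characters of `text` that lie outside the BMP (above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def utf16_code_units(text):
    """Number of 16-bit code units needed to store `text` as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


# Hypothetical sample: Latin x, MATHEMATICAL ITALIC SMALL X (U+1D465), WORLD MAP (U+1F5FA)
sample = "x \U0001D465 \U0001F5FA"

print("code points:      ", len(sample))
print("UTF-16 code units:", utf16_code_units(sample))
print("outside the BMP:  ", [f"U+{ord(c):05X}" for c in non_bmp_code_points(sample)])
```

Any code that indexes a UTF-16 string by code unit and assumes one unit per character will miscount a string like this one.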

Graeme

Re: FPC: Unit-scope alias String for Utf8String
« Reply #22 on: May 10, 2017, 12:53:01 pm »
Quote
erm I did not see a need for the surrogate paired characters in utf16 at all

See my reply to Thaddy.

taazz

Re: FPC: Unit-scope alias String for Utf8String
« Reply #23 on: May 10, 2017, 01:23:21 pm »
Quote
erm I did not see a need for the surrogate paired characters in utf16 at all
See my reply to Thaddy.

:) You have no say on what my application's feature list says or does not say, the same way you have no say on what my customers consider Unicode-ready and what they do not.
Just out of curiosity, outside the highly specialized math realm, what other applications that you used need to support the math and GIS symbols? Oh, I don't mean show them on screen of course, as that is the job of the underlying OS; I mean really use them. E.g. a language parser needs to recognize only a very narrow subset of the BMP to function properly. A string comparison algorithm uses a far wider subset, but no one expects math or GIS symbols to sort in any specific order, so even that only uses a simple numeric comparison. What is it that is so "widely used" out there?

Graeme

Re: FPC: Unit-scope alias String for Utf8String
« Reply #24 on: May 10, 2017, 08:04:58 pm »
Quote
BMP to function properly. A string comparison algorithm uses a far wider subset, but no one expects math or GIS symbols to sort in any specific order, so even that only uses a simple numeric comparison. What is it that is so "widely used" out there?

Quite simple... custom reports where we needed width and height calculations to accurately place text, and used a custom-written algorithm to reshuffle information so as to use the space on an A4 or A5 page as efficiently as possible.

We also had a literacy and memory (game) learning application that often used symbols outside the BMP range. Again, these had to be accurately placed and scaled on screen and in print.

taazz

Re: FPC: Unit-scope alias String for Utf8String
« Reply #25 on: May 10, 2017, 09:20:53 pm »
Quote
BMP to function properly. A string comparison algorithm uses a far wider subset, but no one expects math or GIS symbols to sort in any specific order, so even that only uses a simple numeric comparison. What is it that is so "widely used" out there?
Quite simple... custom reports where we needed width and height calculations to accurately place text, and used a custom-written algorithm to reshuffle information so as to use the space on an A4 or A5 page as efficiently as possible.
We also had a literacy and memory (game) learning application that often used symbols outside the BMP range. Again, these had to be accurately placed and scaled on screen and in print.

So you are writing your own layout engine? Isn't this already solved on Linux and BSD by a third-party library? Even Mozilla used it the last time I checked. What is so special that it is not covered by the underlying APIs? I'm assuming that better space usage covers this, but not having spent much time on text layout myself, I'll probably need a visual sample to understand. Never mind, I'll take your word for it.

Remy Lebeau

  • Sr. Member
  • ****
  • Posts: 324
    • Lebeau Software
Re: FPC: Unit-scope alias String for Utf8String
« Reply #26 on: May 10, 2017, 09:29:32 pm »
Quote
You can't have both Delphi compatibility and utf-8.

Sure, you can, since Delphi also has UTF8String (and has since Delphi 6, though it wasn't a native UTF-8 string until D2009).  It is only the RTL/VCL/FMX that rely on UnicodeString, but you can use UTF8String for everything else in your own code, and freely assign UTF8String <-> UnicodeString without data loss when needed.
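
The "without data loss" part is a property of the encodings themselves, independent of Delphi or FPC. A minimal sketch in Python (the helper names and sample text are hypothetical, and this does not show what either compiler's RTL does internally) that round-trips valid UTF-8 through UTF-16 and back:

```python
def utf8_to_utf16(data):
    """Re-encode valid UTF-8 bytes as UTF-16 (little-endian, no BOM)."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16_to_utf8(data):
    """Re-encode valid UTF-16 (little-endian) bytes as UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


# Hypothetical sample mixing Latin, Cyrillic, CJK and a non-BMP emoji
original = "Grüße, Привет, 日本語, \U0001F600".encode("utf-8")
round_tripped = utf16_to_utf8(utf8_to_utf16(original))

print("lossless round trip:", round_tripped == original)
```

The guarantee only holds for well-formed input; invalid byte sequences or unpaired surrogates have no counterpart in the other encoding.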
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) open source project - Admin, Developer

Remy Lebeau

Re: FPC: Unit-scope alias String for Utf8String
« Reply #27 on: May 10, 2017, 09:43:35 pm »
Quote
And using UnicodeString, which is UTF-16 only, is any better? Where the developer now has to manually check everywhere for surrogate pairs, because FPC knows nothing about surrogate pairs. I'd rather use UTF-8, thanks, where it supports the full Unicode range without me having to jump through any hoops.

UTF-8 and UTF-16 both support the full Unicode range (all Unicode Transformation Formats do).  But with UTF-8, you have to handle multi-byte sequences much more frequently than surrogates in UTF-16.  UTF-8 encodes all Unicode codepoints > U+007F (which are outside the ASCII range) using multi-byte sequences.  UTF-16, on the other hand, only encodes codepoints > U+FFFF (which are outside the UCS-2 range) using surrogates.  And the majority of human languages don't exceed that range, but things like Emoji and Symbols and such do.
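
The thresholds can be checked directly. A short sketch in Python (the `code_units` helper is hypothetical, not an FPC or Delphi API) that prints how many UTF-8 bytes and UTF-16 code units a single code point needs on either side of each boundary:

```python
def code_units(ch):
    """Return (UTF-8 bytes, UTF-16 code units) needed for one code point."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le")) // 2


# Code points on either side of each encoding-length boundary, plus one emoji
boundaries = ["\u007F", "\u0080", "\u07FF", "\u0800", "\uFFFF", "\U00010000", "\U0001F600"]

for ch in boundaries:
    utf8_len, utf16_len = code_units(ch)
    print(f"U+{ord(ch):04X}: {utf8_len} UTF-8 byte(s), {utf16_len} UTF-16 unit(s)")
```

UTF-8 changes length at U+0080, U+0800 and U+10000, while UTF-16 changes only once, at U+10000.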

UTF-8 is usually more compact than UTF-16 for storing and transmitting data (unless you are dealing with East Asian languages, in which case UTF-16 is more compact), but most 3rd party libraries/APIs use UTF-16 instead of UTF-8 for processing data because UTF-16 is easier to process than UTF-8.  UTF-16 surrogates are easier to detect and process than UTF-8 multi-byte sequences.
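
For a rough feel of the size trade-off, here is a small Python sketch (the sample phrases and the `encoded_sizes` helper are made up for illustration; real documents with markup, digits and whitespace will shift the balance further towards UTF-8):

```python
def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) of `text` in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical sample phrases, one per script
samples = {
    "English": "The quick brown fox",
    "Russian": "Быстрая коричневая лиса",
    "Japanese": "素早い茶色の狐",
}

sizes = {name: encoded_sizes(text) for name, text in samples.items()}
for name, (utf8_size, utf16_size) in sizes.items():
    print(f"{name}: UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

ASCII text is half the size in UTF-8, Cyrillic is close to a tie (only the spaces differ), and the Japanese sample is smaller in UTF-16 because each character takes 3 bytes in UTF-8 but 2 in UTF-16.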

If you want to support Unicode properly, you have to treat all UTFs (except for UTF-32) as variable-length, multi-codeunit encodings.  Because they really are (except for UTF-32).  Regardless of the frequency of how multi-codeunit sequences are used.
« Last Edit: May 13, 2017, 03:06:09 am by Remy Lebeau »

marcov

Re: FPC: Unit-scope alias String for Utf8String
« Reply #28 on: May 10, 2017, 09:47:43 pm »
Quote
freely assign UTF8String <-> UnicodeString without data loss when needed.

But use it in an expression, and it will be converted using the ACS type.

Remy Lebeau

Re: FPC: Unit-scope alias String for Utf8String
« Reply #29 on: May 10, 2017, 09:52:59 pm »
Quote
freely assign UTF8String <-> UnicodeString without data loss when needed.
But use it in an expression, and it will be converted using the ACS type.

And?  As long as the conversion is correct (and converting UTF-8 <-> UTF-16 is trivial to implement), who cares how it is performed behind the scenes?  Are you saying the conversion goes through ANSI, losing data?  If the conversion is wrong, that would be a compiler/RTL bug that needs fixing.
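
To illustrate what "goes through ANSI, losing data" would look like if it happened, here is a sketch in Python. It only simulates a detour through a single-byte code page; cp1252 and the `via_ansi` helper are assumptions chosen for the example, and nothing here shows what FPC or Delphi actually do in an expression, which depends on the code page in effect:

```python
def via_ansi(text, code_page="cp1252"):
    """Simulate a conversion that detours through a single-byte ANSI code page.

    Characters the code page cannot represent are replaced with '?'.
    """
    return text.encode(code_page, errors="replace").decode(code_page)


def via_utf16(text):
    """Simulate a direct UTF-8 -> UTF-16 -> UTF-8 conversion."""
    utf16 = text.encode("utf-8").decode("utf-8").encode("utf-16-le")
    return utf16.decode("utf-16-le")


text = "café Привет"  # hypothetical sample: Latin-1 letters plus Cyrillic

print("through cp1252:", via_ansi(text))
print("through UTF-16:", via_utf16(text))
```

The accented Latin letter survives the cp1252 detour, while the Cyrillic word does not; the direct UTF-8 <-> UTF-16 path keeps everything.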
