
Topic: Is it supposed to be possible to build FPC in "Unicode" mode, or not?

Akira1364

I've heard various people mention, on different occasions, building the RTL with FPC_OS_UNICODE, EnableUTF8RTL, etc. None of my attempts to do this have ever been successful, regardless of what combination of defines I use. (This is always with the latest trunk.) Am I missing something? Were these people using specifically modified sources?

Thaddy

« Reply #1 on: January 12, 2016, 05:38:54 am »
For just FPC that would be:
Code:
  make clean all install OPT="-dFPC_OS_UNICODE"

You need FPC 3 or trunk for this, but you can use 2.6.4 to bootstrap, I believe.
« Last Edit: January 12, 2016, 05:47:41 am by Thaddy »
also related to equus asinus.

Jonas Maebe

« Reply #2 on: January 12, 2016, 09:26:09 am »
Quote
I've heard various people mention, on different occasions, building the RTL with FPC_OS_UNICODE, EnableUTF8RTL, etc. None of my attempts to do this have ever been successful, regardless of what combination of defines I use. (This is always with the latest trunk.) Am I missing something? Were these people using specifically modified sources?
FPC_OS_UNICODE is a define that is set if the target OS uses UTF-16 for its API functions. It is unrelated to how the RTL was built. See http://wiki.freepascal.org/User_Changes_3.0#Define_UNICODE_was_changed_to_FPC_OS_UNICODE

EnableUTF8RTL is a define you can use while building the LCL, and unrelated to FPC or its RTL. Or rather, it was a define you could use: nowadays the related functionality is enabled by default in Lazarus if you compile with FPC 3.0 or later, and you can instead disable it by using -dDisableUTF8RTL. See http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus

Compiling the FPC RTL in unicode/UTF-16 mode is not yet supported, see http://wiki.freepascal.org/FPC_New_Features_3.0#New_delphiunicode_syntax_mode .
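
To illustrate the distinction, here is a minimal sketch, assuming FPC 3.0 or later (the program name is made up). The syntax mode changes what String and Char mean in the source file being compiled; the RTL it links against is still the regular build.

```pascal
program ModeDemo;

{$mode delphiunicode}  // in this file: String = UnicodeString, Char = WideChar

var
  S: String;
begin
  S := 'abc';
  // Only the meaning of the types in this source file changes;
  // the RTL itself was not rebuilt in any special way.
  WriteLn('SizeOf(Char)    = ', SizeOf(Char));
  WriteLn('SizeOf(S[1])    = ', SizeOf(S[1]));
  WriteLn('Length(S)       = ', Length(S));
end.
```

With this mode both sizes should be reported as 2 bytes, whereas the same program under {$mode objfpc} would use one-byte characters.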

marcov

« Reply #3 on: January 12, 2016, 11:01:01 am »
To put it more concretely: FPC_OS_UNICODE mainly means that the aliases for the "T" functions and types (those without an A or W suffix) in the Windows headers default to the W variant (as in D2009+) rather than the A variant (as before D2009).

This define has very little influence beyond the Windows header units.
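
The aliasing pattern can be sketched roughly as follows. This is only an illustration: the TDemoChar* names are hypothetical stand-ins and are not taken from the real Windows header units.

```pascal
program AliasSketch;

{$mode objfpc}{$H+}

type
  // Hypothetical stand-ins for the A and W flavours of an API type.
  TDemoCharA = AnsiChar;
  TDemoCharW = WideChar;

  // The unsuffixed alias follows the define set by the compiler.
  {$ifdef FPC_OS_UNICODE}
  TDemoChar = TDemoCharW;
  {$else}
  TDemoChar = TDemoCharA;
  {$endif}

begin
  // Reports which flavour the unsuffixed alias resolved to on this target.
  WriteLn('SizeOf(TDemoChar) = ', SizeOf(TDemoChar));
end.
```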

tk

« Reply #4 on: January 30, 2016, 07:55:54 pm »
I have just seen this post and have a question, asked without much reading or testing:

Can I switch the new Lazarus 1.6 with FPC 3.0 to UTF-16 completely?

The goal is to be compatible with Delphi's UTF-16-encoded UnicodeString.
There we use only simple indexed access to codepoints (and ignore surrogate pairs).
And this has been a compatibility problem between Delphi and Lazarus so far:
http://wiki.lazarus.freepascal.org/for-in_loop#Traversing_UTF-8_strings


Thaddy

« Reply #5 on: January 30, 2016, 08:14:10 pm »
No, you can't (yet) without problems. You can build it. It works, but only sort of.

JuhaManninen

« Reply #6 on: January 30, 2016, 11:48:03 pm »
Quote
The goal is to be compatible with Delphi UTF16 encoded UnicodeString.
There we use only simple indexed access to codepoints (and ignore surrogate pairs).

No, you have simple indexed access to codeunits, not codepoints. UTF-16 is not a fixed-width encoding.
Yes, there is lots of broken Delphi code that treats it as fixed-width.
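
A small sketch of the difference, assuming FPC 3.0 or later. CountCodePoints is a hypothetical helper written for this example, not an RTL function.

```pascal
program CodeUnitsDemo;

{$mode objfpc}{$H+}

// Hypothetical helper: counts codepoints by treating a valid
// high+low surrogate pair as a single codepoint.
function CountCodePoints(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and
       (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)   // surrogate pair: two codeunits, one codepoint
    else
      Inc(I);     // BMP codeunit (or an unpaired surrogate)
    Inc(Result);
  end;
end;

var
  S: UnicodeString;
begin
  // 'A' followed by U+1F600, stored as the surrogate pair D83D DE00.
  SetLength(S, 3);
  S[1] := 'A';
  S[2] := WideChar($D83D);
  S[3] := WideChar($DE00);

  WriteLn('codeunits:  ', Length(S));           // 3
  WriteLn('codepoints: ', CountCodePoints(S));  // 2
end.
```

Here S[2] on its own is half of a pair, which is what indexed access hands back when the string is treated as fixed-width.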

Quote
And this has been a compatibility problem Delphi vs. Lazarus so far:
http://wiki.lazarus.freepascal.org/for-in_loop#Traversing_UTF-8_strings

The good news is that enumerator classes and helper functions can be made for both UTF-8 and UTF-16 using the same semantics.
They could be used transparently for either encoding in a future Lazarus version, making them almost source compatible systems.
A single UTF-16 codepoint would be kept in a string like it is now with UTF-8.
TStringEnumerator would be a UTF-16 version when UTF-16 is the default encoding, and a UTF-8 version otherwise.
CodePointLength(), CodePointCopy() and similar functions could abstract the access of codepoints.

As an extra benefit the code for UTF-16 becomes more robust and less buggy.
I think they should be implemented soon.
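
As a rough idea of what the UTF-8 side of such a helper might look like: the CodePointLength below is a hypothetical sketch, not an existing RTL or LCL function, and it assumes well-formed UTF-8 input.

```pascal
program Utf8LengthSketch;

{$mode objfpc}{$H+}

// Hypothetical sketch: counts codepoints in well-formed UTF-8 by
// counting every byte that is not a continuation byte (10xxxxxx).
function CodePointLength(const S: RawByteString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: RawByteString;
begin
  // 'a', U+00E4 (C3 A4) and U+20AC (E2 82 AC) given as raw UTF-8 bytes.
  S := 'a'#$C3#$A4#$E2#$82#$AC;

  WriteLn('bytes:      ', Length(S));           // 6
  WriteLn('codepoints: ', CodePointLength(S));  // 3
end.
```

A UTF-16 version with the same name and semantics would only differ in how it recognises the start of a codepoint.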

tk

« Reply #7 on: April 14, 2016, 05:52:00 pm »
Quote
No, you have simple indexed access to codeunits, not codepoints. UTF-16 is not a fixed-width encoding.
Yes, there is lots of broken Delphi code that treats it as fixed-width.

Unless you write a very, very special application that needs to show characters outside the Basic Multilingual Plane, you can take UTF-16 = UCS-2.
Simple indexed access is then possible.
I would not call any Delphi code "broken", because originally it used this encoding and I don't know anyone who would bother with surrogate pairs.
Neither do I/we. We have a lot of code with just indexed access in Delphi!
UCS-2 (= UTF-16 when neglecting surrogates) is a very good trade-off between memory requirements for string storage and the simplicity of string manipulation algorithms.
And that's also why I would highly appreciate seeing Lazarus completely in UTF-16 (or switchable to UTF-16), or at least UCS-2.

Any progress improving Lazarus compatibility with Delphi is welcome. At some point I should finally be able to open my large project group (originally created in Delphi XE+) in Lazarus and rebuild it as is. That day I would celebrate!

loopbreaker

« Reply #8 on: April 14, 2016, 09:34:51 pm »
Quote
Any progress improving Lazarus compatibility with Delphi is welcome

Do you mean that the regular String should be UnicodeString in Lazarus as well?
Recently I made a similar proposal: http://forum.lazarus.freepascal.org/index.php/topic,32248.0.html

I think the blocking issue is that the main Lazarus contributors want String to
remain UTF-8, although in that case they could resolve all issues
by renaming the type (to Utf8String or a shorter alias) in Lazarus.

In my own projects (still Delphi only) I don't use String at all,
but have my own aliases (tsx = Utf8String, tsw = UnicodeString),
so I'm always sure what I get.
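
A minimal sketch of that alias approach, written here for FPC 3.0 or later rather than Delphi. Only the alias names come from the description above; the rest is illustrative.

```pascal
program AliasDemo;

{$mode objfpc}{$H+}

type
  tsx = UTF8String;     // always UTF-8, one byte per codeunit
  tsw = UnicodeString;  // always UTF-16, two bytes per codeunit

var
  W: tsw;
  X: tsx;
begin
  SetLength(W, 2);
  W[1] := 'a';
  W[2] := WideChar($20AC);  // euro sign, U+20AC

  X := UTF8Encode(W);       // explicit conversion between the two aliases

  WriteLn('tsw codeunits: ', Length(W));  // 2
  WriteLn('tsx codeunits: ', Length(X));  // 4 (1 byte + 3 bytes)
end.
```

Because neither alias depends on what String means in the current mode, the same declarations read identically under any syntax mode.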

BTW, some codepoints (e.g. emoticons) above the BMP are useful :)

loopbreaker

« Reply #9 on: April 17, 2016, 02:08:53 pm »
To be more precise, I meant mainly the definition of String,
so that String itself is compatible in all sources.

Where String (=UnicodeString) is actually used later
can be decided separately, because String and
Utf8String can coexist. For example, the LCL can remain UTF-8,
but via Utf8String.

So if someone has the energy, he could proceed in steps;
after each step everything still works, so later steps are not mandatory.

1) Create a fork of Lazarus and rename every occurrence of String
(which is already used as UTF-8) to Utf8String, Char to AnsiChar, and PChar to PAnsiChar.
Source files should be UTF-8.
2) To define the new meaning of String/Char/PChar,
build everything with {$mode delphiunicode}.
3) Possibly add methods and properties with UnicodeString parameters
(wrappers around UTF-8 can suffice) to minimize the number
of implicit conversions. This last step could also be scaled up
to Delphi API compatibility, if he really wishes.
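
Step 3 could look roughly like this. TDemoLabel and its methods are hypothetical and stand in for a component whose storage stays UTF-8; this is not LCL code.

```pascal
program WrapperSketch;

{$mode objfpc}{$H+}

type
  // Hypothetical class standing in for a component that stays UTF-8 inside.
  TDemoLabel = class
  private
    FCaption: UTF8String;
  public
    procedure SetCaption(const AValue: UTF8String);
    // Thin UnicodeString wrapper around the UTF-8 method.
    procedure SetCaptionW(const AValue: UnicodeString);
    property Caption: UTF8String read FCaption;
  end;

procedure TDemoLabel.SetCaption(const AValue: UTF8String);
begin
  FCaption := AValue;
end;

procedure TDemoLabel.SetCaptionW(const AValue: UnicodeString);
begin
  // One explicit conversion at the boundary instead of implicit
  // conversions scattered over the call sites.
  SetCaption(UTF8Encode(AValue));
end;

var
  L: TDemoLabel;
  W: UnicodeString;
begin
  W := 'OK';
  L := TDemoLabel.Create;
  try
    L.SetCaptionW(W);
    WriteLn('stored UTF-8 bytes: ', Length(L.Caption));  // 2
  finally
    L.Free;
  end;
end.
```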

Graeme

« Reply #10 on: April 17, 2016, 08:13:57 pm »
Quote
There we use only simple indexed access to codepoints (and ignore surrogate pairs).
Wow! Do you guys not learn at all! Seems there is no hope for some "programmers". :o
--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

Graeme

« Reply #11 on: April 17, 2016, 08:20:16 pm »
Quote
Unless you write a very, very special application that would need to show characters outside of the basic multilingual plane you can take UTF16=UCS2.
Simple indexed access is then possible.
Yeah, clearly Unicode is lost on you and many others too. More and more codepoints above the BMP are being used, so it is becoming more commonplace than you think. Map symbols (many apps have map integration these days), emoticons (God, everybody uses those these days), music symbols (quite popular too). As was shown before... Your thinking is severely broken. Just like this forum software, which is supposed to support Unicode but is quickly shown to be broken and in fact only supports UCS2. UCS2 IS NOT UTF-16!

Quote
UCS2(=UTF16 when neglecting surrogates) is a very good trade-off
I'll add that as a question in my next technical test before we hire a programmer. If you answer Yes to the above, you will NOT be hired by me. If you are that sloppy in string handling, it speaks volumes for how your other code must look too.
« Last Edit: April 17, 2016, 08:24:48 pm by Graeme »

mse

« Reply #12 on: April 17, 2016, 09:51:25 pm »
Quote
I'll add that as a question in my next technical test before we hire a programmer. If you answer Yes to the above, you will NOT be hired by me. If you are that sloppy in string handling, it speaks volumes for how your other code must look too.
Then you lost me as a potential employee. ;-)
Graeme, there are many instances where one can safely ignore surrogate pairs in UTF-16, just as there are cases where one can ignore multibyte codepoints in UTF-8.
For example, one often searches strings for character constants that are known to be in the BMP for UTF-16, or in the ASCII range for UTF-8. And that is absolutely safe. There are no codepoints assigned in the BMP in the range of the surrogate pair codeunits.
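
A minimal sketch of the UTF-8 case, assuming FPC 3.0 or later. It relies on the fact that every byte of a multibyte UTF-8 sequence has its high bit set, so an ASCII byte such as ',' can never occur inside one.

```pascal
program AsciiSearch;

{$mode objfpc}{$H+}

var
  S: RawByteString;
  P: Integer;
begin
  // Raw UTF-8 bytes: U+00E4 (C3 A4), then ',', then U+20AC (E2 82 AC).
  S := #$C3#$A4','#$E2#$82#$AC;

  // A plain byte-wise search for an ASCII constant.
  P := Pos(',', S);
  WriteLn('comma at byte index: ', P);                        // 3

  // Splitting at that index never cuts a multibyte sequence in half.
  WriteLn('left part bytes:  ', Length(Copy(S, 1, P - 1)));   // 2
  WriteLn('right part bytes: ', Length(Copy(S, P + 1, Length(S) - P)));  // 3
end.
```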

Graeme

« Reply #13 on: April 18, 2016, 12:03:24 am »
Quote
Then you lost me as a potential employee. ;-)
Then my company is better off for it.

Quote
Graeme, there are many instances where one can safely ignore surrogate pairs in utf-16 like there are cases where one can ignore multibyte codepoints in utf-8.
Absolute rubbish! You have no idea what the end-user might need or not. For example: I purchased a well-respected professional-grade text editor, because it was renowned for having the best support for Unicode and just about any other encoding you can think of. Coincidentally, the application was written with the latest Delphi.

I used that text editor some 6 months ago to edit a client's data files for a mapping application. Surprise, surprise, the mapping data I worked with uses code points that appear outside the BMP range. I logged a support ticket for that text editor and I was told the text editor actually only supports the BMP (thus UCS2 - NOT Unicode). Their excuse - most spoken languages appear in the BMP range, so they didn't expect anybody to actually use code points outside the BMP range. Well guess what, I needed it, and I'm sure others do too.

I had to use jEdit (written in Java) in the end, which actually does support the full Unicode range. And to rub salt in the wound - jEdit is free and open source!

Again, UCS2 is not Unicode. As a programmer you should know how evil (and flawed) it is to assume. You can't simply assume that the end-user will never use any of the valid (as of Unicode 8.0) 198,864 code points outside the BMP range. The BMP only covers 65,392 code points out of a total of 264,256 [https://en.wikipedia.org/wiki/Plane_%28Unicode%29#Overview]. All of them are valid code points.

It's just sloppy programming.

Martin_fr

« Reply #14 on: April 18, 2016, 01:26:28 am »
Quote
There we use only simple indexed access to codepoints (and ignore surrogate pairs).
Well, it is your choice to ignore surrogates. If you (and/or your boss) know what you are doing, and have good reasons, then you can do that.

But the statement above is misleading by omission.
- You state that you ignore surrogates.
- You should also explicitly mention that "access to codepoints" means exactly that: codepoints. You are *not* interested in characters. Only in codepoints.

You may be aware of it, and you may have meant this when you wrote it that way. But there are lots of readers here who will read your statement and think that character and codepoint are the same. And they are not (not even in UTF-32).
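
A small sketch of why they differ, assuming FPC 3.0 or later: the same visible character "é" can be stored as one codepoint (U+00E9) or as two (the letter e followed by the combining acute accent U+0301), and no surrogates are involved in either case.

```pascal
program CharVsCodePoint;

{$mode objfpc}{$H+}

var
  Precomposed, Decomposed: UnicodeString;
begin
  // One codepoint: U+00E9 (e with acute).
  SetLength(Precomposed, 1);
  Precomposed[1] := WideChar($00E9);

  // Two codepoints: 'e' followed by the combining acute accent U+0301.
  SetLength(Decomposed, 2);
  Decomposed[1] := 'e';
  Decomposed[2] := WideChar($0301);

  WriteLn('precomposed codepoints: ', Length(Precomposed));  // 1
  WriteLn('decomposed codepoints:  ', Length(Decomposed));   // 2

  // Both display as a single character, yet a plain comparison
  // sees two different strings.
  WriteLn('equal: ', Precomposed = Decomposed);
end.
```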

------------------
The other thing is what you want to achieve with "indexed access".

1) Easier to write code?
Maybe, but not really, since plenty of good helper functions exist for UTF-8. It's only about getting used to them.

2) Execution speed?
Not necessarily. If you write efficient UTF-8 code, then this can be faster than the best UTF-16 code (at least with European text). Why? Because if the text is really large, then organizing your data in memory in a way that reduces cache misses has a huge influence on speed. Using more memory for your data increases cache misses and slows the app down.

With Chinese text, UTF-16 (with full support and no indexed access) may then be faster.
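
The memory side of this can be sketched as follows, assuming FPC 3.0 or later. The sample strings and the Report helper are made up for the example, and it measures storage size only, not speed.

```pascal
program SizeCompare;

{$mode objfpc}{$H+}

// Hypothetical helper: prints the storage size of the same text
// in UTF-16 and in UTF-8.
procedure Report(const ALabel: string; const W: UnicodeString);
begin
  WriteLn(ALabel, ': UTF-16 bytes = ', Length(W) * SizeOf(WideChar),
          ', UTF-8 bytes = ', Length(UTF8Encode(W)));
end;

var
  German, Chinese: UnicodeString;
begin
  // 'München': six ASCII letters plus U+00FC.
  German := 'M_nchen';
  German[2] := WideChar($00FC);

  // Two CJK codepoints, U+4E2D and U+6587.
  SetLength(Chinese, 2);
  Chinese[1] := WideChar($4E2D);
  Chinese[2] := WideChar($6587);

  Report('German ', German);   // UTF-16: 14 bytes, UTF-8: 8 bytes
  Report('Chinese', Chinese);  // UTF-16: 4 bytes,  UTF-8: 6 bytes
end.
```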

Of course, achieving 2 means that you lose on 1: you may have to spend more time coding.
« Last Edit: April 18, 2016, 01:39:14 am by Martin_fr »