
Author Topic: What is UTF-8 Application  (Read 25735 times)

skalogryz

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2770
    • havefunsoft.com
Re: What is UTF-8 Application
« Reply #30 on: February 09, 2015, 04:00:45 pm »
I wish pascal "string" would become this kind of opaque type. And with added "encoding" it could (or is it?). Thus the same "string" type could be either ansi, utf8 or utf-16.
Yes, care needs to be taken when utf-16 encoding is used (since string is "1-byte" and utf-16 is a "2-byte" encoding), but in the end both utf-16 and utf-8 are "multibyte" encodings anyway.

felipemdc

  • Administrator
  • Hero Member
  • *
  • Posts: 3538
Re: What is UTF-8 Application
« Reply #31 on: February 09, 2015, 04:03:28 pm »
A framework typically provides the necessary functions for line #3, so by the framework's design you don't have to convert a framework string to a utf-8 set of characters.
Grab the framework string -> do operations via the framework API -> put it back.

The framework cannot possibly predict every single string operation possible. Anyway, I'm like Linus Torvalds here, I don't want the compiler to do automatic conversions behind my back ;) I want control and I want 1 encoding in every operating system. And I don't see why I should put up with anything less than I expect.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12655
  • FPC developer.
Re: What is UTF-8 Application
« Reply #32 on: February 09, 2015, 04:10:57 pm »
I wish pascal "string" would become this kind of opaque type. And with added "encoding" it could (or is it?). Thus the same "string" type could be either ansi, utf8 or utf-16.
Yes, care needs to be taken when utf-16 encoding is used (since string is "1-byte" and utf-16 is a "2-byte" encoding), but in the end both utf-16 and utf-8 are "multibyte" encodings anyway.

(Note that I meant that most code should be programmed as if it were, not that I meant it actually should be. Because then swapping the string basetype on codebases would be a lot easier, and at the same time one could still write e.g. low-level or OS-specific code in the native string type.)


skalogryz

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2770
    • havefunsoft.com
Re: What is UTF-8 Application
« Reply #33 on: February 09, 2015, 04:19:12 pm »
The framework cannot possibly predict every single string operation possible.
A framework should not predict every single string operation possible. If it does, it's a very bad string processing framework.
Instead, a framework should provide the necessary APIs that allow a developer to implement any higher-level operation as efficiently as possible (with a reduced number of memory reallocations).

I don't want the compiler to do automatic conversions behind my back ;) I want control and I want 1 encoding in every operating system. And I don't see why I should put up with anything less than I expect.
The whole 5-year discussion is about what this 1-encoding should be, right?
On the lower level (APIs) the encoding is under the control of the OS.
Even though *nix systems are "utf-8" by default, a developer should keep in mind that it might not be "utf8" after all. (I remember someone gave an example of Austrian governmental desktops running linux or bsd in an ansi encoding?!)
So dealing with encodings is still required, at least these days.

skalogryz

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2770
    • havefunsoft.com
Re: What is UTF-8 Application
« Reply #34 on: February 09, 2015, 04:24:12 pm »
(Note that I meant that most code should be programmed as if it were, not that I meant it actually should be. Because then swapping the string basetype on codebases would be a lot easier, and at the same time one could still write e.g. low-level or OS-specific code in the native string type.)
I agree on that.

I'm not participating in the utf-8 vs utf-16 discussion, because I expect Delphi to win in the end. RTL, FCL and LCL will end up using UnicodeString as "strings" anyway. I've seen too many Delphi-compatible features contributed and accepted by FPC. I expect the same to happen here. I believe LCL just needs to wait for RTL to become fully unicode and then switch.

For any old code or non-unicode targets (e.g. embedded) rawbytestring would be used.
« Last Edit: February 09, 2015, 04:27:10 pm by skalogryz »

felipemdc

  • Administrator
  • Hero Member
  • *
  • Posts: 3538
Re: What is UTF-8 Application
« Reply #35 on: February 09, 2015, 04:39:35 pm »
The whole 5-year discussion is about what this 1-encoding should be, right?

There are many variants which were discussed all these years, the most commonly presented ones are:

1-> RTL/FCL with 2 modes: UTF-8 and UTF-16 modes
2-> UTF-16 everywhere
3-> UTF-8 in UNIX, UTF-16 in Windows

I want #1.

If forced, I think I could live with #2, but I'd have to recheck all my code everywhere strings are used, so it's nasty.

Solution #3 I despise, it would be really bad.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12655
  • FPC developer.
Re: What is UTF-8 Application
« Reply #36 on: February 09, 2015, 04:41:48 pm »

for any old code or non-unicode targets (i.e. embedded) rawbytestring would be used.

Rawbytestring is not really a stringtype that can be used as basetype since it requires manual conversions.

felipemdc

  • Administrator
  • Hero Member
  • *
  • Posts: 3538
Re: What is UTF-8 Application
« Reply #37 on: February 09, 2015, 04:47:47 pm »
So Marco, what will it be in UNIX? string=ansistring or string=unicodestring? Was that already decided?

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12655
  • FPC developer.
Re: What is UTF-8 Application
« Reply #38 on: February 09, 2015, 04:51:30 pm »
So Marco, what will it be in UNIX? string=ansistring or string=unicodestring? Was that already decided?

Nothing is decided. It is one of the points of the discussion where I gave up, which is when I started arguing for utf16 only.

Not having a unicodestring variant still makes it harder for the component builders, so it should at least be possible.

The original multiple release proposal was meant to simply see which of the variants would see the most use, and keep the codebase ready for either. Most *nix specific code is in a few packages and core RTL, and is procedural, so the overloading trick of the RTL could be done there. After the initial investment it would not be that bad.

One of the main advantages of the multiple releases proposal is that it also serves as "keep something working" during the transition period, since it will be long. I always saw the Lazarus utf8 hack-as-default-type in that light.
« Last Edit: February 09, 2015, 04:53:41 pm by marcov »

skalogryz

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2770
    • havefunsoft.com
Re: What is UTF-8 Application
« Reply #39 on: February 09, 2015, 05:11:38 pm »
1-> RTL/FCL with 2 modes: UTF-8 and UTF-16 modes
2-> UTF-16 everywhere
3-> UTF-8 in UNIX, UTF-16 in Windows
Crystal ball suggests #2 will take over eventually. Less maintenance for the fpc and lazarus teams, with the benefit of delphi compat.


taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: What is UTF-8 Application
« Reply #40 on: February 09, 2015, 05:17:30 pm »
Cocoa uses the stupid opaque type:

http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/lcl/interfaces/cocoa/cocoawscommon.pas?view=markup&root=lazarus

function TLCLCommonCallback.KeyEvent(Event: NSEvent): Boolean;

    UTF8Character := NSStringToString(Event.characters);

    if Length(UTF8Character) > 0 then
    begin
      SendChar := True;

      if Utf8Character[1] <= #127 then
        KeyChar := Utf8Character[1];

      // the VKKeyCode is independent of the modifier
      // => use the VKKeyChar instead of the KeyChar
      case VKKeyChar of
        'a'..'z': VKKeyCode:=VK_A+ord(VKKeyChar)-ord('a');
        'A'..'Z': VKKeyCode:=ord(VKKeyChar);
        #27     : VKKeyCode:=VK_ESCAPE;
        #8      : VKKeyCode:=VK_BACK;
        ' '     : VKKeyCode:=VK_SPACE;
        #13     : VKKeyCode:=VK_RETURN;
        '0'..'9':
          case KeyCode of
            MK_NUMPAD0: VKKeyCode:=VK_NUMPAD0;
            MK_NUMPAD1: VKKeyCode:=VK_NUMPAD1;
            MK_NUMPAD2: VKKeyCode:=VK_NUMPAD2;
            MK_NUMPAD3: VKKeyCode:=VK_NUMPAD3;
            MK_NUMPAD4: VKKeyCode:=VK_NUMPAD4;
            MK_NUMPAD5: VKKeyCode:=VK_NUMPAD5;
            MK_NUMPAD6: VKKeyCode:=VK_NUMPAD6;
            MK_NUMPAD7: VKKeyCode:=VK_NUMPAD7;
            MK_NUMPAD8: VKKeyCode:=VK_NUMPAD8;
            MK_NUMPAD9: VKKeyCode:=VK_NUMPAD9
            else VKKeyCode:=ord(VKKeyChar);
          end;
        else

If I don't know the encoding how can I do case VKKeyChar of #27:  ?

Simple, you don't. You never, ever, use hardcoded values in your code; you use consts, making sure that each const gets the correct value for the targeted system.
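A rough sketch of what such a const region might look like. The names KeyEscape, KeyBackspace, KeyReturn and DescribeKey are made up for illustration and are not from the LCL; the values are the ASCII control codes from the case statement quoted above, and the assumption is that only the IFDEF region would change for a target that needs different values.

```pascal
program KeyConstSketch;
{$mode objfpc}{$H+}

const
{$IFDEF DARWIN}
  // Values for the Cocoa target (hypothetical names, ASCII control codes).
  KeyEscape    = #27;
  KeyBackspace = #8;
  KeyReturn    = #13;
{$ELSE}
  // Assumption: other targets use the same values here. A target with a
  // different representation would change only this region.
  KeyEscape    = #27;
  KeyBackspace = #8;
  KeyReturn    = #13;
{$ENDIF}

// The case statement refers to the named constants only, never to literals.
function DescribeKey(C: AnsiChar): string;
begin
  case C of
    KeyEscape:    Result := 'escape';
    KeyBackspace: Result := 'backspace';
    KeyReturn:    Result := 'return';
  else
    Result := 'other';
  end;
end;

begin
  WriteLn(DescribeKey(#27));  // prints "escape"
  WriteLn(DescribeKey('x'));  // prints "other"
end.
```

The code that consumes the keys stays identical on every target; only the const region knows about the platform.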


If you don't know the encoding, you lose all control over string operations, you are basically lost. All my software does string operations, and I *need* to know the encoding to do them perfectly.

In the particular case above we need NSStringToString everywhere, to convert from Opaque to UTF-8.


"You lose all of the control"? How is that? I don't see anything here that a simple ifdefed const region can't solve, and nothing to support your conclusions. What exactly did you lose?

In the example above it's OK, because it's in the LCL, so users are spared from the evil world of unknown encoding and they receive a pure UTF-8 interface. All imperfections are handled by the LCL. Just like any well designed framework would do. So the user code can be smaller, the LCL handles the system differences for the user.

If the LCL didn't handle it, the user would need to convert strings in his own code.

You overcomplicated things, and I'm assuming that the above example was picked in a hurry and is not very representative of your problems.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4680
  • I like bugs.
Re: What is UTF-8 Application
« Reply #41 on: February 10, 2015, 12:35:36 am »
Food for thought:
In
  http://wiki.freepascal.org/UTF8_strings_and_characters#Examples
all of the code snippets would work with both UTF-8 and UTF-16 at least after some wrapper function changes.

The first three use Pos(), Copy() and Length() which are used in typical Delphi code, too.

In the fourth one, iterating Unicode characters, the code should jump over the already handled parts. Then it surely works with any UTF-16 character, too. It would have an interesting side-effect: it would improve robustness of UTF-16 code. Typical Delphi code does not handle 2-word codepoints but this one does.
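A minimal sketch of that "jump over the already handled parts" idea for UTF-8, assuming the string already holds UTF-8 bytes. Utf8SeqLen and Utf8CodepointCount are hypothetical helper names written for this post, not RTL or LCL functions, and no validation of continuation bytes is attempted.

```pascal
program Utf8WalkSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: byte length of the UTF-8 sequence that starts with
// lead byte B. Returns 1 for continuation or invalid bytes so that the
// caller always advances.
function Utf8SeqLen(B: Byte): Integer;
begin
  if B < $80 then Result := 1                  // 0xxxxxxx: ASCII
  else if (B and $E0) = $C0 then Result := 2   // 110xxxxx
  else if (B and $F0) = $E0 then Result := 3   // 1110xxxx
  else if (B and $F8) = $F0 then Result := 4   // 11110xxx
  else Result := 1;                            // stray continuation byte
end;

// Hypothetical helper: counts codepoints by jumping over each whole
// sequence instead of stepping one byte at a time.
function Utf8CodepointCount(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    Inc(I, Utf8SeqLen(Ord(S[I])));
    Inc(Result);
  end;
end;

begin
  // 'a' (1 byte), U+00E4 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
  WriteLn(Utf8CodepointCount('a'#$C3#$A4#$E2#$82#$AC#$F0#$9F#$98#$80));  // prints 4
end.
```

A UTF-16 version would have the same loop shape: the length function would return 2 for a high surrogate ($D800..$DBFF) and 1 otherwise, which is what makes the calling code portable between the two encodings.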

The rest need functions with different names but the semantics are the same.

That covers already many use cases. Source code can be made to support both encodings quite easily. Code for UTF-8 typically works with UTF-16 as is. To the other direction not always.

We will need both versions for LCL. How FPC will support that, let's see. It would not be an "opaque" type but selectable by IFDEF or something.

Anyway, so much energy is again wasted on arguing. People are defending their favorite encoding furiously. For what? The problem could have been solved many times over with that energy.
It reminds me of the infamous SVN versus Git fight that lasted many years. Nobody bothered to check if tools already supported development with Git, they just wanted to argue with somebody.
There already was a Git-mirror and I tested the other development tools. Patch and other tools accepted Git format diffs and Lazarus developers promised to use them. Git was perfectly usable for Lazarus development all the time!
I did my own development already using Git-svn link. I documented both ways to use Git and even promised to support distributed devel model with it.
Nobody had an excuse to fight and complain any more, the problem was solved. The whiners did not offer patches as they kind of had promised, but that is ok.

Maybe I must solve the Unicode issue, too, to stop people wasting their energy on the endless arguing.
:)
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

malcome

  • Jr. Member
  • **
  • Posts: 81
Re: What is UTF-8 Application
« Reply #42 on: February 10, 2015, 12:56:06 am »
Lazarus Team, Create the UTF8 RTL .
FPC Team, Create the UTF16 IDE. stupid guys.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1933
Re: What is UTF-8 Application
« Reply #43 on: February 10, 2015, 12:58:25 am »
Maybe I must solve the Unicode issue, too, to stop people wasting their energy for the endless arguing.

Yes, please!!   ;)

 
