Author Topic: Stop wasting time on FPC 2.6 and Laz 1.4!  (Read 49883 times)

mse

« Reply #15 on: November 01, 2015, 05:22:33 pm »
Quote
UTF-16 surrogate pairs don't have the same inherent properties as UTF-8 multi-byte codepoints have. If you use the fast Pos() etc. functions, it can go wrong sometimes. With UTF-8 it always goes right. Thus UTF-8 is faster in real-world applications when used cleverly.

Really? Please explain where Pos() fails with UTF-16 but does not fail with UTF-8. I didn't know that.

mse

« Reply #16 on: November 01, 2015, 05:29:32 pm »
Quote
Also, I don't see any reason to oppose our UTF-8 solution because it does not take anything away from anybody, yet it solves many problems.

It is a nightmare for German pupils because of the umlauts.

JuhaManninen

« Reply #17 on: November 01, 2015, 05:58:45 pm »
Quote
Really? Please explain where Pos() fails with UTF-16 but does not fail with UTF-8. I didn't know that.

Hmmm... My understanding was that the second word in a surrogate pair can be confused with a single-word codepoint. Is it not so? I tried to search for more info but did not find any ... Uhhh!
If I have understood this wrong, it would not be the first time. I have misunderstood some Unicode detail maybe 20-30 times and have stopped counting already.
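
On the surrogate question above: UTF-16 reserves the code unit ranges D800-DBFF (high surrogates) and DC00-DFFF (low surrogates), and no character is assigned in those ranges, so the second word of a pair should not be confusable with a single-word codepoint. A minimal sketch to check this, written in Python only because it makes the code units easy to inspect; `utf16_units` and the two range helpers are names made up for this illustration:

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def is_high_surrogate(unit):
    # First word of a surrogate pair
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    # Second word of a surrogate pair
    return 0xDC00 <= unit <= 0xDFFF

# U+1F600 lies above the BMP, so UTF-16 needs a surrogate pair for it.
pair = utf16_units("\U0001F600")
# U+00C4 (A with diaeresis) is a BMP character: one code unit.
single = utf16_units("\u00C4")

print([hex(u) for u in pair])
print([hex(u) for u in single])
print(is_high_surrogate(pair[0]), is_low_surrogate(pair[1]))
```

Because the two surrogate ranges are disjoint from each other and from every single-word codepoint, a code-unit-wise search for a well-formed substring cannot start matching in the middle of a pair. UTF-8 has the analogous property: lead bytes and continuation bytes use disjoint bit patterns.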

Quote
It is a nightmare for German pupils because of the umlauts.

The umlaut occupies one codeunit in UTF-16 but more than one in UTF-8. Is this correct?
(Codeunit = 16 bits in UTF-16 and 8 bits in UTF-8).
Then all program code using such text must be correct. No sloppy code. :)
« Last Edit: November 01, 2015, 06:20:15 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

mischi

« Reply #18 on: November 01, 2015, 08:48:12 pm »
Quote
The umlaut occupies one codeunit in UTF-16 but more than one in UTF-8. Is this correct?
(Codeunit = 16 bits in UTF-16 and 8 bits in UTF-8).
Then all program code using such text must be correct. No sloppy code. :)

I think you are wrong. As far as I know, a code point in UTF-8 uses 1, 2, 3 or 4 bytes; in UTF-16, 2 or 4 bytes. Umlauts have another issue, namely two ways of representation:
1) as one code point with 2 bytes in UTF-8 AND UTF-16;
2) as a composition of the character and the combining character two dots (diaeresis).
The latter makes 4 bytes in UTF-16 and 3 bytes in UTF-8. The conversion between the two is called canonicalization (https://en.wikipedia.org/wiki/Canonicalization) or Unicode normalization. Unfortunately, Linux and Mac OS X have chosen different normal forms for their file systems: Mac OS X has chosen the decomposed form according to Unicode NFD normalization, Linux the composed NFC.
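
The byte counts for the two representations of an umlaut can be checked with a few lines. This is only a sketch, in Python rather than Pascal so that the result does not depend on compiler string settings; `byte_len` is a helper invented for the example:

```python
import unicodedata

# Composed form: the single code point U+00C4.
composed = unicodedata.normalize("NFC", "\u00C4")
# Decomposed form: 'A' followed by U+0308 (combining diaeresis).
decomposed = unicodedata.normalize("NFD", "\u00C4")

def byte_len(text, encoding):
    """Number of bytes text occupies in the given encoding (no BOM)."""
    return len(text.encode(encoding))

print(byte_len(composed, "utf-8"), byte_len(composed, "utf-16-le"))
print(byte_len(decomposed, "utf-8"), byte_len(decomposed, "utf-16-le"))

# The two forms look identical on screen but compare unequal
# until both are brought to the same normal form.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The composed form takes 2 bytes in both encodings, the decomposed form 3 bytes in UTF-8 and 4 bytes in UTF-16, which matches the figures given above.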

mse

« Reply #19 on: November 02, 2015, 08:14:35 am »
Pupils often use code like
Code: Pascal
  if thestring[n] = 'Ä' then begin
    ...
  end;
Now explain to them why that doesn't work but
Code: Pascal
  if thestring[n] = 'A' then begin
    ...
  end;
is OK. BTW it is not only pupils who like to use "if thecharacter = 'Ä' then". ;-)
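
The reason for the difference can be shown at the byte level. In the sketch below a Python `bytes` object stands in for a byte-indexed UTF-8 string (an assumption made for illustration, not Lazarus code): 'A' is one byte, 'Ä' is two, so a single element can never equal the umlaut.

```python
text = "\u00C4pfel"          # the word "Äpfel"
utf8 = text.encode("utf-8")  # stands in for a byte-indexed UTF-8 string

umlaut = "\u00C4".encode("utf-8")  # two bytes: C3 84
plain = "A".encode("utf-8")        # one byte: 41

# A single element cannot match a two-byte sequence...
first_unit_matches = utf8[0:1] == umlaut
# ...but comparing a slice of the right length, or searching, works.
slice_matches = utf8[0:len(umlaut)] == umlaut
found_at = utf8.find(umlaut)

# With a plain ASCII letter the single-element test is fine.
ascii_matches = "Apfel".encode("utf-8")[0:1] == plain

print(len(umlaut), len(plain))
print(first_unit_matches, slice_matches, found_at, ascii_matches)
```

In UTF-16 the same umlaut is the single code unit 00C4, which is why the indexed comparison happens to work there for BMP characters.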

Regarding pupils, I often see Lazarus homework code like
Code: Pascal
  thedisplaylabel.text := inttostr(strtoint(theedit1.text) + strtoint(theedit2.text));

I think there should be an initiative to build a set of easy to use dedicated dataedit and datadisplay widgets with a unified interface as MSEgui provides. In MSEgui above code looks like
Code: Pascal
  thedisplay.value:= theedit1.value + theedit2.value;
All data editwidgets have "onsetvalue" with a dedicated "value" parameter and "ondataentered" events:
Code: Pascal
  setintegereventty = procedure(const sender: tobject; var avalue: integer;
                           var accept: boolean) of object;
  setbooleaneventty = procedure(const sender: tobject; var avalue: boolean;
                           var accept: boolean) of object;
  ...
  notifyeventty = procedure (const sender: tobject) of object; //for ondataentered

fpGUI has a similar approach AFAIK.
In my opinion Lazarus often provides suboptimal solutions, probably mostly because of the Delphi compatibility corset and the argument "if it is good enough for Delphi it is good enough for Lazarus". But that is wrong, because for Delphi there are many high-quality third-party component sets available which can be used in place of the very limited original Delphi components.

JuhaManninen

« Reply #20 on: November 02, 2015, 09:03:49 am »
Quote
I think you are wrong. As far as I know, a code point in UTF-8 uses 1, 2, 3 or 4 bytes; in UTF-16, 2 or 4 bytes.

Yes, in other words UTF-8 uses 1, 2, 3 or 4 codeunits for a codepoint, UTF-16 uses 1 or 2 codeunits for a codepoint.
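
A quick way to see those codeunit counts, sketched in Python with a made-up helper `code_units`; the four sample characters are chosen to cover each UTF-8 length:

```python
def code_units(ch):
    """Return (UTF-8 code units, UTF-16 code units) for a one-codepoint string."""
    utf8_units = len(ch.encode("utf-8"))           # 8-bit units
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 16-bit units
    return utf8_units, utf16_units

samples = {
    "A": "\u0041",               # ASCII
    "A-diaeresis": "\u00C4",     # umlaut, in the BMP
    "euro sign": "\u20AC",       # higher up in the BMP
    "U+1F600": "\U0001F600",     # above the BMP
}

for name, ch in samples.items():
    print(name, code_units(ch))
```

Note that the umlaut is 2 codeunits in UTF-8 but 1 codeunit in UTF-16, so both statements in the exchange above hold once bytes and codeunits are kept apart.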

Quote
Umlauts have another issue, namely two ways of representation. 1) as one code point with 2 bytes in UTF-8 AND UTF-16 2) as a composition of the combining character two dots (diaeresis) and the character. The latter makes 4 bytes in UTF-16 and 3 bytes in UTF-8. The conversion is called canonicalization (https://en.wikipedia.org/wiki/Canonicalization) or Unicode normalization. Unfortunately, linux and Mac OS X have chosen different Normal forms for their file systems. Mac OS X has chosen the decomposed form according to the Unicode NFD Normalization, linux the composed NFC.

Yes, that will always be a big pain. It does not depend on encoding though.
I think the Unicode definition should have avoided different ways of representation already from the beginning.

JuhaManninen

« Reply #21 on: November 02, 2015, 10:06:29 am »
@mse, for "if thestring[n] = 'Ä' then" it sure makes a difference.
However this example looks a bit artificial to me. How often does one need to test for a constant umlaut? Typically one searches for a string typed in by the user or read from some other source. Such data can contain any characters, including those encoded using surrogate pairs in UTF-16.

The bottom line is that once your pupils start to make "real" programs, they must do Unicode right.
A simple exercise is another thing of course.

Abelisto

« Reply #22 on: November 02, 2015, 10:30:39 am »
Why are we talking about minor things like "umlauts" when I personally need to work with Cyrillic symbols? :) This thread seems really strange to me, because I have no trouble with multi-encoding text sources. Maybe I missed something. On Linux UTF-8 is natural, and for Windows we have the nice LConvEncoding unit.
OS: Linux Mint + MATE, Compiler: FPC trunk (yes, I am risky!), IDE: Lazarus trunk

Graeme

« Reply #23 on: November 02, 2015, 10:58:01 am »
Quote
It is a nightmare for German pupils because of the umlauts.

Um? Please explain. What exactly is the "nightmare"? fpGUI uses UTF-8 internally, and I have many German, Russian, French, Afrikaans etc. end-users, and nobody has ever complained about UTF-8 causing them problems. On the web, pages are predominantly encoded in UTF-8 too, and again web browsers and other HTTP page processing applications don't seem to have a problem with UTF-8 either.
--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

Graeme

« Reply #24 on: November 02, 2015, 10:59:33 am »
Quote
The umlaut occupies one codeunit in UTF-16 but more than one in UTF-8. Is this correct?

No, you are wrong.

Graeme

« Reply #25 on: November 02, 2015, 11:13:37 am »
Quote
the combining character two dots (diaeresis) and the character. The latter makes 4 bytes in UTF-16 and 3 bytes in UTF-8. The conversion is called canonicalization (https://en.wikipedia.org/wiki/Canonicalization) or Unicode normalization.

Finally, somebody that seems to understand Unicode. :)

Quote
Unfortunately, linux and Mac OS X have chosen different Normal forms for their file systems. Mac OS X has chosen the decomposed form according to the Unicode NFD Normalization, linux the composed NFC.
Yes and no. As far as I know, it isn't the OS that decides the Unicode normalization, but rather the file system. Linux supports multiple file systems. E.g. if you run the ZFS file system under Linux, ZFS allows you to specify which Unicode normalization method you want to use in file or directory name comparisons (along with many other settings, like case sensitivity).

For example: see this ZFS man page. Search for the term "normalization":
  http://manpages.ubuntu.com/manpages/oneiric/man8/zfs.8.html


JuhaManninen

« Reply #26 on: November 02, 2015, 11:22:28 am »
Quote
The umlaut occupies one codeunit in UTF-16 but more than one in UTF-8. Is this correct?
No, you are wrong.

It must be correct if Martin's exercise works.
Note, I wrote "codeunit" which is a 16 bit word in UTF-16.
I think you are referring to the other, decomposed representation with 2 codepoints. Yes, having 2 alternative representations is a real pain.

mse

« Reply #27 on: November 02, 2015, 11:24:08 am »
Quote
It is a nightmare for German pupils because of the umlauts.
Um? Please explain.

Please read the German Lazarus forum, there are many questions about Lazarus and umlauts.
I don't care if Lazarus and fpGUI use UTF-8 or UTF-16 for GUI, I just wanted to point out that experience shows UTF-16 to be much more convenient for GUI. ;-)

marcov

« Reply #28 on: November 02, 2015, 11:25:35 am »
Quote
Pupils often use code like
Code: Pascal
  if thestring[n] = 'Ä' then begin
    ...
  end;

Pupils also often use code like

Code: Pascal
  if somefloat=0.40 then ...

So that means we have to abolish floating point too?

Graeme

« Reply #29 on: November 02, 2015, 11:26:01 am »
Quote
@mse, for "if thestring[n] = 'Ä' then" it sure makes a difference.
However this example looks a bit artificial to me. How often does one need to test for a constant umlaut? Typically one searches for a string typed in by the user or read from some other source. Such data can contain any characters, including those encoded using surrogate pairs in UTF-16.

The bottom line is that once your pupils start to make "real" programs, they must do Unicode right.
A simple exercise is another thing of course.

+1
Such examples are "artificial" indeed. Plus, that example code is broken even for UTF-16 (code-points above the BMP). If you say you are supporting Unicode, then support it ALL, even code-points above the BMP.

Such [broken] examples stem from the fact that Delphi supported AnsiString for so long, and developers got used to the "hack" of referencing characters as string indexes. With Unicode you simply can't do that any more. You need to use functions that correctly find the character you want by using byte offset or code-point lookup in a string. Taking Unicode normalization into account is also crucial for text comparison.
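
As a rough illustration of those two points, here is a sketch in Python; `codepoint_offsets` and `same_text` are hypothetical helper names, not functions from any Pascal library. The first locates code-point boundaries in well-formed UTF-8 by skipping continuation bytes, the second compares text only after normalizing both sides.

```python
import unicodedata

def codepoint_offsets(utf8):
    """Byte offsets at which a code point starts in well-formed UTF-8."""
    # Continuation bytes have the bit pattern 10xxxxxx;
    # every other byte starts a new code point.
    return [i for i, b in enumerate(utf8) if (b & 0xC0) != 0x80]

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# 'a' (1 byte), U+00C4 (2 bytes), U+1F600 (4 bytes), 'z' (1 byte)
data = "a\u00C4\U0001F600z".encode("utf-8")
offsets = codepoint_offsets(data)
print(offsets)

# Composed and decomposed umlaut: different code points, same text.
print("\u00C4" == "A\u0308", same_text("\u00C4", "A\u0308"))
```

The n-th code point then starts at `offsets[n]`, which is the kind of lookup that plain string indexing cannot give you in a variable-width encoding.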