
Author Topic: SelLength incorrect value for text containing characters > $FFFF  (Read 28430 times)

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #60 on: December 10, 2017, 05:53:08 pm »
Quote
Use UTF32 everywhere. Problem solved...
Thaddy, you clearly did not read the thread. The realization was that codepoints and their encodings are not the complicated part of Unicode. Using a fixed width codepoint encoding does not really solve anything.

Munair found a good article about it:
 https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
« Last Edit: December 10, 2017, 05:55:33 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #61 on: December 10, 2017, 06:29:52 pm »
Quote
Is “x” equivalent to “𝗑” or “𝘅” or “𝘹” or “𝙭” or “𝚡” or “ｘ” or “𝐱”?
𝗑 MATHEMATICAL SANS-SERIF SMALL X
𝘅 MATHEMATICAL SANS-SERIF BOLD SMALL X
𝘹 MATHEMATICAL SANS-SERIF ITALIC SMALL X
𝙭 MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL X
𝚡 MATHEMATICAL MONOSPACE SMALL X
ｘ FULLWIDTH LATIN SMALL LETTER X
𝐱 MATHEMATICAL BOLD SMALL X

Let's apply PUCUUTF8UpperCase and see the result:
Quote
IS “X” EQUIVALENT TO “𝗑” OR “𝘅” OR “𝘹” OR “𝙭” OR “𝚡” OR “X” OR “𝐱”?
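
For those who want to reproduce the effect outside Lazarus, here is a minimal sketch in Python (used only for illustration; it does not show how PUCUUTF8UpperCase works internally). It looks up the letters by the names listed above and checks two things: whether a standard uppercase mapping changes them, and what the NFKC compatibility normalization turns them into. The helper name describe is made up for this example.

```python
import unicodedata

# Character names exactly as listed in the post above.
NAMES = [
    "MATHEMATICAL SANS-SERIF SMALL X",
    "MATHEMATICAL SANS-SERIF BOLD SMALL X",
    "MATHEMATICAL SANS-SERIF ITALIC SMALL X",
    "MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL X",
    "MATHEMATICAL MONOSPACE SMALL X",
    "FULLWIDTH LATIN SMALL LETTER X",
    "MATHEMATICAL BOLD SMALL X",
]


def describe(name):
    """Return the codepoint, whether upper() changes it, and its NFKC form."""
    ch = unicodedata.lookup(name)
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "upper_changes": ch.upper() != ch,
        "nfkc": unicodedata.normalize("NFKC", ch),
    }


for name in NAMES:
    print(name, describe(name))
```

The mathematical letters have no case mapping, so an uppercase pass leaves them untouched, while compatibility normalization folds every variant to a plain "x". Whether two of these letters should count as "equivalent" therefore depends on which operation the program applies.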

munair

  • Hero Member
  • *****
  • Posts: 798
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #62 on: December 10, 2017, 07:25:55 pm »
Quote
@Munair, your idea for DB with a language specific ID sounds desperate. It would be more complex and error prone than Unicode. We are just getting rid of locale specific ANSI codepages, let's not bring language specific IDs here.
Besides, Unicode is reality. Let's learn it. Most of its complexity is needed for the various human languages.

In essence, code pages or character sets are not at all bad. Programming becomes simpler when things are predefined. Most languages in the world have a fixed alphabet. Let's take advantage of that and store every language as a character set (together with at least one default font) by means of one universal encoding scheme. Simply do a database call "en" or "cn" to pull out the requested character set at any time. The set could then be applied to anything that has the same language ID. No combining codepoints, no multiples of the same character. Errors could only arise from not tagging texts, documents or interfaces. It would be the same as defining a document's encoding. No one has any problem with that either. Programming would be a thousand times simpler.

Unicode is considered a success because it heavily relies on the BMP. Outside of that things become more complicated. Another advantage of character sets is that documents in any language would have more or less the same size. That would also be something to aim for. Go to China with this idea and you'll be popular in no time.

Most programming languages are not capable of indexing grapheme clusters and are limited to indexing codepoints - which doesn't mean anything. How many programmers feel (un)comfortable with that?
keep it simple

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #63 on: December 10, 2017, 10:46:51 pm »
Quote
Unicode is considered a success because it heavily relies on the BMP. Outside of that things become more complicated.
That is about codepoints and their "planes". It makes a difference with UTF-16 encoding only. For other encodings it is rather irrelevant.
Not a big deal ...
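
A short sketch of that point, in Python for illustration only (the helper name code_units is made up for this example). It counts the code units one codepoint needs in each encoding, for an ASCII letter, a Thai letter inside the BMP, and a codepoint above $FFFF.

```python
def code_units(ch):
    """Count the code units needed for one codepoint in each encoding."""
    return {
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
        "utf32_units": len(ch.encode("utf-32-le")) // 4,
    }


SAMPLES = [
    ("ASCII letter", "x"),
    ("Thai letter inside the BMP", "\u0E17"),
    ("codepoint above $FFFF", "\U0001D5D1"),
]

for label, ch in SAMPLES:
    print(label, code_units(ch))
```

UTF-8 is variable-width already inside the BMP and UTF-32 is always one unit, so only UTF-16 changes behaviour at the $FFFF boundary, where one codepoint becomes a surrogate pair of two 16-bit units. A length counted in UTF-16 code units is then larger than the number of codepoints.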

Quote
Most programming languages are not capable of indexing grapheme clusters and are limited to indexing codepoints - which doesn't mean anything. How many programmers feel (un)comfortable with that?
Why would you index grapheme clusters? Even indexing codepoints is seldom useful as we can read from the article you found.

munair

  • Hero Member
  • *****
  • Posts: 798
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #64 on: December 10, 2017, 11:06:51 pm »
Quote
Why would you index grapheme clusters? Even indexing codepoints is seldom useful as we can read from the article you found.
Sure, do away with any form of... well whatever operations one would need to do on text, cursor movement, selection etc...  ::)

Although I believe you know this, I will spell it out here anyway (for those interested)  :)

Take this Thai character ที่, which consists of three codepoints. I say 'character' as it is perceived as such by the Thai. Now do a UTF-8 string operation on Thai text where this character happens to be near some split point. Without scanning for grapheme clusters, this character may be cut in pieces (which is what most software does):
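Code: FreeBasic
' uleft/uright are codepoint-based left/right functions (not defined in this post)
s = "ที่"
print s
t = uleft(s, 2)        ' split inside the character
s = t + uright(s, 1)   ' put the pieces back together
print s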
The character is put back together again, but may no longer be rendered correctly. So the correct way to handle this is indexing and splitting by grapheme clusters... also explained in the good read. 8)
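
The same idea as a rough sketch in Python (illustration only; the function name clusters is made up, and the string is assumed to be the three codepoints U+0E17, U+0E35, U+0E48). It groups each base codepoint with the combining marks that follow it. This is a simplification: real grapheme segmentation follows the rules of Unicode Standard Annex #29, which cover many more cases.

```python
import unicodedata


def clusters(text):
    """Group each base codepoint with the combining marks that follow it.

    Simplified approximation of grapheme clusters: any codepoint whose
    general category starts with "M" (Mn, Mc, Me) is attached to the
    preceding cluster.
    """
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


s = "\u0E17\u0E35\u0E48"  # assumed codepoints of the Thai character in the post

print(len(s))                  # number of codepoints
print(len(s.encode("utf-8")))  # number of UTF-8 bytes
print(clusters(s))             # one perceived character
print(clusters(s[:2]))         # a split by codepoint index has lost the tone mark
```

Splitting on the boundaries returned by clusters keeps the consonant, vowel sign and tone mark together, whereas a split by codepoint index or byte offset can land in the middle of the character.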
« Last Edit: December 11, 2017, 01:30:44 am by Munair »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #65 on: December 11, 2017, 09:01:20 pm »
Quote
I am convinced it is worth the trouble to change them. A comprehensively consistent naming convention for Lazarus functions is nowhere more applicable than in the confusing world of Unicode and text encoding functionality.
Well, I changed "Character" to "Codepoint" in LazUTF8 function names in r56692. Please take a look.
Some existing code may now throw many "deprecated" warnings. Let's see how many complaints we get.

munair

  • Hero Member
  • *****
  • Posts: 798
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #66 on: December 14, 2017, 02:52:40 pm »
Quote
I am convinced it is worth the trouble to change them. A comprehensively consistent naming convention for Lazarus functions is nowhere more applicable than in the confusing world of Unicode and text encoding functionality.
Well, I changed "Character" to "Codepoint" in LazUTF8 function names in r56692. Please take a look.
Some existing code may now throw many "deprecated" warnings. Let's see how many complaints we get.
Obviously, programmers had to get used to the idea of codepoints during the last several years, given the heavy use of CHLEN instead of CPLEN as identifier throughout the lazutf8.pas file.  ;)

munair

  • Hero Member
  • *****
  • Posts: 798
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #67 on: December 14, 2017, 02:59:48 pm »
BTW, how does UTF8IsCombining work? I mean, during what process is it used? I do not see it anywhere in the lazutf8.pas file, which would mean that the standard UTF-8 routines do not combine codepoints with separated diacritical marks.

UPDATE: OK, found it in lazunicode.pas
« Last Edit: December 14, 2017, 03:04:00 pm by Munair »

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: SelLength incorrect value for text containing characters > $FFFF
« Reply #68 on: January 11, 2018, 03:07:26 pm »
A long time ago, WordPerfect came to DOS. And you could use it to write German and French characters. If you pressed F3, you could see that they were composed of multiple bytes. And it depended on your graphics adapter if they were visible on the screen.

The secretaries had these fancy IBM ball typewriters that could be connected to function as printers, but the problem was that you had to change the letter ball if you wanted different characters. Fortunately, WordPerfect had this fancy new thing called printer drivers, where you could define the translations and control sequences.

And so, one of the first things I did as a (junior) programmer was to make diacritic chars by moving the print position back and printing the ampersands and dots on top of the previous char. Which was extra difficult because those printers knew about proportional printing and variable spacing. But it looked much better than letters printed on a dot matrix printer.

 
