
Author Topic: use special character in ncurses mvaddchar  (Read 3424 times)

Bogen85

  • Hero Member
  • *****
  • Posts: 595
Re: use special character in ncurses mvaddchar
« Reply #15 on: February 18, 2023, 05:09:17 am »
And converting the example to ncurses, it works when using mvaddstr:

Code: Pascal
program unicode_ex2;

{$mode objfpc}
{$h+}
{$codepage utf8}
uses
  initc,
  ncurses,
  sysutils,
  types;

// Bind libc's setlocale so ncurses can pick up the UTF-8 locale from the environment.
procedure setlocale (cat: integer; p: pChar); cdecl; external clib;
const
  LC_ALL = 6;  // value of LC_ALL on Linux (glibc)

// Split a UTF-8 string into displayed characters: each array element holds one
// base code point plus any combining diacritical marks that follow it.
function utf8DisplayedChars (const str_in: string; const withCombiningDiacriticals: boolean = true): TStringDynArray;
  // The defaulted parameters of primary serve as its local working variables.
  procedure primary (const len: integer; offset: integer = 1; n_chars: integer = 0; remaining: integer = 0);
    // Copy the next n_bytes bytes into the result and advance the cursor.
    procedure secondary (const n_bytes: integer);
      begin
        result[n_chars] := copy(str_in, offset, n_bytes);
        inc(offset, n_bytes);
        dec(remaining, n_bytes);
        inc(n_chars);
      end;

    begin
      // Worst case is one displayed character per byte; shrink afterwards.
      setlength(result, len);
      remaining := len;
      // Assumes str_in is valid, complete UTF-8.
      while remaining > 0 do secondary(Utf8CodePointLen(@str_in[offset], remaining, withCombiningDiacriticals));
      setlength(result, n_chars);
    end;

  begin
    result := [];
    primary(length(str_in));
  end;

const
  boo: string = 'ábcdéfghíÁ̊ÅÁǺÁwowe̊́!é';

var
  str: string;
  y:   integer;

begin
  // Must run before initscr so ncurses handles multi-byte output.
  setlocale(LC_ALL, '');
  initscr();

  mvaddstr(1, 3, curses_version);
  mvaddstr(3, 3, pChar(boo));

  // One displayed character per row, followed by its length in bytes.
  y := 5;
  for str in utf8DisplayedChars(boo) do begin
    mvaddstr(y, 2, pChar(format('%s: %d', [str, length(str)])));
    inc(y);
  end;

  mvaddstr(y + 2, 7, pChar('Press any key to exit'));
  getch();
  endwin();
end.

Resulting in:
Code: Text

   ncurses 6.2.20210508

   ábcdéfghíÁ̊ÅÁǺÁwowe̊́!é

  á: 2
  b: 1
  c: 1
  d: 1
  é: 2
  f: 1
  g: 1
  h: 1
  í: 2
  Á̊: 5
  Å: 3
  Á: 3
  Ǻ: 5
  Á: 2
  w: 1
  o: 1
  w: 1
  e̊́: 5
  !: 1
  é: 3


       Press any key to exit


(Also see attached)

« Last Edit: February 18, 2023, 05:17:25 am by Bogen85 »

dbannon

  • Hero Member
  • *****
  • Posts: 2786
    • tomboy-ng, a rewrite of the classic Tomboy
Re: use special character in ncurses mvaddchar
« Reply #16 on: February 18, 2023, 06:54:31 am »

The lengths of those "characters" in that string range from 1 to 5 bytes.

I find that rather surprising: by definition, a UTF8 char is made up of 1 to 4 bytes. What you are seeing there is a sequence of characters, one overwriting the other. That's not, as I understand it, UTF8.

Try pasting one of those 5 byte characters and then backspacing: you remove the overtype with the first backspace! (No one is more surprised at that than me; maybe it is an artifact of the forum's means of displaying it?)

On the other hand, the real UTF8 Å is a single character, two bytes, $C385. Or the one with the acute, Ǻ, $C7BA.


Davo
Lazarus 3, Linux (and reluctantly Win10/11, OSX Monterey)
My Project - https://github.com/tomboy-notes/tomboy-ng and my github - https://github.com/davidbannon

Bogen85

Re: use special character in ncurses mvaddchar
« Reply #17 on: February 18, 2023, 02:09:00 pm »

Quote
The lengths of those "characters" in that string range from 1 to 5 bytes.

I find that rather surprising: by definition, a UTF8 char is made up of 1 to 4 bytes. What you are seeing there is a sequence of characters, one overwriting the other. That's not, as I understand it, UTF8.

Try pasting one of those 5 byte characters and then backspacing: you remove the overtype with the first backspace! (No one is more surprised at that than me; maybe it is an artifact of the forum's means of displaying it?)

On the other hand, the real UTF8 Å is a single character, two bytes, $C385. Or the one with the acute, Ǻ, $C7BA.

Some combinations go even higher than 5 bytes, as you can have any number of diacritical marks.
Yes, this was surprising to me too. But this caused problems for the OP of https://forum.lazarus.freepascal.org/index.php/topic,62150.msg470031.html until I pointed the OP to the diacritical marks.

Take these 3 character combinations:
A  ́   ̊ 

A  ̊   ́ 

Put them next to each other, remove the spaces, and display them. I'm not doing anything special for my terminal to display those as a single combined character.

A proper UTF-8 display application will combine them all into one display character.

If you iterate a UTF-8 string and simply break it apart by 1-4 byte code points you will separate the diacritical markers from the characters they are being combined with, and you will then have problems displaying them correctly.

See https://en.wikipedia.org/wiki/Combining_Diacritical_Marks

You can have a base character followed by some (any?) number of combining diacritical marks.

So while it is true that a single UTF-8 encoded code point (I first wrote "endpoint"; "code point" is the correct term) is 1-4 bytes, a single displayed character can be more. Therefore it is better to use UTF-8 strings to store single displayed characters, because a single displayed character can span several code points and is not limited to a 4 byte maximum.
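To make the splitting problem concrete, here is a minimal sketch that breaks one displayed character apart using only the lead byte of each code point. LeadByteLen and HexBytes are hypothetical helpers written for this illustration (no validation is done), and the string is built from raw byte escapes so the source file encoding does not matter.

```pascal
program naive_split;

{$mode objfpc}
{$h+}

uses
  sysutils;

{ Length in bytes of one UTF-8 encoded code point, judged only by its lead
  byte. Hypothetical helper for this sketch; it does no validation. }
function LeadByteLen(const lead: byte): integer;
begin
  if lead < $80 then result := 1          // 0xxxxxxx: ASCII
  else if lead >= $F0 then result := 4    // 11110xxx
  else if lead >= $E0 then result := 3    // 1110xxxx
  else result := 2;                       // 110xxxxx
end;

{ Hex dump of a byte string, e.g. '41 CC 81'. }
function HexBytes(const s: string): string;
var
  i: integer;
begin
  result := '';
  for i := 1 to length(s) do begin
    if i > 1 then result := result + ' ';
    result := result + IntToHex(ord(s[i]), 2);
  end;
end;

const
  // 'A' + U+0301 (combining acute) + U+030A (combining ring above), raw bytes
  cluster: string = 'A'#$CC#$81#$CC#$8A;

var
  offset, n: integer;

begin
  writeln('whole cluster: ', length(cluster), ' bytes [', HexBytes(cluster), ']');
  // Naive split: one piece per code point, so the marks lose their base letter.
  offset := 1;
  while offset <= length(cluster) do begin
    n := LeadByteLen(ord(cluster[offset]));
    writeln('  piece: ', n, ' byte(s) [', HexBytes(copy(cluster, offset, n)), ']');
    inc(offset, n);
  end;
end.
```

The 5-byte cluster comes out as three pieces (41, then CC 81, then CC 8A). Displayed one at a time, the two marks no longer have the A to attach to.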

« Last Edit: February 18, 2023, 04:09:31 pm by Bogen85 »

Bogen85

Re: use special character in ncurses mvaddchar
« Reply #18 on: February 18, 2023, 02:37:01 pm »
Changing the constant string of characters to:
Code: Pascal
const
  boo: string = 'oauo̥̊ḁ̊ů̥öäüö̥̊ḁ̈̊ü̥̊';

Results in (with ":" changed to "|"):

Code: Text
oauo̥̊ḁ̊ů̥öäüö̥̊ḁ̈̊ü̥̊
o | 1
a | 1
u | 1
o̥̊ | 5
ḁ̊ | 5
ů̥ | 5
ö | 2
ä | 2
ü | 2
ö̥̊ | 6
ḁ̈̊ | 6
ü̥̊ | 6

On my terminal the characters and marks were not separated.

So that is an example with 6 byte "display characters" (each made up of multiple UTF-8 encoded code points).
I'm sure one could go higher than 6. The point is that a "display character" can be a combination of one or more code points, each of which may itself be multi-byte.

It is the "or more" which for me means I will never assume some fixed byte size for displayable characters, and will treat displayable characters as UTF-8 strings containing one or more Unicode code points.

Which works with the ncurses example as well. See attached screenshots for both.

« Last Edit: February 18, 2023, 02:42:12 pm by Bogen85 »

Bogen85

Re: use special character in ncurses mvaddchar
« Reply #19 on: February 18, 2023, 03:13:40 pm »
I'll break it down even further...

Since the forum webpage is not always combining correctly, I've also included a screen shot.

Code: Text
oauo̥̊ḁ̊ů̥öäüö̥̊ḁ̈̊ü̥̊

1:1: o  <-- first length is number of bytes, second length is number of unicode characters
6F - o

1:1: a
61 - a

1:1: u
75 - u

5:3: o̥̊   <-- 5 bytes, made up of 3 Unicode characters
6F CC A5 CC 8A - o....

5:3: ḁ̊    <-- 5 bytes, made up of 3 Unicode characters
61 CC A5 CC 8A - a....

5:3: ů̥
75 CC A5 CC 8A - u....

2:1: ö
C3 B6 - ..

2:1: ä
C3 A4 - ..

2:1: ü
C3 BC - ..

6:3: ö̥̊
C3 B6 CC A5 CC 8A - ......

6:3: ḁ̈̊
C3 A4 CC A5 CC 8A - ......

6:3: ü̥̊   <-- 6 bytes, made up of 3 Unicode characters
C3 BC CC A5 CC 8A - ......
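A breakdown like the one above can be produced with a few lines of Free Pascal. This is only a sketch and not necessarily the tool used for the listing: CodePointCount and HexBytes are hypothetical helpers, the sample string is a shortened stand-in built from raw byte escapes, and a non-positive result from Utf8CodePointLen is simply treated as "stop".

```pascal
program cluster_dump;

{$mode objfpc}
{$h+}

uses
  sysutils;

{ Number of code points in a UTF-8 byte string: every byte that is not a
  continuation byte (10xxxxxx) starts a new code point. }
function CodePointCount(const s: string): integer;
var
  i: integer;
begin
  result := 0;
  for i := 1 to length(s) do
    if (ord(s[i]) and $C0) <> $80 then inc(result);
end;

{ Hex dump of a byte string, e.g. 'C3 BC CC A5'. }
function HexBytes(const s: string): string;
var
  i: integer;
begin
  result := '';
  for i := 1 to length(s) do begin
    if i > 1 then result := result + ' ';
    result := result + IntToHex(ord(s[i]), 2);
  end;
end;

const
  // 'u', then U+00FC (u with diaeresis) + U+0325 + U+030A, as raw bytes
  sample: string = 'u'#$C3#$BC#$CC#$A5#$CC#$8A;

var
  offset, n: SizeInt;
  piece: string;

begin
  offset := 1;
  while offset <= length(sample) do begin
    // true = count trailing combining diacritical marks as part of the piece
    n := Utf8CodePointLen(@sample[offset], length(sample) - offset + 1, true);
    if n <= 0 then break;  // assumption: give up instead of looping forever
    piece := copy(sample, offset, n);
    // bytes:code points, then the piece itself, then its bytes in hex
    writeln(length(piece), ':', CodePointCount(piece), ': ', piece);
    writeln(HexBytes(piece));
    inc(offset, n);
  end;
end.
```

On a UTF-8 terminal this prints a 1:1 entry for the plain u and a 6:3 entry (C3 BC CC A5 CC 8A) for the combined character.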

« Last Edit: February 18, 2023, 03:53:54 pm by Bogen85 »

Bogen85

Re: use special character in ncurses mvaddchar
« Reply #20 on: February 18, 2023, 04:19:26 pm »

Quote
The lengths of those "characters" in that string range from 1 to 5 bytes.

I find that rather surprising: by definition, a UTF8 char is made up of 1 to 4 bytes. What you are seeing there is a sequence of characters, one overwriting the other. That's not, as I understand it, UTF8.

And the overwriting (combining) is part of the standard, as described at the link I already provided: https://en.wikipedia.org/wiki/Combining_Diacritical_Marks

Free Pascal's Utf8CodePointLen can take the combining marks into account. See: https://www.freepascal.org/docs-html/rtl/system/utf8codepointlen.html

And for the implementation of Utf8CodePointLen, see the following:
 https://gitlab.com/freepascal.org/fpc/source/-/blob/ffa14ee4485dbb452fe4a89b9c7a6340ea359c7f/rtl/inc/generic.inc#L1147
 https://gitlab.com/freepascal.org/fpc/source/-/blob/ffa14ee4485dbb452fe4a89b9c7a6340ea359c7f/rtl/inc/generic.inc#L1089

« Last Edit: February 18, 2023, 04:47:09 pm by Bogen85 »

Bogen85

Re: use special character in ncurses mvaddchar
« Reply #21 on: February 18, 2023, 04:50:57 pm »
From https://www.freepascal.org/docs-html/rtl/system/utf8codepointlen.html :

Quote
Description
Utf8CodePointLen returns the length of the UTF-8 codepoint starting at the beginning of P. It will look at at most MaxLookAhead bytes to do create this codepoint. If IncludeCombiningDiacriticalMarks is true, combining diacritical marks trailing the first codepoint (which itself can also be such a mark) will be considered to be part of the codepoint.

While technically that might not be part of the standard (considering the combined codepoints as a single codepoint), it is part of the standard that multiple codepoints can be combined to form a single displayed character with marks on it.
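The effect of IncludeCombiningDiacriticalMarks can be seen by measuring the same position in a string twice. A minimal sketch, assuming a Free Pascal version that ships Utf8CodePointLen in the System unit; MeasuredLen is a hypothetical wrapper added only for readability.

```pascal
program codepointlen_flag;

{$mode objfpc}
{$h+}

{ Hypothetical wrapper: length in bytes of whatever starts at the beginning
  of s, with or without its trailing combining marks. }
function MeasuredLen(const s: string; const withMarks: boolean): SizeInt;
begin
  result := Utf8CodePointLen(pChar(s), length(s), withMarks);
end;

const
  // 'e' + U+030A (combining ring above) + U+0301 (combining acute), then '!'
  s: string = 'e'#$CC#$8A#$CC#$81'!';

begin
  // Flag off: only the base letter is measured.
  writeln('single code point:    ', MeasuredLen(s, false));
  // Flag on: the two trailing marks are counted as well.
  writeln('with combining marks: ', MeasuredLen(s, true));
end.
```

This should report 1 byte for the bare code point and 5 bytes with the marks included, matching the "e̊́: 5" row in the ncurses output earlier in the thread.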
« Last Edit: February 18, 2023, 05:03:44 pm by Bogen85 »

Bogen85

Re: use special character in ncurses mvaddchar
« Reply #22 on: February 18, 2023, 04:55:28 pm »
All of this may have seemed like a deviation from the topic "use special character in ncurses mvaddchar".

However, the point of the deviation is that displaying a special character with some "add character" function might not work for many Unicode character combinations that follow the standard.

This is not a problem specific to ncurses.
It is a problem whenever a Unicode string is broken up into subparts and those subparts must stay intact as the displayed characters they represent.

dbannon

Re: use special character in ncurses mvaddchar
« Reply #23 on: February 19, 2023, 12:56:12 pm »
While not disputing anything you say here, Bogen85, the point is that you are NOT displaying a UTF8 character under these circumstances. (It's decomposed Unicode?)

While it might look similar to a UTF8 character, the bytes stored in the string have no relationship to the bytes stored when a real UTF8 character is involved. So, if you know that some component, some display system, something can display UTF8, you cannot assume that it can display the overwritten method as well. And vice versa.
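The byte-level difference between the two forms is easy to check. A small sketch, with HexBytes as a hypothetical helper: it compares the precomposed Å mentioned earlier ($C385) with the sequence A + U+030A. As far as the encoding goes, both are valid UTF-8; what differs is which code points are being encoded.

```pascal
program composed_vs_decomposed;

{$mode objfpc}
{$h+}

uses
  sysutils;

{ Hex dump of a byte string, e.g. '41 CC 8A'. }
function HexBytes(const s: string): string;
var
  i: integer;
begin
  result := '';
  for i := 1 to length(s) do begin
    if i > 1 then result := result + ' ';
    result := result + IntToHex(ord(s[i]), 2);
  end;
end;

const
  precomposed: string = #$C3#$85;      // U+00C5, one code point
  decomposed:  string = 'A'#$CC#$8A;   // U+0041 + U+030A, two code points

begin
  writeln('precomposed: ', HexBytes(precomposed), ' (', length(precomposed), ' bytes)');
  writeln('decomposed:  ', HexBytes(decomposed), ' (', length(decomposed), ' bytes)');
  // A plain string comparison works on bytes, so the two are not equal.
  writeln('byte-for-byte equal: ', precomposed = decomposed);
end.
```

So a plain string comparison or search will not match one form against the other, even when both render as the same glyph.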

As you noted, the forum does a less than perfect job of displaying overwritten characters, but it does a perfect job of displaying UTF8 characters. My guess is that what we see with overwritten characters may be browser (or WebKit) dependent.

I tend to use https://www.utf8-chartable.de as my reference to UTF8 characters, there you will find all you can imagine (and a few more) without overwriting anything.

Davo

edit : typo


« Last Edit: February 19, 2023, 01:13:26 pm by dbannon »

Bogen85

Re: use special character in ncurses mvaddchar
« Reply #24 on: February 19, 2023, 01:21:06 pm »
@dbannon, and I'm not disputing what you are saying.

I included all I said in the context of displaying Unicode characters from a string, especially if one is to take apart said string and display the characters individually. That can cause problems. The point of what I'm trying to say is that if you break up a string into its displayable characters, you should take into account the composite parts (which may be more than one Unicode character). Therefore, a fixed-size single character representation (2 bytes, 4 bytes, 8 bytes) is not going to work; the composite parts will need to be strings.

If your display system can't display those composite parts correctly (not every Unicode-aware display system is totally compliant), then yes, there is a problem, and it is not always solvable (an exact equivalent single character will not always be found).

I've worked with individuals who work with language scripts where those languages are not in the unicode standard. As such, they rely on diacritical marker combining that is supported by the unicode standard. While I only recently discovered this "feature" for myself in unicode I'd seen others use it before.

As was already mentioned in this topic/thread, string output should be used for individual display characters (which may be a composite of multiple Unicode characters), not only because of the composite issues, but also because which encoding are you to choose for individual characters? UTF-8? UTF-16? Full 32 bit?

Quote
I tend to use https://www.utf8-chartable.de as my reference to UTF8 characters, there you will find all you can imagine (and a few more) without overwriting anything.

That is an extremely small set of Unicode characters and displayable combinations. Sure, it is fine for many who are just working with Western European languages, but not much beyond that.
« Last Edit: February 19, 2023, 01:47:05 pm by Bogen85 »

dbannon

Re: use special character in ncurses mvaddchar
« Reply #25 on: February 19, 2023, 11:59:50 pm »
Quote
As was already mentioned in this topic/thread, string output should be used for individual display characters (which may be a composite of multiple unicode characters), not only for the composite issues, but because what encoding are you to chose for individual characters? utf-8? utf-16? full 32 bit?
Definitely agree. If I know I am dealing with only UTF8 (and as a Linux user and a Lazarus user, that's an easy choice), one character at a time, then it is
Code: Pascal
type Tutf8Char = string[4];

Otherwise, it's ansistring and the FPC/Lazarus developers are looking after me.
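A quick sketch of what a string[4] type can and cannot hold: any single UTF-8 encoded code point fits, but a base letter with combining marks may not. StoredLen is a hypothetical helper for this illustration; the strings are written as raw byte escapes.

```pascal
program utf8char_fit;

{$mode objfpc}
{$h+}

type
  Tutf8Char = string[4];   // short string, at most 4 bytes

{ Hypothetical helper: how many bytes of s survive in a Tutf8Char. }
function StoredLen(const s: string): integer;
var
  c: Tutf8Char;
begin
  c := s;   // assignment to a short string truncates anything beyond 4 bytes
  result := length(c);
end;

const
  euro:    string = #$E2#$82#$AC;          // U+20AC, one 3-byte code point
  cluster: string = 'e'#$CC#$8A#$CC#$81;   // 'e' + U+030A + U+0301, 5 bytes

begin
  writeln('euro:    stored ', StoredLen(euro), ' of ', length(euro), ' bytes');
  writeln('cluster: stored ', StoredLen(cluster), ' of ', length(cluster), ' bytes');
end.
```

The euro sign is stored whole (3 of 3 bytes), while the cluster loses its last byte (4 of 5), which cuts the second combining mark in half. That is the case where falling back to a full string is the safer choice.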
Quote
I tend to use https://www.utf8-chartable.de as my reference to UTF8 characters, there you will find all you can imagine (and a few more) without overwriting anything.

That is an extremely small set of unicode characters and displayable combinations. Sure, it is fine for many who are just working with Western European languages, but not much beyond that.
MONGOLIAN, PHOENICIAN, MEROITIC HIEROGLYPHIC, CHEROKEE, MYANMAR, THAI, .....

In the "control panel", top of each page, third control down, "try other block". It's actually a very large set. And, yes, of course it includes $f09382ba, much loved by first year archeology students.

Davo

Bogen85

Re: use special character in ncurses mvaddchar
« Reply #26 on: February 20, 2023, 02:04:55 am »
Quote
MONGOLIAN, PHOENICIAN, MEROITIC HIEROGLYPHIC, CHEROKEE, MYANMAR, THAI, .....
In the "control panel", top of each page, third control down, "try other block". It's actually a very large set. And, yes, of course it includes $f09382ba, much loved by first year archeology students.

I see. Thanks, I overlooked that.

There are many languages (many is relative) that have less than a few thousand speakers. For the most part they can use existing scripts where all the needed character combinations are covered by existing Unicode blocks. Some of the languages that are not covered sometimes need to use combining diacritical marks.

For instance, the Navajo language (170K speakers?, more than a few thousand) is not in the blocks on the site you mentioned (at least I did not find one).

https://unicode.org/faq/char_combmark.html#12
Quote
The Navajo-specific question below is also applicable to a wide variety of similar cases.

Q: Unicode doesn't contain some of the precomposed characters needed for Navajo and other indigenous languages of the Americas. Will you add them?

The way to encode the various Navajo letters with diacritics is with the use of combining marks. For example, Navajo high-toned nasalized vowels:

a + ogonek + acute = <U+0061, U+0328, U+0301> ( ǫ́ )

and so on for the other vowels.
That is also mentioned here: https://itch.io/t/503554/adding-a-navajo-ogonek-to-combining-diacritical-marks-extended-unicode-block
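The sequence from the FAQ can be built from its code points to see how many bytes it takes in UTF-8. A minimal sketch: EncodeUtf8 and HexBytes are hypothetical helpers written for this example (the encoder does no range or surrogate checks).

```pascal
program navajo_sequence;

{$mode objfpc}
{$h+}

uses
  sysutils;

{ UTF-8 encoding of one code point. Hypothetical helper for this sketch;
  no checks for surrogates or values above U+10FFFF. }
function EncodeUtf8(const cp: cardinal): string;
begin
  if cp < $80 then
    result := chr(cp)
  else if cp < $800 then
    result := chr($C0 or (cp shr 6)) + chr($80 or (cp and $3F))
  else if cp < $10000 then
    result := chr($E0 or (cp shr 12)) + chr($80 or ((cp shr 6) and $3F))
            + chr($80 or (cp and $3F))
  else
    result := chr($F0 or (cp shr 18)) + chr($80 or ((cp shr 12) and $3F))
            + chr($80 or ((cp shr 6) and $3F)) + chr($80 or (cp and $3F));
end;

{ Hex dump of a byte string, e.g. '61 CC A8'. }
function HexBytes(const s: string): string;
var
  i: integer;
begin
  result := '';
  for i := 1 to length(s) do begin
    if i > 1 then result := result + ' ';
    result := result + IntToHex(ord(s[i]), 2);
  end;
end;

const
  // a + ogonek + acute, the sequence from the FAQ quote: U+0061, U+0328, U+0301
  codePoints: array[0..2] of cardinal = ($0061, $0328, $0301);

var
  cp: cardinal;
  s: string;

begin
  s := '';
  for cp in codePoints do s := s + EncodeUtf8(cp);
  writeln(length(s), ' bytes [', HexBytes(s), ']');
end.
```

The three code points encode to 5 bytes (61 CC A8 CC 81), so this one displayed letter already exceeds the 4 byte limit of a single UTF-8 encoded code point.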

« Last Edit: February 20, 2023, 02:06:58 am by Bogen85 »

dbannon

Re: use special character in ncurses mvaddchar
« Reply #27 on: February 20, 2023, 08:15:40 am »

Quote
There are many languages (many is relative) that have less than a few thousand speakers.

Sadly, in my country there are many languages spoken by as few as half a dozen people. The Australian Aboriginal languages are both diverse and very, very endangered. I was asked to look at a problem some time ago where a project to record several then-endangered languages had its data on a proprietary portable disk system that no one had a working drive for any more. Nothing could be done.

So, keeping records is important, even if it has to use decomposed characters!

Davo