Author Topic: Extended ASCII use - 2  (Read 12248 times)

tetrastes

  • Sr. Member
  • ****
  • Posts: 473
Re: Extended ASCII use - 2
« Reply #30 on: January 12, 2022, 10:05:31 am »
Just curious, and having no time to read the sources right now: if on Unix we have to use the C library's UnicodeStringManager (uses cwstring), is there overhead from converting the 2-byte UnicodeChar to the 4-byte wchar_t?

tetrastes

  • Sr. Member
  • ****
  • Posts: 473
Re: Extended ASCII use - 2
« Reply #31 on: January 12, 2022, 11:05:14 am »
Right, and that is what raymond wants.
By the way, it seems that the text-mode Free Pascal IDE (fp) on Linux does what he wants, at least on my Debian 10 with FPC 3.2.0. It uses CP437 (though not CP850). Opening my sources, I see pseudo-graphics instead of UTF-8 letters, and Tools -> Ascii table shows the CP437 table.
I don't understand this decision; it must take some effort not to use UTF-8 on a UTF-8 system.  :-\
Not to mention that it makes fp rather useless if you need non-ASCII output from your apps.
« Last Edit: January 12, 2022, 11:14:48 am by tetrastes »

raymond

  • New member
  • *
  • Posts: 7
Re: Extended ASCII use - 2
« Reply #32 on: January 12, 2022, 11:24:37 am »
Thank you for all the help and comments. Having used the logical simplicity and elegance of code page 850 for at least two decades, it's easy to be deceived by UTF-8, and difficult to understand why a value between hex 00 and hex FF gets compiled into two bytes. It would be tempting to convert back to CP850, but there is the issue of the Linux (Mint) environment variables, let alone the editor (gedit). Bear in mind that in most 'natural' European-language texts there is an unpredictable mixing of >127 and <128 characters. It looks as though I will have to fudge a solution for my programming purposes. Thank you once again.
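
For illustration (my sketch, not raymond's code), this is the two-byte effect he describes; it assumes FPC's default behaviour of copying string literals byte-for-byte when no {$codepage} directive is given:

Code: Pascal
var
  s: AnsiString;
begin
  s := 'é';            { source file saved as UTF-8 }
  WriteLn(Length(s));  { prints 2: the UTF-8 bytes $C3 $A9, not the single CP850 byte $82 }
end.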

tetrastes

  • Sr. Member
  • ****
  • Posts: 473
Re: Extended ASCII use - 2
« Reply #33 on: January 12, 2022, 12:01:51 pm »
I am sorry, but I don't understand what "the logical simplicity and elegance of code page 850" is, and why it is better than CP437, or CP1252, or ISO 8859-1, for example. I think there may be problems with CP850 on Linux, as with the other DOS and Windows code pages, so if you want a single-byte code page, try ISO 8859-1 (or 8859-15). They are native to Unix, and once upon a time they were the default.
« Last Edit: January 12, 2022, 12:04:56 pm by tetrastes »

PascalDragon

  • Hero Member
  • *****
  • Posts: 5446
  • Compiler Developer
Re: Extended ASCII use - 2
« Reply #34 on: January 12, 2022, 01:53:54 pm »
Just curious, and having no time to read the sources right now: if on Unix we have to use the C library's UnicodeStringManager (uses cwstring), is there overhead from converting the 2-byte UnicodeChar to the 4-byte wchar_t?

Not in the sense you think, because essentially no POSIX API expects wchar_t. Thus the TUnicodeStringManager never needs to convert from UTF-16 to UTF-32. But it does need to convert from UTF-16 to UTF-8 (assuming the system is set to UTF-8, which is essentially the default nowadays).
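
For a concrete picture (my illustration, not part of PascalDragon's reply), this is the kind of re-encoding that happens before a string reaches the OS; UTF8Encode is a standard FPC routine:

Code: Pascal
var
  u: UnicodeString;
  a: UTF8String;
begin
  u := 'Grüße';         { held in memory as UTF-16 code units }
  a := UTF8Encode(u);   { re-encoded as UTF-8 bytes for the POSIX API }
end.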

Right, and that is what raymond wants.
By the way, it seems that the text-mode Free Pascal IDE (fp) on Linux does what he wants, at least on my Debian 10 with FPC 3.2.0. It uses CP437 (though not CP850). Opening my sources, I see pseudo-graphics instead of UTF-8 letters, and Tools -> Ascii table shows the CP437 table.
I don't understand this decision; it must take some effort not to use UTF-8 on a UTF-8 system.  :-\
Not to mention that it makes fp rather useless if you need non-ASCII output from your apps.

Free Vision currently simply does not support Unicode. There is work underway (and more or less already finished) that extends Free Vision, and thus also the text-mode IDE, with Unicode support.

Thank you for all the help and comments. Having used the logical simplicity and elegance of code page 850 for at least two decades, it's easy to be deceived by UTF-8, and difficult to understand why a value between hex 00 and hex FF gets compiled into two bytes. It would be tempting to convert back to CP850, but there is the issue of the Linux (Mint) environment variables, let alone the editor (gedit). Bear in mind that in most 'natural' European-language texts there is an unpredictable mixing of >127 and <128 characters. It looks as though I will have to fudge a solution for my programming purposes. Thank you once again.

There needs to be some kind of encoding, because Unicode has more than $FF characters. Code points below $80 are stored as-is (which is just ASCII), but anything greater in the first byte encodes how many further bytes follow, as UTF-8 represents a code point with 1 to 4 bytes. This way the full range of Unicode code points can be used without wasting space on the many zero bytes that UTF-16 and UTF-32 need (assuming the text in question is mainly ASCII).
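
A minimal sketch of that rule (mine, not PascalDragon's; the function name is made up for this example):

Code: Pascal
{ How many bytes UTF-8 needs for a given code point. }
function Utf8SequenceLength(CodePoint: Cardinal): Integer;
begin
  if CodePoint < $80 then
    Result := 1   { 0xxxxxxx - plain ASCII }
  else if CodePoint < $800 then
    Result := 2   { 110xxxxx 10xxxxxx }
  else if CodePoint < $10000 then
    Result := 3   { 1110xxxx 10xxxxxx 10xxxxxx }
  else
    Result := 4;  { 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx }
end;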

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: Extended ASCII use - 2
« Reply #35 on: January 12, 2022, 02:57:16 pm »
Now if only there were a Unicode variant where each glyph took up the same amount of space...

UTF-32 sounds great, but even it is not fixed-length, because of combining characters and grapheme clusters. Also, while it seems wasteful to users of the Latin alphabet, because most of their glyphs need just 7 bits, users of many other languages think it wastes space because only 21 of the 32 bits are ever used: they need multiple 32-bit 'chars' to build the grapheme clusters their written languages use. And the information density of UTF-8 is even worse.
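
A small illustration of that point (mine, not SymbolicFrank's): even in UTF-32, one on-screen glyph can need several code units.

Code: Pascal
const
  { 'e' followed by a combining acute accent: two UCS4 values, one glyph. }
  Cluster: array[0..1] of UCS4Char = (UCS4Char($0065), UCS4Char($0301));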

But because it is now the standard, don't hold your breath waiting for a new UTF-32 that actually uses all those bits to code for all the possible glyphs / grapheme clusters.

So, overall, Unicode does work, but just barely. Technically it's a jack of all trades, master of none. Best example: sorting.
« Last Edit: January 12, 2022, 02:58:58 pm by SymbolicFrank »

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: Extended ASCII use - 2
« Reply #36 on: January 13, 2022, 11:18:02 am »
I made a nice example:

Code: Pascal
'Ḝ' <> 'Ḝ' <> 'Ḝ'

They might look the same, but they aren't. Depending on your sort order and how the comparison is done, they might end up together, but they might not.

Unicode and the applications that display the resulting glyphs are strangely inconsistent as to what can be combined into what.
« Last Edit: January 13, 2022, 11:19:51 am by SymbolicFrank »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Extended ASCII use - 2
« Reply #37 on: January 13, 2022, 05:49:44 pm »
I made a nice example:

Code: Pascal
'Ḝ' <> 'Ḝ' <> 'Ḝ'

They might look the same, but they aren't. Depending on your sort order and how the comparison is done, they might end up together, but they might not.

Unicode and the applications that display the resulting glyphs are strangely inconsistent as to what can be combined into what.

To compare them, try using:
Code: Pascal
uses
  unicodedata;

function Test: boolean;
var
  s, o: array of string;
  i: integer;
begin
  Result := False;
  s := ['Ḝ', 'Ḝ', 'Ḝ'];
  SetLength(o, Length(s));
  for i := Low(s) to High(s) do
  begin
    { Decompose to NFD so visually identical forms compare equal. }
    o[i] := NormalizeNFD(s[i]);
    if o[Low(s)] <> o[i] then
      exit;
  end;
  Result := True;
end;

tetrastes

  • Sr. Member
  • ****
  • Posts: 473
Re: Extended ASCII use - 2
« Reply #38 on: January 14, 2022, 01:07:21 pm »
Just curious, and having no time to read the sources right now: if on Unix we have to use the C library's UnicodeStringManager (uses cwstring), is there overhead from converting the 2-byte UnicodeChar to the 4-byte wchar_t?

Not in the sense you think, because essentially no POSIX API expects wchar_t. Thus the TUnicodeStringManager never needs to convert from UTF-16 to UTF-32. But it does need to convert from UTF-16 to UTF-8 (assuming the system is set to UTF-8, which is essentially the default nowadays).

It seemed strange to me that there would be an unused type, so I looked into cwstring.pp and found some functions, POSIX or not, that use wchar_t. And as they are used in cwstring.pp, there is overhead in the sense I meant. For example:
Code: Pascal
function wcscoll (__s1:pwchar_t; __s2:pwchar_t):cint;cdecl;external clib name 'wcscoll';

...

function CompareWideString(const s1, s2 : WideString; Options : TCompareOptions) : PtrInt;
{$if not(defined (aix) and defined(cpupowerpc32))}
  var
    hs1,hs2 : UCS4String;
    us1,us2 : WideString;

  begin
    { wcscoll interprets null chars as end-of-string -> filter out }
    if coIgnoreCase in Options then
      begin
      us1:=UpperWideString(s1);
      us2:=UpperWideString(s2);
      end
    else
      begin
      us1:=s1;
      us2:=s2;
      end;
    hs1:=WideStringToUCS4StringNoNulls(us1);
    hs2:=WideStringToUCS4StringNoNulls(us2);
    result:=wcscoll(pwchar_t(hs1),pwchar_t(hs2));
  end;
{$else}
  { AIX/PPC32 has a 16 bit wchar_t }

where WideStringToUCS4StringNoNulls converts the UTF-16 string to UTF-32, naturally.

Thaddy

  • Hero Member
  • *****
  • Posts: 14205
  • Probably until I exterminate Putin.
Re: Extended ASCII use - 2
« Reply #39 on: January 14, 2022, 01:13:10 pm »
Note that it is a misunderstanding that a UTF-16 char is always two bytes; it can be 4 too.
Only UCS-2 is always two bytes. The problem is that Delphi declared its type before the UTF-16 standard was extended, and FPC adheres to the Delphi declaration. And depending on the platform, wchar_t can correspond to UTF-8, UTF-16 or UTF-32.
Sooner or later this will lead to problems.
Note that UCS-2 once equalled UTF-16, but nowadays it is simply the 2-byte subset of UTF-16.
The whole issue is surrogate pairs, which can make a character 4 bytes long.
https://en.wikipedia.org/wiki/UTF-16 clearly states that UTF-16 is a variable-length encoding.
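
A minimal sketch of the surrogate-pair mechanism (my illustration, not Thaddy's code):

Code: Pascal
{ Combine a UTF-16 high surrogate ($D800..$DBFF) and low surrogate
  ($DC00..$DFFF) into one code point in $10000..$10FFFF. }
function DecodeSurrogatePair(Hi, Lo: Word): Cardinal;
begin
  Result := $10000 + ((Cardinal(Hi) - $D800) shl 10) + (Cardinal(Lo) - $DC00);
end;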
« Last Edit: January 14, 2022, 01:36:58 pm by Thaddy »
Specialize a type, not a var.

tetrastes

  • Sr. Member
  • ****
  • Posts: 473
Re: Extended ASCII use - 2
« Reply #40 on: January 14, 2022, 01:46:44 pm »
cwstring.pp knows that. There is code in WideStringToUCS4StringNoNulls for converting surrogate pairs.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: Extended ASCII use - 2
« Reply #41 on: January 14, 2022, 04:55:23 pm »
To compare them, try using:

Very interesting, thanks!

PascalDragon

  • Hero Member
  • *****
  • Posts: 5446
  • Compiler Developer
Re: Extended ASCII use - 2
« Reply #42 on: January 14, 2022, 09:01:22 pm »
Just curious, and having no time to read the sources right now: if on Unix we have to use the C library's UnicodeStringManager (uses cwstring), is there overhead from converting the 2-byte UnicodeChar to the 4-byte wchar_t?

Not in the sense you think, because essentially no POSIX API expects wchar_t. Thus the TUnicodeStringManager never needs to convert from UTF-16 to UTF-32. But it does need to convert from UTF-16 to UTF-8 (assuming the system is set to UTF-8, which is essentially the default nowadays).

It seemed strange to me that there would be an unused type, so I looked into cwstring.pp and found some functions, POSIX or not, that use wchar_t. And as they are used in cwstring.pp, there is overhead in the sense I meant. For example:
Code: Pascal
function wcscoll (__s1:pwchar_t; __s2:pwchar_t):cint;cdecl;external clib name 'wcscoll';

...

function CompareWideString(const s1, s2 : WideString; Options : TCompareOptions) : PtrInt;
{$if not(defined (aix) and defined(cpupowerpc32))}
  var
    hs1,hs2 : UCS4String;
    us1,us2 : WideString;

  begin
    { wcscoll interprets null chars as end-of-string -> filter out }
    if coIgnoreCase in Options then
      begin
      us1:=UpperWideString(s1);
      us2:=UpperWideString(s2);
      end
    else
      begin
      us1:=s1;
      us2:=s2;
      end;
    hs1:=WideStringToUCS4StringNoNulls(us1);
    hs2:=WideStringToUCS4StringNoNulls(us2);
    result:=wcscoll(pwchar_t(hs1),pwchar_t(hs2));
  end;
{$else}
  { AIX/PPC32 has a 16 bit wchar_t }

where WideStringToUCS4StringNoNulls converts the UTF-16 string to UTF-32, naturally.

*shrugs* That's how it is. But UnicodeString isn't used that often on Unix anyway. Lazarus uses UTF-8 and that goes through the AnsiString routines.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: Extended ASCII use - 2
« Reply #43 on: January 15, 2022, 01:38:06 pm »
It seems that many applications, like Word and OpenOffice, only render combining characters correctly when a precomposed character that looks the same already exists. They can only display a single glyph at any location, while a browser stacks the parts with an offset. In other words, what you see depends on the rendering engine used.

Is there one that actually combines them as intended? LaTeX?

But that also means that each application (depending on the Unicode tables and rendering engine used) has its own Unicode subset, which might or might not look the same as any of the others when put on the screen or printer.

And I think the best way to compare Unicode chars would be to split them into the base shape and the separate combining characters. Then again, that would require expanding those, as there are 'attachments' not covered by them.

OK, that would make it even harder to determine how much storage space you need to reserve.

Actually, the best way to store them would probably be something like Huffman encoding (7-Zip etc.): expand each character you come across, keep a list, and store only the index in your string or table. That way, characters are the same whenever they look the same, and each one fits in a single 32-bit value. Then always display them multi-pass, the parts on top of each other.

I think that's how Unicode should have been.
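
A minimal sketch of that interning idea (my illustration under SymbolicFrank's assumptions; the names are made up):

Code: Pascal
uses
  Classes;

{ Intern a grapheme cluster: the text itself would store only the
  returned 32-bit index, and equal-looking clusters share one entry. }
function InternCluster(Table: TStringList; const Cluster: string): LongWord;
var
  Idx: Integer;
begin
  Idx := Table.IndexOf(Cluster);
  if Idx < 0 then
    Idx := Table.Add(Cluster);  { first occurrence: extend the table }
  Result := LongWord(Idx);
end;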

On the other hand, that won't fix the sorting problem: you still need a separate table for each language, although you could limit those to only the base shapes and the attachments that make a difference.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Extended ASCII use - 2
« Reply #44 on: January 15, 2022, 03:17:07 pm »
It seems that many applications, like Word and OpenOffice, only render combining characters correctly when a precomposed character that looks the same already exists. They can only display a single glyph at any location, while a browser stacks the parts with an offset. In other words, what you see depends on the rendering engine used.

Sounds interesting. Any specific example?

Is there one that actually combines them as intended? LaTeX?

There used to be a "layout engine" in ICU; it was abandoned and replaced with HarfBuzz. Both are open source, which made it easy to include them in Firefox and Android.

But that also means that each application (depending on the Unicode tables and rendering engine used) has its own Unicode subset, which might or might not look the same as any of the others when put on the screen or printer.

That depends on their implementation, bugs, and settings. The same application/OS might have an extension to support, say, complex scripts. By default the extension is not installed or enabled, because it has a small efficiency impact; you need to install or activate it for the rendering engine to give you the expected results.

And I think the best way to compare Unicode chars would be to split them into the base shape and the separate combining characters. Then again, that would require expanding those, as there are 'attachments' not covered by them.

OK, that would make it even harder to determine how much storage space you need to reserve.

Actually, the best way to store them would probably be something like Huffman encoding (7-Zip etc.): expand each character you come across, keep a list, and store only the index in your string or table. That way, characters are the same whenever they look the same, and each one fits in a single 32-bit value. Then always display them multi-pass, the parts on top of each other.

I think that's how Unicode should have been.

On the other hand, that won't fix the sorting problem: you still need a separate table for each language, although you could limit those to only the base shapes and the attachments that make a difference.

You might be onto something here, but without testing actual code, it is hard to say.
