
Extended ASCII use - 2


SymbolicFrank:
Now only if there was a Unicode variant where each glyph took up the same amount of space...

UTF-32 sounds great, but even it is not fixed length, because of combining characters and grapheme clusters. And while using it looks wasteful to users of the Latin alphabet, whose glyphs mostly fit in 7 bits, users of many other languages consider it wasteful too: only 21 of the 32 bits are used, yet they still need multiple 32-bit 'chars' to build the grapheme clusters their written language requires. The information density of UTF-8 is even worse.
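To illustrate that point: even in UTF-32, one visible glyph can need several code points. A minimal FPC sketch (the combining sequence below is just one example; `Length` here counts UTF-16 code units, which for these BMP characters equals the code-point count):

```pascal
program GraphemeDemo;
{$mode objfpc}{$H+}
var
  s: UnicodeString;
begin
  // A single grapheme 'Ḝ' built from three code points:
  // U+0045 (E) + U+0327 (combining cedilla) + U+0306 (combining breve).
  // Each of these would also occupy its own 32-bit unit in UTF-32.
  s := #$0045#$0327#$0306;
  WriteLn(Length(s)); // 3 code points, yet one glyph on screen
end.
```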

But because it is now the standard, don't hold your breath waiting for a new UTF-32 that actually uses all those bits to encode every possible glyph / grapheme cluster.

So, overall, Unicode does work, but just barely. Technically it's a jack of all trades, master of none. Best example: sorting.

SymbolicFrank:
I made a nice example:


--- Code: Pascal ---
'Ḝ' <> 'Ḝ' <> 'Ḝ'
They might look the same, but they aren't. Depending on your sort order and how the sorting is implemented, they might end up together, or they might not.

Unicode and the applications that display the resulting glyphs are strangely inconsistent as to what can be combined into what.
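The difference becomes visible when you dump the code points. A small sketch (the three sequences below are the precomposed, partially composed, and fully decomposed forms of Ḝ, assumed to correspond to the example above):

```pascal
program DumpCodepoints;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Print the UTF-16 code units of a string as U+XXXX values.
procedure Dump(const s: UnicodeString);
var
  i: Integer;
begin
  for i := 1 to Length(s) do
    Write('U+', IntToHex(Ord(s[i]), 4), ' ');
  WriteLn;
end;

begin
  Dump(#$1E1C);             // precomposed: E WITH CEDILLA AND BREVE
  Dump(#$0228#$0306);       // E WITH CEDILLA + COMBINING BREVE
  Dump(#$0045#$0327#$0306); // E + COMBINING CEDILLA + COMBINING BREVE
end.
```

All three render as the same glyph, yet a naive byte or code-point comparison treats them as three different strings.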

engkin:

--- Quote from: SymbolicFrank on January 13, 2022, 11:18:02 am ---I made a nice example:


--- Code: Pascal ---
'Ḝ' <> 'Ḝ' <> 'Ḝ'
They might look the same, but aren't. Depending on your sort order and how it is sorted, they might end up together, but they might not.

Unicode and the applications that display the resulting glyphs are strangely inconsistent as to what can be combined into what.

--- End quote ---

To compare them, try using:

--- Code: Pascal ---
uses
  unicodedata;

function Test: boolean;
var
  s, o: array of string;
  i: integer;
begin
  Result := False;
  s := ['Ḝ', 'Ḝ', 'Ḝ'];
  SetLength(o, Length(s));
  for i := Low(s) to High(s) do
  begin
    o[i] := NormalizeNFD(s[i]);
    if o[Low(s)] <> o[i] then
      exit;
  end;
  Result := True;
end;

tetrastes:

--- Quote from: PascalDragon on January 12, 2022, 01:53:54 pm ---
--- Quote from: tetrastes on January 12, 2022, 10:05:31 am ---Just curious, and with no time to read the sources right now: if on Unix we have to use the clib UnicodeStringManager (uses cwstring), is there overhead in converting the 2-byte UnicodeChar to the 4-byte wchar_t?

--- End quote ---

Not in the sense you think, because essentially no POSIX API expects wchar_t. Thus the TUnicodeStringManager never needs to convert from UTF-16 to UTF-32. But it needs to convert from UTF-16 to UTF-8 (assuming the system is set to UTF-8 which is essentially the default nowadays).


--- End quote ---

It seemed strange to me that there would be an unused type, so I looked in cwstring.pp and found several functions, POSIX or not, that use wchar_t. And since they are used in cwstring.pp, there is overhead in exactly the sense I meant. For example:

--- Code: Pascal ---
function wcscoll(__s1: pwchar_t; __s2: pwchar_t): cint; cdecl; external clib name 'wcscoll';

...

function CompareWideString(const s1, s2: WideString; Options: TCompareOptions): PtrInt;
{$if not(defined(aix) and defined(cpupowerpc32))}
var
  hs1, hs2: UCS4String;
  us1, us2: WideString;
begin
  { wcscoll interprets null chars as end-of-string -> filter out }
  if coIgnoreCase in Options then
    begin
      us1 := UpperWideString(s1);
      us2 := UpperWideString(s2);
    end
  else
    begin
      us1 := s1;
      us2 := s2;
    end;
  hs1 := WideStringToUCS4StringNoNulls(us1);
  hs2 := WideStringToUCS4StringNoNulls(us2);
  result := wcscoll(pwchar_t(hs1), pwchar_t(hs2));
end;
{$else}
{ AIX/PPC32 has a 16 bit wchar_t }
where WideStringToUCS4StringNoNulls converts UTF-16 string to UTF-32, naturally.

Thaddy:
Note it is a misunderstanding that a UTF-16 "char" is always two bytes; it can be four as well.
Only UCS-2 is always two bytes. The problem is that Delphi declared its character type before the UTF-16 standard was extended, and FPC adheres to the Delphi declaration, while the size of wchar_t differs per platform (it may hold UTF-16 or UTF-32 code units).
Sooner or later this will lead to problems.
Note that UCS-2 originally equalled UTF-16, but nowadays it is simply the 2-byte subset of UTF-16.
The whole issue is surrogate pairs, which can make a single code point take 4 bytes.
https://en.wikipedia.org/wiki/UTF-16 clearly states that UTF-16 is a variable-length encoding.
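The surrogate-pair behaviour is easy to demonstrate in FPC. A minimal sketch (the emoji code point is just an arbitrary example from outside the BMP):

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  s: UnicodeString;
begin
  // U+1F600 (GRINNING FACE) lies outside the Basic Multilingual Plane,
  // so UTF-16 stores it as the surrogate pair U+D83D U+DE00.
  s := #$D83D#$DE00;
  WriteLn(Length(s));              // 2 code units for a single code point
  WriteLn(IntToHex(Ord(s[1]), 4)); // D83D (high surrogate)
  WriteLn(IntToHex(Ord(s[2]), 4)); // DE00 (low surrogate)
end.
```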
