Extended ASCII use - 2

tetrastes:
cwstring.pp knows that. There is code in WideStringToUCS4StringNoNulls for converting surrogate pairs.
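
Simplified, the surrogate-pair math that conversion does looks like this (a sketch with a made-up function name, not the actual RTL code):

--- Code: Pascal ---
{ Sketch of decoding one UTF-16 surrogate pair to UCS4:
  code point = $10000 + (hi - $D800) shl 10 + (lo - $DC00).
  E.g. U+1F600: hi = $D83D, lo = $DE00 -> $1F600. }
function SurrogatePairToUCS4(hi, lo: WideChar): UCS4Char;
begin
  Result := UCS4Char($10000 + ((Ord(hi) - $D800) shl 10) + (Ord(lo) - $DC00));
end;
--- End code ---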

SymbolicFrank:

--- Quote from: engkin on January 13, 2022, 05:49:44 pm ---To compare them, try using:

--- End quote ---

Very interesting, thanks!

PascalDragon:

--- Quote from: tetrastes on January 14, 2022, 01:07:21 pm ---
--- Quote from: PascalDragon on January 12, 2022, 01:53:54 pm ---
--- Quote from: tetrastes on January 12, 2022, 10:05:31 am ---Just curious, as I have no time to read the sources right now: if on Unix we have to use the clib UnicodeStringManager (uses cwstring), is there overhead converting the 2-byte UnicodeChar to the 4-byte wchar_t?

--- End quote ---

Not in the sense you think, because essentially no POSIX API expects wchar_t. Thus the TUnicodeStringManager never needs to convert from UTF-16 to UTF-32. But it needs to convert from UTF-16 to UTF-8 (assuming the system is set to UTF-8, which is essentially the default nowadays).


--- End quote ---

It seemed strange to me that there would be an unused type, so I looked into cwstring.pp and found some functions, POSIX or not, that use wchar_t. And as they are used in cwstring.pp, there is overhead in the sense I meant. For example:

--- Code: Pascal ---
function wcscoll (__s1:pwchar_t; __s2:pwchar_t):cint;cdecl;external clib name 'wcscoll';

...

function CompareWideString(const s1, s2 : WideString; Options : TCompareOptions) : PtrInt;
{$if not(defined (aix) and defined(cpupowerpc32))}
  var
    hs1,hs2 : UCS4String;
    us1,us2 : WideString;
  begin
    { wcscoll interprets null chars as end-of-string -> filter out }
    if coIgnoreCase in Options then
      begin
        us1:=UpperWideString(s1);
        us2:=UpperWideString(s2);
      end
    else
      begin
        us1:=s1;
        us2:=s2;
      end;
    hs1:=WideStringToUCS4StringNoNulls(us1);
    hs2:=WideStringToUCS4StringNoNulls(us2);
    result:=wcscoll(pwchar_t(hs1),pwchar_t(hs2));
  end;
{$else}
  { AIX/PPC32 has a 16 bit wchar_t }
--- End code ---
where WideStringToUCS4StringNoNulls converts the UTF-16 string to UTF-32, naturally.

--- End quote ---

*shrugs* That's how it is. But UnicodeString isn't used that often on Unix anyway. Lazarus uses UTF-8 and that goes through the AnsiString routines.
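
The conversion that does happen on such a system is just UTF-16 to UTF-8, e.g. (a trivial illustration using UTF8Encode from the system unit):

--- Code: Pascal ---
program Utf16ToUtf8Demo;
{$mode objfpc}{$H+}
var
  u: UnicodeString;  { UTF-16 internally }
  a: UTF8String;     { what the C library functions actually receive }
begin
  u := 'h' + UnicodeChar($00E9) + 'llo';  { "héllo", avoiding source codepage issues }
  a := UTF8Encode(u);                     { UTF-16 -> UTF-8, no wchar_t involved }
  WriteLn(Length(u), ' UTF-16 code units -> ', Length(a), ' UTF-8 bytes');
end.
--- End code ---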

SymbolicFrank:
It seems that many applications, like Word and OpenOffice, only render combining characters correctly when there is already another (precomposed) character that looks the same. They can only display a single glyph at any location, while a browser stacks them with an offset. In other words, what you see depends on the rendering engine used.

Is there one that actually combines them as intended? LaTeX?

But that also means that each application (depending on the Unicode table and rendering engine used) has its own Unicode subset, which might or might not look the same as any of the others when put on the screen or printer.

And I think the best way to compare Unicode chars would be to split them into the base shape and the separate combining characters. Then again, that would require expanding those, as there are "attachments" not covered by them.
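
Something like this, say (a toy sketch; the one-entry decomposition table and the names are made up, real data would have to come from the Unicode tables, and combining marks would also need reordering):

--- Code: Pascal ---
{ Toy sketch: compare by decomposing into base char + combining marks. }
function Decompose(c: UnicodeChar): UnicodeString;
begin
  case Ord(c) of
    $00E9: Result := 'e' + UnicodeChar($0301); { é -> e + combining acute }
  else
    Result := c;
  end;
end;

function SameWhenDecomposed(const a, b: UnicodeString): Boolean;
var
  da, db: UnicodeString;
  i: Integer;
begin
  da := '';
  db := '';
  for i := 1 to Length(a) do
    da := da + Decompose(a[i]);
  for i := 1 to Length(b) do
    db := db + Decompose(b[i]);
  Result := da = db;  { "é" and "e"+U+0301 now compare as equal }
end;
--- End code ---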

Ok, that would make it even harder to determine how much storage space you need to reserve.

Actually, the best way to store them would probably be something like Huffman encoding (7-zip etc.): expand each character you come across, make a list, and only store the index in your string or table. That way they will all be the same when they look the same, and each fits in a single 32-bit value. And always display them multi-pass, the parts on top of each other.
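
Roughly like this (an untested sketch with made-up names; a real implementation would use a hash map instead of a linear search):

--- Code: Pascal ---
program InternDemo;
{$mode objfpc}{$H+}

var
  GlyphTable: array of UnicodeString;  { index -> expanded cluster }

{ Return the 32-bit index of a cluster, adding it on first sight. }
function InternCluster(const Cluster: UnicodeString): LongWord;
var
  i: Integer;
begin
  for i := 0 to High(GlyphTable) do
    if GlyphTable[i] = Cluster then
      Exit(LongWord(i));
  SetLength(GlyphTable, Length(GlyphTable) + 1);
  GlyphTable[High(GlyphTable)] := Cluster;
  Result := LongWord(High(GlyphTable));
end;

begin
  { the same cluster always maps to the same 32-bit value: }
  WriteLn(InternCluster('e' + UnicodeChar($0301)));  { 0 }
  WriteLn(InternCluster('e' + UnicodeChar($0301)));  { 0 again }
  WriteLn(InternCluster('a'));                       { 1 }
end.
--- End code ---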

I think that's how Unicode should have been.

On the other hand, that won't fix the sorting problem. You still need a separate table for each language, although you can limit those to only the base shapes and attachments that make a difference.

engkin:

--- Quote from: SymbolicFrank on January 15, 2022, 01:38:06 pm ---It seems that many applications, like Word and OpenOffice, only render combining characters correctly when there is already another (precomposed) character that looks the same. They can only display a single glyph at any location, while a browser stacks them with an offset. In other words, what you see depends on the rendering engine used.

--- End quote ---

Sounds interesting. Any specific example?


--- Quote from: SymbolicFrank on January 15, 2022, 01:38:06 pm ---Is there one that actually combines them as intended? LaTeX?

--- End quote ---

There used to be a "layout engine" in ICU; it was abandoned and replaced with HarfBuzz. Both are open source, which made it easy to include them in Firefox and Android.


--- Quote from: SymbolicFrank on January 15, 2022, 01:38:06 pm ---But that also means that each application (depending on the Unicode table and rendering engine used) has its own Unicode subset, which might or might not look the same as any of the others when put on the screen or printer.

--- End quote ---

Depends on their implementation, bugs, and settings. The same application/OS might have an extension to support, say, Complex Scripts. By default the extension is not installed or enabled because it has a small efficiency impact. You need to install/activate it for the rendering engine to give you the expected results.


--- Quote from: SymbolicFrank on January 15, 2022, 01:38:06 pm ---And I think the best way to compare Unicode chars would be to split them into the base shape and the separate combining characters. Then again, that would require expanding those, as there are "attachments" not covered by them.

Ok, that would make it even harder to determine how much storage space you need to reserve.

Actually, the best way to store them would probably be something like Huffman encoding (7-zip etc.): expand each character you come across, make a list, and only store the index in your string or table. That way they will all be the same when they look the same, and each fits in a single 32-bit value. And always display them multi-pass, the parts on top of each other.

I think that's how Unicode should have been.

On the other hand, that won't fix the sorting problem. You still need a separate table for each language, although you can limit those to only the base shapes and attachments that make a difference.

--- End quote ---

You might be onto something here, but without testing actual code, it is hard to say.
