
Author Topic: Extended ASCII use - 2  (Read 10253 times)

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 915
Re: Extended ASCII use - 2
« Reply #45 on: January 15, 2022, 04:12:24 pm »
Quote
Sounds interesting. Any specific example?

There are at least three precomposed 'E' variants in the default OpenOffice character set: one with a cedilla, one with a (rounded) breve, and one with both a cedilla and a (pointy) breve. There are also a breve and a cedilla as combining characters. So at first I expected I could make at least six characters that looked the same, but from all those I could only combine three.
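For reference, a small FPC program that builds both forms (the CodepointToUTF8 helper is just something I threw together for this post, and it assumes the console is set to UTF-8):

Code: Pascal  [Select]
program CombiningDemo;
{$mode objfpc}{$H+}

// Encode a single Unicode code point as UTF-8 (ad-hoc helper for this sketch).
function CodepointToUTF8(cp: Cardinal): string;
begin
  if cp < $80 then
    Result := Chr(cp)
  else if cp < $800 then
    Result := Chr($C0 or (cp shr 6)) + Chr($80 or (cp and $3F))
  else if cp < $10000 then
    Result := Chr($E0 or (cp shr 12)) + Chr($80 or ((cp shr 6) and $3F)) +
              Chr($80 or (cp and $3F))
  else
    Result := Chr($F0 or (cp shr 18)) + Chr($80 or ((cp shr 12) and $3F)) +
              Chr($80 or ((cp shr 6) and $3F)) + Chr($80 or (cp and $3F));
end;

var
  Precomposed, Combined: string;
begin
  // U+1E1C LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE: one code point.
  Precomposed := CodepointToUTF8($1E1C);
  // 'E' + U+0327 COMBINING CEDILLA + U+0306 COMBINING BREVE: three code points.
  Combined := 'E' + CodepointToUTF8($0327) + CodepointToUTF8($0306);

  WriteLn('Precomposed: ', Precomposed, ' (', Length(Precomposed), ' bytes)');
  WriteLn('Combined:    ', Combined, ' (', Length(Combined), ' bytes)');
  // A plain byte comparison sees two different strings, even though a
  // renderer is supposed to draw the same glyph for both.
  WriteLn('Byte-equal:  ', Precomposed = Combined);
end.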

I didn't try other character sets, but some googling showed that Word does the same. What happens is that the combining characters become small, separate characters that sit just to the right of the base character, like you see here (if your browser displays it the same as mine), even when they should go in the middle.

On the other hand, in a browser you can make Zalgo text, which stacks them all above and below one another.

Quote
You might be up to something here, but without testing an actual code, it is hard to say.

My first idea was to turn every glyph into separate strokes, like with Chinese. That's definitely the simplest way to do it, but font designers wouldn't stand for it (although they might not like many different 'E's that all might have to be different, either). And it would be very hard to describe all letters like that, although in a sense that is what vector fonts already do. Spotting patterns like that is hard for computers; perhaps an AI could do it.

The code you posted is interesting: it seems to analyze the separate parts (through tables and the number of strokes). It is much easier to spot the base forms and combining marks, because we have a list of them (although not all the attachments exist as combining marks, like the rounded breve). But that still results in glyphs that have different byte lengths, especially the Arabic and Asian ones.

If you put those in something like a TStringList (one line for each glyph), you can simply use the index to represent each one. But you would have to store that table (the StringList) in each file as well. Which is also how compressed archive files work: you take the first byte, put it in the list and replace it with a '1', take the next and replace it with a '2', and so on, and you attach the table to the end of the file.
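Roughly like this (just a sketch; the glyph list is hard-coded here instead of coming from a real segmentation step, and the multi-byte glyph is written out as raw UTF-8 bytes):

Code: Pascal  [Select]
program GlyphTableDemo;
{$mode objfpc}{$H+}

uses
  Classes;

const
  // Pretend these came out of a segmentation step: one entry per glyph,
  // whatever its byte length (the fifth one is Ḝ as raw UTF-8 bytes).
  Glyphs: array[0..8] of string =
    ('t', 'h', 'e', ' ', #$E1#$B8#$9C, ' ', 't', 'h', 'e');

var
  GlyphTable: TStringList;   // one line for each distinct glyph
  Indices: array of Integer; // the "text", stored as indices into the table
  i, Idx: Integer;
begin
  GlyphTable := TStringList.Create;
  try
    GlyphTable.CaseSensitive := True; // exact matches only
    SetLength(Indices, Length(Glyphs));
    for i := 0 to High(Glyphs) do
    begin
      Idx := GlyphTable.IndexOf(Glyphs[i]);
      if Idx < 0 then
        Idx := GlyphTable.Add(Glyphs[i]); // first occurrence: add to the table
      Indices[i] := Idx;                  // only the index ends up in the "file"
    end;

    Write('Indices:');
    for i := 0 to High(Indices) do
      Write(' ', Indices[i]);
    WriteLn;
    WriteLn('Table entries: ', GlyphTable.Count);

    // Decoding is just a lookup; the table has to travel with the data.
    for i := 0 to High(Indices) do
      Write(GlyphTable[Indices[i]]);
    WriteLn;
  finally
    GlyphTable.Free;
  end;
end.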

Then again, archives use a binary tree to store the information, which is how they compress it. That would be the UTF-8 equivalent. And at the other end of the scale you can use a fixed 32-bit index.

That would result in files like this:

- UTFX-8, UTFX-16 and UTFX-32, where the total number of separate glyphs fits in that many bits.
- UTFX-Z (like: zipped), where each glyph takes up a different number of bits: say, 1 bit for lower case 'e', or even better for the word 'the', or the Japanese / Chinese equivalent. You would have to expand that into one of the other formats to be able to process it, but it would be very compact to store and transmit.

Like with the other UTF files, you start with a 16-bit value that specifies the format, followed by a 64-bit offset to the table used. Sorting and comparing become much easier, although you have to compare the values in the table, not the index values (which will be different for each file).
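Something like this, purely as an illustration (the record layout, the format codes and the file name are all made up, and byte order is ignored):

Code: Pascal  [Select]
program UTFXHeaderDemo;
{$mode objfpc}{$H+}

uses
  Classes;

type
  // Hypothetical "UTFX" header: a 16-bit format code followed by a
  // 64-bit offset to the glyph table, as described above.
  TUTFXHeader = packed record
    Format: Word;       // 0 = UTFX-8, 1 = UTFX-16, 2 = UTFX-32, 3 = UTFX-Z
    TableOffset: QWord; // where the glyph table starts in the file
  end;

var
  Header: TUTFXHeader;
  FS: TFileStream;
begin
  Header.Format := 2;      // pretend this file uses fixed 32-bit indices
  Header.TableOffset := 0; // would be patched once the index data is written
  FS := TFileStream.Create('demo.utfx', fmCreate);
  try
    FS.WriteBuffer(Header, SizeOf(Header)); // 10 bytes
    // ... the index data and then the glyph table would follow here ...
  finally
    FS.Free;
  end;
end.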

In the end, that's almost the same as what we do now, only the symbol table isn't static. Which I highly doubt it could ever be, as long as we don't all use the same language.
« Last Edit: January 16, 2022, 01:21:13 pm by SymbolicFrank »

PascalDragon

  • Hero Member
  • *****
  • Posts: 4005
  • Compiler Developer
Re: Extended ASCII use - 2
« Reply #46 on: January 15, 2022, 06:42:08 pm »
Quote
Actually, the best way to store them would probably be like Huffman encoding (7-zip etc). Expand each character you come across, make a list and only store the index in your string or table. That way, they will all be the same when they look the same and fit in a single, 32-bit value. And always display them multi-pass, the parts on top of each other.

Please note that you're simplifying things. Not only do fonts support ligatures, meaning two characters next to each other might be rendered differently than if they are apart (and this can be used for rather insane things, like generating charts by just writing text, or specially highlighting the symbols of programming languages).

But then there are emojis, which make use of the ZERO WIDTH JOINER to combine different characters into a new glyph. E.g. the rainbow and a flag to generate the rainbow flag, or the gender modifiers for the various gendered emojis, or the different color variations of the family and pairing emojis (though most of this is strongly related to ligatures and their underlying mechanisms).
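As an illustration, the rainbow flag is really four code points glued together by the joiner; whether it shows up as one glyph or as a white flag plus a rainbow depends entirely on the font and renderer:

Code: Pascal  [Select]
program ZwjDemo;
{$mode objfpc}{$H+}

var
  Flag: string;
begin
  // Rainbow flag emoji, spelled out as raw UTF-8 bytes:
  Flag := #$F0#$9F#$8F#$B3   // U+1F3F3 WAVING WHITE FLAG
        + #$EF#$B8#$8F       // U+FE0F  VARIATION SELECTOR-16 (emoji presentation)
        + #$E2#$80#$8D       // U+200D  ZERO WIDTH JOINER
        + #$F0#$9F#$8C#$88;  // U+1F308 RAINBOW
  WriteLn(Flag);
  WriteLn(Length(Flag), ' bytes, 4 code points, ideally 1 glyph on screen');
end.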

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 915
Re: Extended ASCII use - 2
« Reply #47 on: January 16, 2022, 01:20:35 pm »
Yes, I know that. But it makes no difference. In many languages, the "unit" of writing is not a letter, but a word. They can consist of smaller units, but they might not. Many such languages do both: words that are a single glyph and words that are combined.

What you (the programmer) are interested in is the individual glyphs. If a glyph is a complex picture that exists by itself, that is fine. If it consists of different glyphs that are "glued" together with zero-width joiners, that works as well. Kerning between different glyphs can be calculated (it is part of most font files).

More interesting are the ones you can write in different ways, like 'ẞ' (equal to 'SS'), or languages like Korean, where you can use two different writing systems (hangeul and hanja) mostly interchangeably. That can be done with either a table or multiple entries, although the latter would always print the same glyph, no matter which one was input.
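The table variant could look roughly like this (completely made up, and it only maps the sharp s, just to show the idea; the characters are written as raw UTF-8 bytes so the source encoding doesn't matter):

Code: Pascal  [Select]
program EquivDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

const
  // Hypothetical equivalence table: each interchangeable spelling maps to
  // one canonical form, so comparisons go through the table, not the bytes.
  EquivCount = 2;
  EquivFrom: array[0..EquivCount - 1] of string = (#$E1#$BA#$9E, #$C3#$9F); // 'ẞ', 'ß'
  EquivTo:   array[0..EquivCount - 1] of string = ('SS', 'ss');

function Canonical(const S: string): string;
var
  i: Integer;
begin
  Result := S;
  for i := 0 to EquivCount - 1 do
    Result := StringReplace(Result, EquivFrom[i], EquivTo[i], [rfReplaceAll]);
end;

begin
  // 'STRAẞE' and 'STRASSE' compare equal once both go through the table.
  WriteLn(Canonical('STRA' + #$E1#$BA#$9E + 'E') = Canonical('STRASSE'));
end.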

Still, it won't be a fast or easy project. But in the long term it would beat Unicode on all fronts. And you could actually parse it ;)


Edit: also interesting is Arabic, where most glyphs are connected without spaces. But (as far as I know), while most 'letters' are individual units, there are also long 'sentences' that are treated as a single glyph. So spaces are not the primary separators. What's more, I expect that parts of those long sequences could be individual units in different contexts. Still, that isn't much different from glyphs that consist of multiple letters and combining marks.

Also, I would be very interested to hear from people who don't use a variant of the Latin script how it works for their language, because I expect that most of those are 'tacked on afterwards'.
« Last Edit: January 16, 2022, 02:24:29 pm by SymbolicFrank »

 
