Author Topic: Encoding agnostic functions for codepoints + an iterator  (Read 13214 times)

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Encoding agnostic functions for codepoints + an iterator
« Reply #15 on: June 27, 2016, 05:04:29 pm »
Or the one I linked above.
Too bad it is outdated now, too.
If I understand correctly, your Internet Tools are for UTF-8 only. There is some overlap with the LazUtils units.
Does it really support all Unicode combining rules (minus the latest additions)? How would one make an enumerator for combined characters?

Quote
They released Unicode 9 last week.
There are some new combining characters.
OK, that is not a serious problem. It should be easy to fix.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

BeniBela

  • Hero Member
  • *****
  • Posts: 906
    • homepage
Re: Encoding agnostic functions for codepoints + an iterator
« Reply #16 on: June 27, 2016, 05:36:49 pm »
If I understand right your Internet Tools are for UTF-8 only.
yes

Does it really support all Unicode combining rules (minus the latest additions)?

Yes, for those characters that can be combined into a single codepoint.

There are two functions, utf8proc_NFD and utf8proc_NFC, to convert between the decomposed and precomposed forms (e.g. transforming between a + combining diaeresis and the precomposed ä; additional marks like those on ḁ̩̬̪̆̃̊́ would remain unchanged).
bbnormalizeunicode is based on theo's utf8proc port, from which I removed everything not needed for combining characters (getting rid of 500 kB of tables); it is independent of all the other units in the repository.
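To make the conversion direction concrete, here is a minimal round-trip sketch; the string-in/string-out signatures of utf8proc_NFD and utf8proc_NFC are an assumption here, so check bbnormalizeunicode for the actual declarations:

Code: Pascal
uses bbnormalizeunicode;

// Assumes a UTF-8 encoded source file and UTF-8 string in/out signatures.
procedure NormalizeRoundTrip;
var
  Decomposed, Precomposed: string;
begin
  Decomposed := utf8proc_NFD('ä');          // 1 codepoint -> 'a' + U+0308
  WriteLn(Length(Decomposed));              // 3 bytes: 'a' + 2-byte combining mark
  Precomposed := utf8proc_NFC(Decomposed);  // recombine into the precomposed form
  WriteLn(Length(Precomposed));             // 2 bytes: UTF-8 for precomposed 'ä'
end;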
 

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Encoding agnostic functions for codepoints + an iterator
« Reply #17 on: June 27, 2016, 08:34:51 pm »
Yes, for those characters that can be combined into a single codepoint.

You mean the precomposed characters that can be represented either as one codepoint or as a codepoint for a base letter followed by a combining accent mark.
That is only a subset of all combined characters.
Different exotic languages have different rules for this. Some need big tables for the information, some are algorithmic. I don't know the details and I doubt I will learn them all.

It would be nice if your Unicode stuff was isolated and usable from other projects, too.
Now there is a separate package internettools_utf8 containing one unit, but it depends on internettools package.
The dependency should be the other way around.

bbunicodeinfo and bbnormalizeunicode have duplicate implementations of utf8proc_NFD and utf8proc_NFC functions.
I will study this more ...
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

BeniBela

  • Hero Member
  • *****
  • Posts: 906
    • homepage
Re: Encoding agnostic functions for codepoints + an iterator
« Reply #18 on: June 27, 2016, 09:59:54 pm »
You mean the precomposed characters that can be represented either as one codepoint or as a codepoint for a base letter followed by a combining accent mark.
That is only a subset of all combined characters.

Everything that has a composed codepoint.

It would be nice if your Unicode stuff was isolated and usable from other projects, too.
Now there is a separate package internettools_utf8 containing one unit, but it depends on internettools package.
The dependency should be the other way around.

I do not use packages, so just copy the .pas/.inc files.

And there is a table generator in Ruby: https://github.com/benibela/utf8proc/

internettools_utf8 used to have other dependencies. Now that I have my own UTF-8 functions, I will remove it altogether.


bbunicodeinfo and bbnormalizeunicode have duplicate implementations of utf8proc_NFD and utf8proc_NFC functions.
I will study this more ...

Somewhere on the web is unicodeinfo.pas, which is theo's port of utf8proc.

bbunicodeinfo is unicodeinfo updated to the newest utf8proc version, with Unicode 8.

bbnormalizeunicode is bbunicodeinfo with everything removed except the de/composition stuff.

This also leads to three compile modes of the Internet Tools: bbnormalizeunicode together with the tables in BeRo's FLRE for upper/lower case (we have come full circle), bbunicodeinfo, or theo's unicodeinfo.
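As an illustration of how three such modes might be selected, here is a hypothetical sketch; the actual conditional defines used in the Internet Tools repository may differ:

Code: Pascal
// Hypothetical define names - check the Internet Tools sources for the real ones.
{$IF DEFINED(USE_BBNORMALIZEUNICODE)}
uses bbnormalizeunicode, FLRE;  // de/composition here, case tables from FLRE
{$ELSEIF DEFINED(USE_BBUNICODEINFO)}
uses bbunicodeinfo;             // full utf8proc port, updated to Unicode 8
{$ELSE}
uses unicodeinfo;               // theo's original utf8proc port
{$ENDIF}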

BeRo

  • New Member
  • *
  • Posts: 45
    • My site
Re: Encoding agnostic functions for codepoints + an iterator
« Reply #19 on: June 28, 2016, 05:39:08 am »

PUCU is updated to Unicode 9.0.0 now on GitHub.

Furthermore, you can always update PUCU yourself: download the latest Unicode dataset from ftp://www.unicode.org/Public/UCD/latest/ucd/ and ftp://www.unicode.org/Public/UCD/latest/ucd/extracted/ into the PUCU src/UnicodeData sub-directory, then rebuild and run PUCUConvertUnicode.dpr, then rebuild and run PUCUBuild.dpr, and you will have a PUCU.pas with current Unicode data tables.


serbod

  • Full Member
  • ***
  • Posts: 142
Re: Encoding agnostic functions for codepoints + an iterator
« Reply #20 on: June 28, 2016, 01:39:20 pm »
Then I made an enumerator for combined accented Unicode characters. I realize this is a can of worms because the rules for combining are so complex. Accented characters are only a subset.
Thus an external library like this one from BeRo would be ideal for combined codepoint stuff.

Such a big library must be optional.

For a start, a minimal toolset for common use is needed:
- a codepoint iterator/array
- conversion (ANSI/UTF-8/UTF-16/UTF-32) and normalization functions (C/D/KC/KD)
- general functions (search/replace/compare/case conversion/trim/copy/delete)
- functions to get metadata about a codepoint or a whole string (encoding, category, plane, block, script, etc.)

Some of these are already provided by the operating system, possibly with some auto-conversion (UTF-8 > UTF-16 on Windows). Some can be simplified (for WGL4/W1G). There is no need to pull the whole Unicode universe into a common application. If you use an encoding-agnostic function, you don't care about the conversion and its details; the visible result is the same.

So we don't need a decomposed Unicode iterator - just str.normalize(D) before the codepoint iterator, as sketched below.
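A minimal sketch of that normalize-then-iterate idea, reusing utf8proc_NFD from earlier in the thread (its string-in/string-out signature is an assumption here) together with UTF8CodepointSize from LazUtils' LazUTF8 unit (named UTF8CharacterLength in older versions):

Code: Pascal
uses LazUTF8, bbnormalizeunicode;

// Print each codepoint of the decomposed form on its own line.
procedure DumpDecomposedCodepoints(const S: string);
var
  NFD: string;
  i, CPLen: Integer;
begin
  NFD := utf8proc_NFD(S);                 // decompose first: 'ä' -> 'a' + U+0308
  i := 1;
  while i <= Length(NFD) do
  begin
    CPLen := UTF8CodepointSize(@NFD[i]);  // byte length of the codepoint at i
    WriteLn(Copy(NFD, i, CPLen));
    Inc(i, CPLen);
  end;
end;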

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Encoding agnostic functions for codepoints + an iterator
« Reply #21 on: July 12, 2016, 03:16:53 pm »
I have committed the units for encoding agnostic code into LazUtils package in Lazarus trunk.
See:
 http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#CodePoint_functions_for_encoding_agnostic_code

It deals with CodePoints and supports Delphi, too.
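For reference, a minimal sketch of that encoding-agnostic style, with unit and function names (LazUnicode, CodePointLength, CodePointCopy and the for-in codepoint enumerator) taken from the wiki page; check the page for the exact API:

Code: Pascal
uses LazUnicode;

procedure ShowCodePoints(const S: string);
var
  CP: string;
begin
  WriteLn('Codepoint count: ', CodePointLength(S));
  WriteLn('First 2 codepoints: ', CodePointCopy(S, 1, 2));
  for CP in S do   // the enumerator yields one codepoint per iteration
    Write('[', CP, ']');
  WriteLn;
end;

The same code is meant to work whether the strings are UTF-8 (Lazarus) or UTF-16 (Delphi), which is the point of the encoding-agnostic functions.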
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

 
