Author Topic: Encoding agnostic functions for codepoints + an iterator  (Read 13214 times)

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Encoding agnostic functions for codepoints + an iterator
« Reply #15 on: June 27, 2016, 05:04:29 pm »
Or the one I linked above.
Too bad it is outdated now, too.
If I understand correctly, your Internet Tools are for UTF-8 only. There is some overlap with the LazUtils units.
Does it really support all Unicode combining rules (minus the latest additions)? How would one make an enumerator for combined characters?

Quote
They released Unicode 9 last week.
There are some new combining characters.
OK, that is not a serious problem. It should be easy to fix.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

BeniBela

  • Hero Member
  • *****
  • Posts: 906
    • homepage
Re: Encoding agnostic functions for codepoints + an iterator
« Reply #16 on: June 27, 2016, 05:36:49 pm »
If I understand right your Internet Tools are for UTF-8 only.
yes

Does it really support all Unicode combining rules (minus the latest additions)?

Yes, for those characters that can be combined into a single codepoint.

There are two functions, utf8proc_NFD and utf8proc_NFC, to convert between the decomposed and precomposed forms (e.g. transforming between a + combining diaeresis and the precomposed ä; additional marks like those on ḁ̩̬̪̆̃̊́ would remain unchanged).
bbnormalizeunicode is based on theo's utf8proc port, from which I removed everything not needed for combining characters (getting rid of 500 kB of tables); it is independent of all the other units in the repository.
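To make the conversion direction concrete, here is a minimal round-trip sketch; the string-in/string-out signatures of utf8proc_NFD and utf8proc_NFC are an assumption here, so check bbnormalizeunicode for the actual declarations:

Code: Pascal
uses bbnormalizeunicode;

// Assumes a UTF-8 encoded source file and UTF-8 string in/out signatures.
procedure NormalizeRoundTrip;
var
  Decomposed, Precomposed: string;
begin
  Decomposed := utf8proc_NFD('ä');          // 1 codepoint -> 'a' + U+0308
  WriteLn(Length(Decomposed));              // 3 bytes: 'a' + 2-byte combining mark
  Precomposed := utf8proc_NFC(Decomposed);  // recombine into the precomposed form
  WriteLn(Length(Precomposed));             // 2 bytes: UTF-8 for precomposed 'ä'
end;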
 

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Encoding agnostic functions for codepoints + an iterator
« Reply #17 on: June 27, 2016, 08:34:51 pm »
Yes, for those characters that can be combined into a single codepoint.

You mean the precomposed characters that can be represented either as one codepoint or as a codepoint for a base letter followed by a combining accent mark.
That is only a subset of all combined characters.
Different exotic languages have different rules for this. Some need big tables for the information, some are algorithmic. I don't know the details and I doubt I will learn them all.

It would be nice if your Unicode stuff was isolated and usable from other projects, too.
Now there is a separate package internettools_utf8 containing one unit, but it depends on internettools package.
The dependency should be the other way around.

bbunicodeinfo and bbnormalizeunicode have duplicate implementations of utf8proc_NFD and utf8proc_NFC functions.
I will study this more ...
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

BeniBela

  • Hero Member
  • *****
  • Posts: 906
    • homepage
Re: Encoding agnostic functions for codepoints + an iterator
« Reply #18 on: June 27, 2016, 09:59:54 pm »
You mean the precomposed characters that can be represented either as one codepoint or as a codepoint for a base letter followed by a combining accent mark.
That is only a subset of all combined characters.

Everything that has a composed codepoint.

It would be nice if your Unicode stuff was isolated and usable from other projects, too.
Now there is a separate package internettools_utf8 containing one unit, but it depends on internettools package.
The dependency should be the other way around.

I do not use packages, so just copy the .pas/.inc files.

And there is a table generator in Ruby: https://github.com/benibela/utf8proc/

internettools_utf8 used to have other dependencies. Now that I have my own UTF-8 functions, I will remove it altogether.


bbunicodeinfo and bbnormalizeunicode have duplicate implementations of utf8proc_NFD and utf8proc_NFC functions.
I will study this more ...

Somewhere on the web is unicodeinfo.pas, which is theo's port of utf8proc.

bbunicodeinfo is unicodeinfo updated to the newest utf8proc version, with Unicode 8.

bbnormalizeunicode is bbunicodeinfo with everything removed except the de/composition stuff.

This also leads to three compile modes of the Internet Tools: bbnormalizeunicode together with the tables in BeRo's FLRE for upper/lower case (we have come full circle), bbunicodeinfo, or theo's unicodeinfo.
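As an illustration of how three such modes might be selected, here is a hypothetical sketch; the actual conditional defines used in the Internet Tools repository may differ:

Code: Pascal
// Hypothetical define names - check the Internet Tools sources for the real ones.
{$IF DEFINED(USE_BBNORMALIZEUNICODE)}
uses bbnormalizeunicode, FLRE;  // de/composition here, case tables from FLRE
{$ELSEIF DEFINED(USE_BBUNICODEINFO)}
uses bbunicodeinfo;             // full utf8proc port, updated to Unicode 8
{$ELSE}
uses unicodeinfo;               // theo's original utf8proc port
{$ENDIF}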

BeRo

  • New Member
  • *
  • Posts: 45
    • My site
Re: Encoding agnostic functions for codepoints + an iterator
« Reply #19 on: June 28, 2016, 05:39:08 am »

PUCU is updated to Unicode 9.0.0 now on GitHub.

Furthermore, you can always update PUCU yourself: download the latest Unicode dataset from ftp://www.unicode.org/Public/UCD/latest/ucd/ and ftp://www.unicode.org/Public/UCD/latest/ucd/extracted/ into the PUCU src/UnicodeData sub-directory, then rebuild and run PUCUConvertUnicode.dpr, then rebuild and run PUCUBuild.dpr, and you will have a PUCU.pas with current Unicode data tables.


serbod

  • Full Member
  • ***
  • Posts: 142
Re: Encoding agnostic functions for codepoints + an iterator
« Reply #20 on: June 28, 2016, 01:39:20 pm »
Then I made an enumerator for combined accented Unicode characters. I realize this is a can of worms because the rules for combining are so complex. Accented characters are only a subset.
Thus an external library like this one from BeRo would be ideal for combined codepoint stuff.

Such a big library must be optional.

For a start, a minimal toolset for common use is needed:
- a codepoint iterator/array
- conversion (ANSI/UTF-8/UTF-16/UTF-32) and normalization functions (C/D/KC/KD)
- general functions (search/replace/compare/case conversion/trim/copy/delete)
- functions to get metadata about a codepoint or a whole string (encoding, category, plane, block, script, etc.)

Some of these are already provided by the operating system, possibly with some auto-conversion (UTF-8 > UTF-16 on Windows). Some can be simplified (for WGL4/W1G). There is no need to pull the whole Unicode universe into a common application. If you use an encoding-agnostic function, you don't care about the conversion and its details; the visible result is the same.

So we don't need a decomposed Unicode iterator - just str.normalize(D) before the codepoint iterator, as sketched below.
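A minimal sketch of that normalize-then-iterate idea, reusing utf8proc_NFD from earlier in the thread (its string-in/string-out signature is an assumption here) together with UTF8CodepointSize from LazUtils' LazUTF8 unit (named UTF8CharacterLength in older versions):

Code: Pascal
uses LazUTF8, bbnormalizeunicode;

// Print each codepoint of the decomposed form on its own line.
procedure DumpDecomposedCodepoints(const S: string);
var
  NFD: string;
  i, CPLen: Integer;
begin
  NFD := utf8proc_NFD(S);                 // decompose first: 'ä' -> 'a' + U+0308
  i := 1;
  while i <= Length(NFD) do
  begin
    CPLen := UTF8CodepointSize(@NFD[i]);  // byte length of the codepoint at i
    WriteLn(Copy(NFD, i, CPLen));
    Inc(i, CPLen);
  end;
end;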

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Encoding agnostic functions for codepoints + an iterator
« Reply #21 on: July 12, 2016, 03:16:53 pm »
I have committed the units for encoding agnostic code into LazUtils package in Lazarus trunk.
See:
 http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#CodePoint_functions_for_encoding_agnostic_code

It deals with CodePoints and supports Delphi, too.
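For reference, a minimal sketch of that encoding-agnostic style, with unit and function names (LazUnicode, CodePointLength, CodePointCopy and the for-in codepoint enumerator) taken from the wiki page; check the page for the exact API:

Code: Pascal
uses LazUnicode;

procedure ShowCodePoints(const S: string);
var
  CP: string;
begin
  WriteLn('Codepoint count: ', CodePointLength(S));
  WriteLn('First 2 codepoints: ', CodePointCopy(S, 1, 2));
  for CP in S do   // the enumerator yields one codepoint per iteration
    Write('[', CP, ']');
  WriteLn;
end;

The same code is meant to work whether the strings are UTF-8 (Lazarus) or UTF-16 (Delphi), which is the point of the encoding-agnostic functions.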
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

 
