Then I made an enumerator for accented Unicode characters built from combining sequences. I realize this is a can of worms: the rules for combining codepoints are complex, and accented characters are only a subset of them.
Thus an external library like this one from BeRo would be ideal for combined codepoint stuff.
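To make the combining problem concrete, here is a minimal sketch in Python (used here only for illustration) that groups each base codepoint with its trailing combining marks via the stdlib `unicodedata` module. This is a simplification: full grapheme-cluster segmentation (Unicode UAX #29) has many more rules, which is exactly why an external library helps.

```python
import unicodedata

def combined_chars(s):
    """Yield base codepoints together with their trailing combining marks.
    Simplified sketch: real grapheme segmentation (UAX #29) has more rules."""
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch):
            cluster += ch          # attach combining mark to current cluster
        else:
            if cluster:
                yield cluster
            cluster = ch           # start a new cluster at a base codepoint
    if cluster:
        yield cluster

# 'e' + combining acute accent, then a plain 'a'
print(list(combined_chars("e\u0301a")))  # ['e\u0301', 'a']
```

Even this tiny version shows why the enumerator cannot just walk codepoints one by one.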
Such a big library must be optional.
To start, a minimal toolset for common use is needed:
- codepoints iterator/array
- conversion (ANSI/UTF-8/UTF-16/UTF-32) and normalize functions (C/D/KC/KD)
- general functions (search/replace/compare/case conversion/trim/copy/delete)
- functions to get metadata for a codepoint or a whole string (encoding, category, plane, block, script, etc.)
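As a sketch of what the metadata functions in the list above could return, Python's stdlib `unicodedata` already covers name and category, and the plane falls out of the codepoint value; block and script, however, are not in the stdlib, which illustrates where a data table or external library becomes necessary:

```python
import unicodedata

ch = "\u00e9"  # é, LATIN SMALL LETTER E WITH ACUTE
print(f"U+{ord(ch):04X}")        # U+00E9
print(unicodedata.name(ch))      # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.category(ch))  # Ll (Letter, lowercase)
print(ord(ch) >> 16)             # 0 -> plane 0, the Basic Multilingual Plane
# Block and script are NOT exposed by the stdlib; those queries need
# the Unicode data files (Blocks.txt, Scripts.txt) or a third-party library.
```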
Some of these are already provided by the operating system, possibly with automatic conversion (UTF-8 to UTF-16 on Windows). Some can be simplified (e.g. restricted to the WGL4/W1G repertoires). There is no need to pull the whole Unicode universe into a common application. If you use an encoding-agnostic function, you don't care about the conversion details; the visible result is the same.
So we don't need a decomposed-Unicode iterator: just call str.normalize(D) before the codepoints iterator.