Topic: [SOLVED] How to split UTF8-strings into its characters (not bytes)?

Hartmut

AFAIK UTF8-characters can have a length of 1 to 4 bytes. Correct?

Is there a predefined type to store 1 UTF8-character? Or must I use something like string[4]? The only thing I found is 'system.WideChar', but this can hold only 2 bytes.

I want to split a UTF8-string into its characters (not bytes). Is there a predefined function for that?

I want to compare 2 UTF8-strings, character by character (not byte by byte). Therefore I want to walk through the 2 UTF8-strings in a loop, extracting each character and then comparing them.
Is there a predefined function which returns the n'th character (not byte) of a UTF8-string?

Is there a predefined function which returns the length (in bytes) of the n'th character (not byte) of a UTF8-string?

It would help me if you wrote not only the names of these functions/types but their units too (if you know them).
If possible, I want to use all this in FPC 3.0.4, but if it exists only in a newer FPC version, this would help me too.

Thanks a lot in advance.

Added later: I want a solution for a console program. Sorry that I did not mention this earlier (I thought it would make no difference).
« Last Edit: September 26, 2020, 09:24:21 am by Hartmut »

Thaddy

« Reply #1 on: September 24, 2020, 12:24:53 pm »
Quote
AFAIK UTF8-characters can have a length of 1 to 4 bytes. Correct?
Correct
Quote
Is there a predefined type to store 1 UTF8-character? Or must I use something like string[4]? The only thing I found is 'system.WideChar', but this can hold only 2 bytes.
Yes. CP_UTF8 type for AnsiString. It is something like this:
Code: Pascal
type
  UTF8String = type AnsiString(CP_UTF8);
Quote
I want to split a UTF8-string into its characters (not bytes). Is there a predefined function for that?
The simple AnsiString type has split functionality provided by sysutils (syshelph.inc). I am not quite sure if that is supposed to work for the above CP_UTF8.

Bart

« Reply #2 on: September 24, 2020, 12:52:51 pm »
The MaskEdit unit has a primitive function GetCodePoint(Index: Integer).
It retrieves a single codepoint (as string[7] IIRC).
It is not fast (it did not need to be for the purpose it was written for).

The LazUtf8 unit has several functions to deal with UTF8 strings like Utf8Length(), Utf8Copy() etc.

And there is an UTF8 suite by Theo (don't know right now where to find that one), which is faster than the above mentioned functions.
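
As a rough illustration of the LazUtf8 route, the sketch below fetches the n'th codepoint with UTF8Copy() and walks a string with UTF8Length(). It assumes the LazUtils package from Lazarus is on the unit search path. The helper name NthCodepoint is made up for this example, and the sample string is written as explicit UTF-8 bytes so that the source file encoding does not matter.

```pascal
program NthCodepointDemo;
{$mode objfpc}{$H+}

uses
  LazUTF8;  // from the LazUtils package shipped with Lazarus

// Hypothetical helper: returns the n'th codepoint (1-based) of s as a string.
function NthCodepoint(const s: String; n: Integer): String;
begin
  Result := UTF8Copy(s, n, 1);
end;

var
  s: String;
  i: Integer;
begin
  s := 'H'#$C3#$A4'tte';  // "Hätte" as explicit UTF-8 bytes
  // UTF8Length counts codepoints, Length counts bytes.
  for i := 1 to UTF8Length(s) do
    Writeln('codepoint ', i, ' occupies ', Length(NthCodepoint(s, i)), ' byte(s)');
end.
```

Note that UTF8Copy() has to scan from the start of the string on every call, so for long strings a single forward pass is likely the better choice.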

Bart

wp

« Reply #3 on: September 24, 2020, 12:59:47 pm »
Quote
I want to split a UTF8-string into its characters (not bytes). Is there a predefined function for that?
Use the enumerator in unit LazUnicode, which steps from code point to code point. Something like this (note that "ch" is a string, not a char!):
Code: Pascal
uses
  LazUnicode;

procedure TForm1.Button1Click(Sender: TObject);
var
  s: String;
  ch: String;
  n: Integer;
begin
  n := 0;
  s := 'Hätte-hätte-Fahrradkette';
  // ch receives a whole multi-byte sequence per iteration, never a single byte.
  for ch in s do
    if Length(ch) > 1 then inc(n);  // more than 1 byte means non-ASCII
  ShowMessage('The string "' + s + '" contains ' + IntToStr(n) + ' non-ASCII characters.');
end;

Quote
I want to compare 2 UTF8-strings, character by character (not byte by byte).
There is a UTF8-string compare function in unit LazUTF8: UTF8CompareStr(s1, s2).
Code: Pascal
// Requires LazUTF8 in the uses clause.
procedure TForm1.Button2Click(Sender: TObject);
var
  s1, s2: String;
  res: Integer;
  resChar: String;
begin
  s1 := 'Hätte';
  s2 := 'Hütte';
  res := UTF8CompareStr(s1, s2);
  // Map the sign of the result to a comparison symbol.
  if res < 0 then resChar := '<' else if res > 0 then resChar := '>' else resChar := '=';
  ShowMessage(Format('"%s" %s "%s"', [s1, resChar, s2]));
end;

Quote
Is there a predefined function, which returns the n'th character (not byte) of an UTF8-string?
You probably mean UTF8Pos() (in unit LazUTF8):
Code: Pascal
// Requires LazUTF8 in the uses clause.
procedure TForm1.Button3Click(Sender: TObject);
var
  s: String;
  p1, p2: Integer;
begin
  s := 'Hütte';
  // UTF8Pos returns the position counted in code points, not in bytes.
  p1 := UTF8Pos('ü', s);
  p2 := UTF8Pos('t', s);
  ShowMessage(Format('In string "%s", the character "%s" is at position %d, and "%s" is at position %d',
    [s, 'ü', p1, string('t'), p2]));
end;
« Last Edit: September 24, 2020, 01:55:25 pm by wp »
Mainly Lazarus trunk / fpc 3.2.0 / all 32-bit on Win-10, but many more...

Martin_fr

« Reply #4 on: September 24, 2020, 01:41:22 pm »
Quote
I want to compare 2 UTF8-strings, character by character (not byte by byte). Therefore I want to walk through the 2 UTF8-strings in a loop, extracting each character and then comparing them.

First of all, you need to be aware of the difference between codepoint and character, as some of the replies are about codepoints.

In many cases the 2 are the same. But not always.

Let's look at "ä".
In UTF-8 you will find the single codepoint U+00E4, encoded as 0xC3 0xA4.
But the following 2 codepoints describe the same letter: 0x61 (U+0061) followed by 0xCC 0x88 (U+0308).
The 2nd is the decomposed form.

Some chars only exist as a combination of two or more codepoints.
And it is possible to construct single characters that consist of hundreds of codepoints (though they do not exist in real languages).

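There are also surrogate codepoints. Those can only exist in pairs, and they are a UTF-16 mechanism: one pair encodes a single codepoint above U+FFFF.

Note that combining codepoints exist in UTF-8 AND UTF-16 (widestring). Well-formed UTF-8 contains no surrogates; it encodes codepoints above U+FFFF directly as 4-byte sequences.
In UTF-32 there are no surrogates either, but combining still exists.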

And last of all, a character is not necessarily equal to a glyph. Sometimes several chars are printed as one token (e.g. ligatures).


In most cases you will be fine by working with codepoints. But you need to decide for yourself.
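
The "ä" example above can be checked with a small console program. This is only a sketch with no extra units: the helper CountCodepoints is made up for this illustration, and it counts every byte that is not a UTF-8 continuation byte (10xxxxxx), which assumes well-formed input.

```pascal
program CombiningDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts codepoints by counting every byte that is
// not a continuation byte. Assumes s holds well-formed UTF-8.
function CountCodepoints(const s: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  composed, decomposed: AnsiString;
begin
  composed   := #$C3#$A4;     // U+00E4, precomposed "ä"
  decomposed := 'a'#$CC#$88;  // U+0061 followed by U+0308 (combining diaeresis)
  Writeln('composed:   ', Length(composed), ' bytes, ',
          CountCodepoints(composed), ' codepoint(s)');
  Writeln('decomposed: ', Length(decomposed), ' bytes, ',
          CountCodepoints(decomposed), ' codepoint(s)');
  // Both render as the same letter, yet a byte-wise or codepoint-wise
  // comparison treats them as different strings.
  Writeln('byte-wise equal: ', composed = decomposed);
end.
```

The precomposed form is 2 bytes and 1 codepoint, the decomposed form is 3 bytes and 2 codepoints. If such strings must compare as equal, they need to be normalized first, which is outside what this sketch covers.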

Aidex

« Reply #5 on: September 24, 2020, 02:33:22 pm »
UCS4Char is a 32 bit char, UCS4String is an array of UCS4Char.

function UnicodeStringToUCS4String(const s: UnicodeString): UCS4String;

The function converts a Unicode string to a UCS-4 encoded string.

https://www.freepascal.org/docs-html/rtl/system/unicodestringtoucs4string.html
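
A minimal sketch of how this could look for a UTF-8 string in a console program, using only the system unit: UTF8Decode() converts to UnicodeString first, then UnicodeStringToUCS4String() yields one array element per codepoint. The helper name CodepointsOf is made up for this example. As far as I know the RTL appends a terminating zero element to the array, so the loop stops at the first zero instead of relying on the array length.

```pascal
program Ucs4Demo;
{$mode objfpc}{$H+}

// Hypothetical helper: UTF-8 bytes -> UTF-16 -> one UCS4Char per codepoint.
function CodepointsOf(const utf8: AnsiString): UCS4String;
begin
  Result := UnicodeStringToUCS4String(UTF8Decode(utf8));
end;

var
  cp: UCS4String;
  i: Integer;
begin
  cp := CodepointsOf('H'#$C3#$A4'tte');  // "Hätte" as explicit UTF-8 bytes
  for i := 0 to High(cp) do
  begin
    if cp[i] = 0 then
      Break;  // terminating zero element, not part of the text
    Writeln('U+', HexStr(LongInt(cp[i]), 6));
  end;
end.
```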
« Last Edit: September 24, 2020, 02:38:43 pm by Aidex »

Martin_fr

« Reply #6 on: September 24, 2020, 02:38:37 pm »
Quote
UCS4Char is a 32 bit char, UCS4String is an array of UCS4Char.

Still has combining codepoints.

It is called char, but it is a codepoint. And an actual char may be several codepoints.

Blaazen

« Reply #7 on: September 24, 2020, 03:26:17 pm »
Here: https://wiki.freepascal.org/UTF8_Tools is the wiki page for the UTF8Tools mentioned by Bart. There's a download link at the bottom.
Lazarus 2.1.0 r64115 FPC 3.3.1 r40507 x86_64-linux-qt Chakra, Qt 4.8.7/5.13.2, Plasma 5.17.3
Lazarus 1.8.2 r57369 FPC 3.0.4 i386-win32-win32/win64 Wine 3.21

Try Eye-Candy Controls: https://sourceforge.net/projects/eccontrols/files/

Martin_fr

« Reply #8 on: September 24, 2020, 05:57:03 pm »
Quote
Here: https://wiki.freepascal.org/UTF8_Tools is the wiki page for the UTF8Tools mentioned by Bart. There's a download link at the bottom.
How up to date is it?

It seems to have a hardcoded data file, and the file date is from 2008?
There are new chars added to the Unicode standard every year.

Blaazen

« Reply #9 on: September 24, 2020, 06:48:28 pm »
Quote
There are new chars added to the Unicode standard every year.

Honestly, I doubt they added something useful. They add mostly "Emoji symbols and symbol modifiers for implementing skin tone diversity" (from their website, 2015).

Thaddy

« Reply #10 on: September 24, 2020, 06:55:56 pm »
Quote
Honestly, I doubt they added something useful. They add mostly "Emoji symbols and symbol modifiers for implementing skin tone diversity" (from their website, 2015).
Well, you gave the answer yourself: "skin tone diversity"

Blaazen

« Reply #11 on: September 24, 2020, 07:10:19 pm »
And AFAIK font designers do not implement it all, see https://stackoverflow.com/questions/34732718/why-isnt-there-a-font-that-contains-all-unicode-glyphs

Quote
So: "Why isn't there a font that contains all Unicode glyphs?", because that's been technically impossible since 2001.

Hartmut

« Reply #12 on: September 24, 2020, 09:56:48 pm »
Thanks a lot to all for your many replies. I will answer them one by one.

@Thaddy:
Yes, variables of type AnsiString get split functionality from TStringHelper, but variables of your type 'UTF8String' unfortunately do not seem to have it (FPC 3.0.4).
And all these split functions need a 'separator'. But I want to split a UTF8-string into all its characters, so how could this work?

@Bart:
I tried to use Unit MaskEdit, but the compiler showed me many errors like:
Code:
/usr/share/lazarus/1.8.4/lcl/units/x86_64-linux/wsmenus.o: In function »REGISTERMENUITEM«:
/home/mattias/tmp/lazarus-project1.8.4/lazarus-project_build/usr/share/lazarus/1.8.4/lcl//widgetset/wsmenus.pp:221: warning: undefined reference to »WSRegisterMenuItem«
/usr/share/lazarus/1.8.4/lcl/units/x86_64-linux/wsmenus.o: In function »REGISTERMENU«:
/home/mattias/tmp/lazarus-project1.8.4/lazarus-project_build/usr/share/lazarus/1.8.4/lcl//widgetset/wsmenus.pp:232: warning: undefined reference to »WSRegisterMenu«
What is strange is that I don't have a folder /home/mattias/ and never had.

But then I looked into Unit MaskEdit and saw, that this is a GUI Unit, while I want a solution for a console program. Sorry that I did not mention this (I thought it would make no difference).

In Unit MaskEdit I found your recommended function GetCodePoint() and first thought, I could make a copy of it, but it needs Unit LazUTF8, which I want to avoid, because:

With Unit LazUTF8 I faced a lot of problems and disadvantages in the past. Some examples I remember immediately:
 - on Windows in console programs, adding Unit LazUTF8 changes the charset of the filenames reported by sysutils.FindFirst() and FindNext(), of the results of ParamStr() and of the results of readln(). Without Unit LazUTF8 they return the Windows charset (Ansi 1252?); with Unit LazUTF8 they return UTF8. This difference does not make life easy.
 - over decades I have written a couple of common libraries for console programs, which partly have problems with these differing charsets, so I get wrong results with Unit LazUTF8 because of the changed charset
 - for a couple of programs I still use FPC 2.6.4 (with the same common libraries), but in FPC 2.6.4 the above functions never return UTF8 on Windows
 - the Windows charset generally is much easier than UTF8 (as we see now), because each char is only 1 byte long
 - in some cases (but not always) I had the problem that writeln() for an 'ansistring' showed damaged characters for Ä Ö Ü ä ö ü ß after I added Unit LazUTF8.
And there were more issues, which I just don't remember right now. So I want to avoid Unit LazUTF8, especially in console programs, wherever possible.

Question:
Is Unit LazUTF8 the only way in FPC to have such primitive UTF8-functions which I want now?

@wp: Thank you for your demos.
Your 1st demo = TForm1.Button1Click() works in a GUI program, but not in a console program (FPC 3.0.4). There Length(ch) is always = 1. Do you have an idea why? Currently I want this in a console program. Sorry that I did not mention this (I thought it would make no difference).
Correction: now it works in a console program too. My fault. Sorry for confusion.

Your 2nd demo = TForm1.Button2Click() also works in a console program. But I want to compare 2 UTF8-strings in a loop, character by character, so that I can do some action for every character, depending on whether both are equal or not. So I need a function which returns the n'th character of a UTF8-string. What I found is UTF8Copy(), but it needs Unit LazUTF8, which I want to avoid if possible (see above) for such a primitive usage.

In your 3rd demo = TForm1.Button3Click() we misunderstood each other: what I searched for was something similar to LazUTF8.UTF8Copy().
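
For reference, such a loop can be sketched without Unit LazUTF8, using only the system unit. The names Utf8SeqLen and NextCodepoint are made up for this sketch and are not RTL functions. It works on codepoints, not on combined characters, and it does no validation beyond never reading past the end of the string.

```pascal
program CompareUtf8;
{$mode objfpc}{$H+}

// Byte length (1..4) of the UTF-8 sequence that starts with lead byte b.
// Returns 1 for invalid lead bytes so that the caller always advances.
function Utf8SeqLen(b: Byte): Integer;
begin
  if b < $80 then Result := 1
  else if (b and $E0) = $C0 then Result := 2
  else if (b and $F0) = $E0 then Result := 3
  else if (b and $F8) = $F0 then Result := 4
  else Result := 1;
end;

// Returns the codepoint starting at byte index i (1-based) as a string
// and moves i to the start of the next codepoint.
function NextCodepoint(const s: AnsiString; var i: Integer): AnsiString;
var
  n: Integer;
begin
  n := Utf8SeqLen(Ord(s[i]));
  if i + n - 1 > Length(s) then
    n := Length(s) - i + 1;  // truncated sequence at the end of the string
  Result := Copy(s, i, n);
  Inc(i, n);
end;

var
  s1, s2, c1, c2: AnsiString;
  i1, i2, idx: Integer;
begin
  s1 := 'H'#$C3#$A4'tte';  // "Hätte" as explicit UTF-8 bytes
  s2 := 'H'#$C3#$BC'tte';  // "Hütte"
  i1 := 1;
  i2 := 1;
  idx := 0;
  // Walk both strings in parallel, one codepoint per step.
  while (i1 <= Length(s1)) and (i2 <= Length(s2)) do
  begin
    Inc(idx);
    c1 := NextCodepoint(s1, i1);
    c2 := NextCodepoint(s2, i2);
    if c1 = c2 then
      Writeln('codepoint ', idx, ': equal')
    else
      Writeln('codepoint ', idx, ': different');
  end;
end.
```

Length(c1) gives the byte length of the current codepoint. Because each string keeps its own byte index, the loop stays in step even when the two codepoints at the same position have different byte lengths.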

@all:
For tonight I must stop. I will check the other replies tomorrow and answer then. Have a good night.
« Last Edit: September 25, 2020, 08:55:22 am by Hartmut »

PascalDragon

« Reply #13 on: September 24, 2020, 10:42:41 pm »
Quote
There are new chars added to the Unicode standard every year.

Honestly, I doubt they added something useful. They add mostly "Emoji symbols and symbol modifiers for implementing skin tone diversity" (from their website, 2015).

The additions of the most recent Unicode version (namely 13.0 from March 2020) are listed here.

Martin_fr

« Reply #14 on: September 24, 2020, 11:13:12 pm »
Quote
What is strange is that I don't have a folder /home/mattias/ and never had.
But Mattias has....

When the installer is built, the ppu will contain the path to the unit source as it is on the machine that was used to build the installer.

So unless you recompile those units, the compiler thinks that is the path where the unit is (or used to be).

 
