[SOLVED] How to split UTF8-strings into its characters (not bytes)?

Hartmut

Hero Member
Posts: 749

[SOLVED] How to split UTF8-strings into its characters (not bytes)?

« on: September 24, 2020, 12:12:58 pm »

AFAIK UTF8-characters can have a length of 1 to 4 bytes. Correct?

Is there a predefined type to store 1 UTF8-character? Or must I use something like string[4]? The only thing I found is 'system.WideChar', but this can hold only 2 bytes.

I want to split a UTF8-string into its characters (not bytes). Is there a predefined function for that?

I want to compare 2 UTF8-strings, character by character (not byte by byte). Therefore I want to loop through the 2 UTF8-strings, cutting out each character and then comparing them.
Is there a predefined function which returns the n'th character (not byte) of a UTF8-string?

Is there a predefined function which returns the length (in bytes) of the n'th character (not byte) of a UTF8-string?

It would help me if you wrote not only the names of these functions/types but their units too (if you know them).
If possible, I want to use all this in FPC 3.0.4, but if it only exists in a newer FPC version, this would help me too.

Thanks a lot in advance.

Added later: I want a solution for a console program. Sorry that I did not mention this earlier (I thought it would make no difference).

« Last Edit: September 26, 2020, 09:24:21 am by Hartmut »

Logged

Thaddy

Hero Member
Posts: 14373
Sensorship about opinions does not belong here.

Re: How to split UTF8-strings into its characters (not bytes)?

« Reply #1 on: September 24, 2020, 12:24:53 pm »

Quote from: Hartmut on September 24, 2020, 12:12:58 pm

AFAIK UTF8-characters can have a length of 1 to 4 bytes. Correct?

Correct
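
To illustrate the 1 to 4 bytes: the length of a sequence can be read from its first byte alone. The following is only a minimal sketch for a console program using nothing but the RTL. The name Utf8SeqLen is made up for this example, and the function does not validate the input (it does not reject overlong or out-of-range lead bytes, for instance).

```pascal
program Utf8LenDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: returns the byte length (1..4) of the UTF-8
// sequence that starts with LeadByte, or 0 if LeadByte is a
// continuation byte (10xxxxxx) or cannot start a sequence.
// No validation beyond the bit pattern is done.
function Utf8SeqLen(LeadByte: Byte): Integer;
begin
  if LeadByte < $80 then
    Result := 1                          // 0xxxxxxx: ASCII
  else if (LeadByte and $E0) = $C0 then
    Result := 2                          // 110xxxxx
  else if (LeadByte and $F0) = $E0 then
    Result := 3                          // 1110xxxx
  else if (LeadByte and $F8) = $F0 then
    Result := 4                          // 11110xxx
  else
    Result := 0;                         // continuation or invalid byte
end;

begin
  WriteLn(Utf8SeqLen($41));  // 'A'
  WriteLn(Utf8SeqLen($C3));  // first byte of "ä" (C3 A4)
  WriteLn(Utf8SeqLen($E2));  // first byte of the euro sign (E2 82 AC)
  WriteLn(Utf8SeqLen($F0));  // first byte of a 4-byte sequence
  WriteLn(Utf8SeqLen($A4));  // a continuation byte
end.
```

Applied to the byte at the start of the n'th code point, this gives the length in bytes that was asked for.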

Quote

Is there a predefined type to store 1 UTF8-character? Or must I use something like string[4]? The only thing I found is 'system.WideChar', but this can hold only 2 bytes.

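Yes and no: there is no dedicated type for a single UTF-8 character, but an AnsiString tagged with the CP_UTF8 code page can hold one (or many). It is declared something like this: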

Code: Pascal

type
  UTF8String = type AnsiString(CP_UTF8);

Quote

I want to split a UTF8-string into its characters (not bytes). Is there a predefined function for that?

The simple AnsiString type has split functionality provided by sysutils (syshelph.inc). I am not quite sure if that is supposed to work for the above CP_UTF8.
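
If no ready-made function fits, splitting by hand is not much code, because every byte that is not a continuation byte (bit pattern 10xxxxxx) starts a new code point. A rough sketch, assuming the input is well-formed UTF-8; SplitUtf8 and TUtf8Parts are names invented for this example, not RTL identifiers:

```pascal
program Utf8SplitDemo;
{$mode objfpc}{$H+}

type
  TUtf8Parts = array of String;  // hypothetical type for this sketch

// Splits S into one string per code point.
// Assumes S is well-formed UTF-8; no validation is done.
function SplitUtf8(const S: String): TUtf8Parts;
var
  i, Start, Count: Integer;
begin
  SetLength(Result, Length(S));  // upper bound: one code point per byte
  Count := 0;
  Start := 1;
  for i := 2 to Length(S) + 1 do
    // A code point ends where the string ends or where the next
    // byte is not a continuation byte.
    if (i > Length(S)) or ((Ord(S[i]) and $C0) <> $80) then
    begin
      Result[Count] := Copy(S, Start, i - Start);
      Inc(Count);
      Start := i;
    end;
  SetLength(Result, Count);
end;

var
  Parts: TUtf8Parts;
  k: Integer;
begin
  // "Hätte", written with byte escapes so that the encoding of the
  // source file does not matter.
  Parts := SplitUtf8('H'#$C3#$A4'tte');
  WriteLn(Length(Parts), ' code points');
  for k := 0 to High(Parts) do
    WriteLn(k + 1, ': ', Length(Parts[k]), ' byte(s)');
end.
```

Note that this splits into code points, not into user-perceived characters; combining sequences end up in separate array elements.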

Logged

Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

Bart

Hero Member
Posts: 5290

Re: How to split UTF8-strings into its characters (not bytes)?

« Reply #2 on: September 24, 2020, 12:52:51 pm »

In the MaskEdit unit there is a simple function GetCodePoint(Index: Integer).
It retrieves a single codepoint (as string[7] IIRC).
It is not fast (it did not need to be for the purpose it was written for).

The LazUtf8 unit has several functions to deal with UTF8 strings like Utf8Length(), Utf8Copy() etc.

And there is a UTF8 suite by Theo (I don't know right now where to find that one), which is faster than the above-mentioned functions.

Bart

Logged

wp

Hero Member
Posts: 11916

Re: How to split UTF8-strings into its characters (not bytes)?

« Reply #3 on: September 24, 2020, 12:59:47 pm »

Quote from: Hartmut on September 24, 2020, 12:12:58 pm

I want to split a UTF8-string into its characters (not bytes). Is there a predefined function for that?

Use the enumerator in unit LazUnicode, which steps from code point to code point. Something like this (note that "ch" is a string, not a char!):

Code: Pascal

uses
  LazUnicode;
 
procedure TForm1.Button1Click(Sender: TObject);
var
  s: String;
  ch: String;
  n: Integer;
begin
  n := 0;
  s := 'Hätte-hätte-Fahrradkette';
  for ch in s do
    if Length(ch) > 1 then inc(n);
  ShowMessage('The string "' + s + '" contains ' + IntToStr(n) + ' non-ASCII characters.');
end;

Quote from: Hartmut on September 24, 2020, 12:12:58 pm

I want to compare 2 UTF8-strings, character by character (not byte by byte).

There is a UTF8-string compare function in unit LazUTF8: UTF8CompareStr(s1, s2).

Code: Pascal

procedure TForm1.Button2Click(Sender: TObject);
var
  s1, s2: String;
  res: Integer;
  resChar: String;
begin
  s1 := 'Hätte';
  s2 := 'Hütte';
  res := UTF8CompareStr(s1, s2);
  if res < 0 then resChar := '<' else if res > 0 then resChar := '>' else resChar := '=';
  ShowMessage(Format('"%s" %s "%s"', [s1, resChar, s2]));
end;

Quote from: Hartmut on September 24, 2020, 12:12:58 pm

Is there a predefined function, which returns the n'th character (not byte) of an UTF8-string?

You probably mean UTF8Pos() (in unit LazUTF8):

Code: Pascal

procedure TForm1.Button3Click(Sender: TObject);
var
  s: String;
  p1, p2: Integer;
begin
  s := 'Hütte';
  p1 := UTF8Pos('ü', s);
  p2 := UTF8Pos('t', s);
  ShowMessage(Format('In string "%s", the character "%s" is at position %d, and "%s" is at position %d',
    [s, 'ü', p1, string('t'), p2]));
end; 

« Last Edit: September 24, 2020, 01:55:25 pm by wp »

Logged

Martin_fr

Administrator
Hero Member
Posts: 9870
Debugger - SynEdit - and more

Re: How to split UTF8-strings into its characters (not bytes)?

« Reply #4 on: September 24, 2020, 01:41:22 pm »

Quote from: Hartmut on September 24, 2020, 12:12:58 pm

I want to compare 2 UTF8-strings, character by character (not byte by byte). Therefore I want to iterate a loop to walk through the 2 UTF8-strings, cutting each character and then compare them.

First of all, you need to be aware of the difference between a codepoint and a character, because some of the replies are about codepoints.

In many cases the 2 are the same. But not always.

Let's look at "ä".
In utf8 you will usually find the single codepoint U+00E4, encoded as the bytes 0xC3 0xA4.
But the following 2 codepoints describe the same letter: U+0061 U+0308 ("a" followed by a combining diaeresis), encoded as the bytes 0x61 0xCC 0x88.
The 2nd is the decomposed form.

Some chars only exist as a combination of two or more codepoints.
And it is possible to construct single characters that consist of hundreds of codepoints (though they do not exist in real languages).

There are also surrogates. Those can only exist in pairs, and a pair encodes a single codepoint above U+FFFF as 2 utf16 code units.

Note that combining codepoints exist in utf8 AND utf16 (widestring), but surrogates only exist in utf16.
In utf32 there are no surrogates either, but combining still exists.

And last of all, a character is not necessarily equal to a glyph. Sometimes several chars are printed as one token (e.g. ligatures).

In most cases you will be fine by working with codepoints. But you need to decide for yourself.
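
To make the "ä" example concrete, here is a small console sketch (RTL only). CodePointCount is a name invented for this example; it simply counts all bytes that are not continuation bytes and assumes well-formed UTF-8.

```pascal
program ComposedDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points by counting every byte
// that is not a continuation byte (10xxxxxx).
// Assumes S is well-formed UTF-8.
function CodePointCount(const S: String): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Composed, Decomposed: String;
begin
  Composed   := #$C3#$A4;      // U+00E4
  Decomposed := 'a'#$CC#$88;   // U+0061 U+0308

  WriteLn('composed:   ', Length(Composed), ' bytes, ',
          CodePointCount(Composed), ' code point(s)');
  WriteLn('decomposed: ', Length(Decomposed), ' bytes, ',
          CodePointCount(Decomposed), ' code point(s)');
  // Both strings show the same letter, but a plain string
  // comparison looks at the bytes only.
  WriteLn('byte-wise equal: ', Composed = Decomposed);
end.
```

So a codepoint-by-codepoint comparison treats the two forms as different unless both strings are normalized first.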

Logged

From the wiki: Ide Tools, Code completion and more / IDE cool features / Debugger Status

Aidex

Jr. Member
Posts: 82

Re: How to split UTF8-strings into its characters (not bytes)?

« Reply #5 on: September 24, 2020, 02:33:22 pm »

UCS4Char is a 32 bit char, UCS4String is an array of UCS4Char.

function UnicodeStringToUCS4String(const s: UnicodeString): UCS4String;

The function converts a Unicode string to a UCS-4 encoded string.

https://www.freepascal.org/docs-html/rtl/system/unicodestringtoucs4string.html
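
Since the function takes a UnicodeString, a UTF-8 string has to be decoded first, for example with UTF8Decode from the System unit. A minimal sketch for a console program; Utf8ToUcs4 is a wrapper name made up for this example, and as far as I know FPC appends a terminating 0 element to the resulting array, which the loop below skips:

```pascal
program Ucs4Demo;
{$mode objfpc}{$H+}

// Hypothetical wrapper: UTF-8 bytes -> UnicodeString -> UCS4String.
function Utf8ToUcs4(const S: String): UCS4String;
begin
  Result := UnicodeStringToUCS4String(UTF8Decode(S));
end;

var
  U4: UCS4String;
  i: Integer;
begin
  // "Hätte", written with byte escapes so that the encoding of the
  // source file does not matter.
  U4 := Utf8ToUcs4('H'#$C3#$A4'tte');
  // Assumption: the last element is the terminating 0, so stop
  // one element earlier.
  for i := 0 to Length(U4) - 2 do
    WriteLn('U+', HexStr(LongInt(U4[i]), 4));
end.
```

Each array element is then one code point, so U4[n] gives direct access to the n'th one.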

« Last Edit: September 24, 2020, 02:38:43 pm by Aidex »

Logged

Martin_fr

Administrator
Hero Member
Posts: 9870
Debugger - SynEdit - and more

Re: How to split UTF8-strings into its characters (not bytes)?

« Reply #6 on: September 24, 2020, 02:38:37 pm »

Quote from: Aidex on September 24, 2020, 02:33:22 pm

UCS4Char is a 32 bit char, UCS4String is an array of UCS4Char.

Still has combining codepoints.

It is called char, but it is a codepoint. And an actual char may be several codepoints

Logged

From the wiki: Ide Tools, Code completion and more / IDE cool features / Debugger Status

Blaazen

Hero Member
Posts: 3237
POKE 54296,15

Re: How to split UTF8-strings into its characters (not bytes)?

« Reply #7 on: September 24, 2020, 03:26:17 pm »

Here: https://wiki.freepascal.org/UTF8_Tools is the wiki page for the UTF8Tools mentioned by Bart. There's a download link at the bottom.

Logged

Lazarus 2.3.0 (rev main-2_3-2863...) FPC 3.3.1 x86_64-linux-qt Chakra, Qt 4.8.7/5.13.2, Plasma 5.17.3
Lazarus 1.8.2 r57369 FPC 3.0.4 i386-win32-win32/win64 Wine 3.21

Try Eye-Candy Controls: https://sourceforge.net/projects/eccontrols/files/

Martin_fr

Administrator
Hero Member
Posts: 9870
Debugger - SynEdit - and more

Re: How to split UTF8-strings into its characters (not bytes)?

« Reply #8 on: September 24, 2020, 05:57:03 pm »

Quote from: Blaazen on September 24, 2020, 03:26:17 pm

Here: https://wiki.freepascal.org/UTF8_Tools is the wiki page for the UTF8Tools mentioned by Bart. There's a download link at the bottom.

How up to date is it?

It seems to have a hardcoded datafile, and the file date is from 2008?
There are new chars added to the unicode standard every year.

Logged

From the wiki: Ide Tools, Code completion and more / IDE cool features / Debugger Status

Blaazen

Hero Member
Posts: 3237
POKE 54296,15

Re: How to split UTF8-strings into its characters (not bytes)?

« Reply #9 on: September 24, 2020, 06:48:28 pm »

@ There are new chars added to the unicode standard every year.

Honestly, I doubt they added something useful. They add mostly "Emoji symbols and symbol modifiers for implementing skin tone diversity" (from their web from 2015).

Logged

Thaddy

Hero Member
Posts: 14373
Sensorship about opinions does not belong here.

Re: How to split UTF8-strings into its characters (not bytes)?

« Reply #10 on: September 24, 2020, 06:55:56 pm »

Quote from: Blaazen on September 24, 2020, 06:48:28 pm

Honestly, I doubt they added something useful. They add mostly "Emoji symbols and symbol modifiers for implementing skin tone diversity" (from their web from 2015).

Well, you gave the answer yourself: "skin tone diversity"

Logged

Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

Blaazen

Hero Member
Posts: 3237
POKE 54296,15

Re: How to split UTF8-strings into its characters (not bytes)?

« Reply #11 on: September 24, 2020, 07:10:19 pm »

And AFAIK font designers do not implement it all, see https://stackoverflow.com/questions/34732718/why-isnt-there-a-font-that-contains-all-unicode-glyphs

Quote

So: "Why isn't there a font that contains all Unicode glyphs?", because that's been technically impossible since 2001.

Logged

Hartmut

Hero Member
Posts: 749

Re: How to split UTF8-strings into its characters (not bytes)?

« Reply #12 on: September 24, 2020, 09:56:48 pm »

Thanks a lot to all for your many replies. I will answer them one by one.

@Thaddy:
Yes, variables of type AnsiString have split functionality via TStringHelper, but variables of your type 'UTF8String' unfortunately do not seem to have it (FPC 3.0.4).
And all these split functions need a 'separator'. But I want to split a UTF8-string into all of its characters, so how could this work?

@Bart:
I tried to use Unit MaskEdit, but the Compiler showed me many Compiler-Errors like:

Code:

/usr/share/lazarus/1.8.4/lcl/units/x86_64-linux/wsmenus.o: In function »REGISTERMENUITEM«:
/home/mattias/tmp/lazarus-project1.8.4/lazarus-project_build/usr/share/lazarus/1.8.4/lcl//widgetset/wsmenus.pp:221: warning: undefined reference to »WSRegisterMenuItem«
/usr/share/lazarus/1.8.4/lcl/units/x86_64-linux/wsmenus.o: In function »REGISTERMENU«:
/home/mattias/tmp/lazarus-project1.8.4/lazarus-project_build/usr/share/lazarus/1.8.4/lcl//widgetset/wsmenus.pp:232: warning: undefined reference to »WSRegisterMenu«

The strange thing is that I don't have a folder /home/mattias/ and never had one.

But then I looked into Unit MaskEdit and saw, that this is a GUI Unit, while I want a solution for a console program. Sorry that I did not mention this (I thought it would make no difference).

In Unit MaskEdit I found your recommended function GetCodePoint() and first thought, I could make a copy of it, but it needs Unit LazUTF8, which I want to avoid, because:

With Unit LazUTF8 I have faced a lot of problems and disadvantages in the past. Some examples I remember immediately:
- on Windows in console programs, adding Unit LazUTF8 changes the charset of the filenames reported by sysutils.FindFirst() and FindNext(), the charset of the results of ParamStr() and the results of readln(). Without Unit LazUTF8 they return the Windows charset (Ansi 1252?) / with Unit LazUTF8 they return UTF8. This difference does not make life easy.
- over the decades I have written a couple of common libraries for console programs, which partly have problems with these differing charsets, so I get wrong results with Unit LazUTF8 because of the changed charset
- for a couple of programs I still use FPC 2.6.4 (with the same common libraries), but in FPC 2.6.4 the above functions never return UTF8 on Windows
- the Windows charset is generally much easier than UTF8 (as we see now), because each char is only 1 byte long
- in some cases (but not always) I had the problem that writeln() for an 'ansistring' showed damaged characters for Ä Ö Ü ä ö ü ß after I added Unit LazUTF8.
And there were more issues, which I just can't recall right now. So I want to avoid Unit LazUTF8, especially in console programs, wherever possible.

Question:
Is Unit LazUTF8 the only way in FPC to have such primitive UTF8-functions which I want now?

@wp: Thank you for your demos.
Your 1st demo = TForm1.Button1Click() works in a GUI program, but not in a console program (FPC 3.0.4). There Length(ch) is always = 1. Do you have an idea why? Currently I want this in a console program. Sorry that I did not mention this (I thought it would make no difference).
Correction: now it works in a console program too. My fault. Sorry for confusion.

Your 2nd demo = TForm1.Button2Click() works also in a console program. But I want to compare 2 UTF8-strings in a loop, character by character, so that I can do some action for every character, depending if both are equal or not. So I need a function, which returns the n'th character of an UTF8-string. What I found is UTF8Copy(), but it needs Unit LazUTF8, which I want to avoid if possible (see above) for such a primitive usage.
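
For reference, such a loop can be written without Unit LazUTF8, using only the RTL. The following is just a sketch under the assumption that both strings are well-formed UTF-8; NextCodePoint is a name invented for this example, and it walks code points, not user-perceived characters.

```pascal
program Utf8CompareDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: returns the code point that starts at byte
// index Idx (1-based) and moves Idx to the start of the next one.
// Assumes S is well-formed UTF-8 and Idx points at a lead byte.
function NextCodePoint(const S: String; var Idx: Integer): String;
var
  Start: Integer;
begin
  Start := Idx;
  Inc(Idx);
  // Skip the continuation bytes (10xxxxxx) of this code point.
  while (Idx <= Length(S)) and ((Ord(S[Idx]) and $C0) = $80) do
    Inc(Idx);
  Result := Copy(S, Start, Idx - Start);
end;

var
  S1, S2, C1, C2: String;
  P1, P2, N: Integer;
begin
  S1 := 'H'#$C3#$A4'tte';   // "Hätte" as UTF-8 bytes
  S2 := 'H'#$C3#$BC'tte';   // "Hütte" as UTF-8 bytes
  P1 := 1;
  P2 := 1;
  N := 0;
  while (P1 <= Length(S1)) and (P2 <= Length(S2)) do
  begin
    Inc(N);
    C1 := NextCodePoint(S1, P1);
    C2 := NextCodePoint(S2, P2);
    if C1 = C2 then
      WriteLn('code point ', N, ': equal')
    else
      WriteLn('code point ', N, ': different (', Length(C1), ' vs ',
              Length(C2), ' bytes)');
  end;
end.
```

Because each string keeps its own byte index, the two strings stay in step even when their code points have different byte lengths.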

In your 3rd demo = TForm1.Button3Click() we misunderstood each other: what I was searching for was something similar to LazUTF8.UTF8Copy().

@all:
For tonight I must stop. I will check the other replies tomorrow and answer then. Have a good night.

« Last Edit: September 25, 2020, 08:55:22 am by Hartmut »

Logged

PascalDragon

Hero Member
Posts: 5481
Compiler Developer

Re: How to split UTF8-strings into its characters (not bytes)?

« Reply #13 on: September 24, 2020, 10:42:41 pm »

Quote from: Blaazen on September 24, 2020, 06:48:28 pm

@ There are new chars added to the unicode standard every year.

Honestly, I doubt they added something useful. They add mostly "Emoji symbols and symbol modifiers for implementing skin tone diversity" (from their web from 2015).

The additions of the most recent Unicode version (namely 13.0 from March 2020) are listed here.

Logged

Martin_fr

Administrator
Hero Member
Posts: 9870
Debugger - SynEdit - and more

Re: How to split UTF8-strings into its characters (not bytes)?

« Reply #14 on: September 24, 2020, 11:13:12 pm »

Quote

The strange thing is that I don't have a folder /home/mattias/ and never had one.

But Mattias has....

When the installer is built, the ppu will contain the path to the unit source, as it is on the machine that was used to build the installer.

So unless you recompile those units, the compiler thinks that is the path where the unit is (or used to be).

Logged

From the wiki: Ide Tools, Code completion and more / IDE cool features / Debugger Status
