[SOLVED] How to split UTF8-strings into it's characters (not bytes)?

Programming => General => Topic started by: Hartmut on September 24, 2020, 12:12:58 pm

Title: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on September 24, 2020, 12:12:58 pm

AFAIK UTF8-characters can have a length of 1 to 4 bytes. Correct?

Is there a predefined type to store 1 UTF8-character? Or must I use something like string[4]? The only thing I found is 'system.WideChar', but this can hold only 2 bytes.

I want to split a UTF8-string into its characters (not bytes). Is there a predefined function for that?

I want to compare 2 UTF8-strings, character by character (not byte by byte). Therefore I want to walk through the 2 UTF8-strings in a loop, extracting each character and then comparing them.
Is there a predefined function which returns the n'th character (not byte) of a UTF8-string?

Is there a predefined function which returns the length (in bytes) of the n'th character (not byte) of a UTF8-string?

It would help me if you write not only the names of these functions/types but their units too (if you know them).
If possible, I want to use all this in FPC 3.0.4, but if it exists only in a newer FPC version, this would help me too.

Thanks a lot in advance.

Added later: I want a solution for a console program. Sorry that I did not mention this earlier (I thought it would make no difference).

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Thaddy on September 24, 2020, 12:24:53 pm

Quote from: Hartmut on September 24, 2020, 12:12:58 pm

AFAIK UTF8-characters can have a length of 1 to 4 bytes. Correct?

Correct

Quote

Is there a predefined type to store 1 UTF8-character? Or must I use something like string[4]? The only thing I found is 'system.WideChar', but this can hold only 2 bytes.

Yes: the CP_UTF8 code page for AnsiString. The type is declared something like this:

Code: Pascal

type
  UTF8String = type AnsiString(CP_UTF8);

Quote

I want to split an UTF8-string into it's characters (not bytes). Is there a predefined function for that?

The simple AnsiString type has split functionality provided by sysutils (syshelph.inc). I am not quite sure if that is supposed to work for the above CP_UTF8.

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Bart on September 24, 2020, 12:52:51 pm

In the MaskEdit unit there is a primitive function GetCodePoint(Index: Integer).
It retrieves a single codepoint (as string[7] IIRC).
It is not fast (it did not need to be for the purpose it was written for).

The LazUtf8 unit has several functions to deal with UTF8 strings, like Utf8Length(), Utf8Copy() etc.

And there is a UTF8 suite by Theo (I don't know right now where to find that one), which is faster than the above mentioned functions.

Bart

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: wp on September 24, 2020, 12:59:47 pm

Quote from: Hartmut on September 24, 2020, 12:12:58 pm

I want to split an UTF8-string into it's characters (not bytes). Is there a predefined function for that?

Use the enumerator in unit LazUnicode, which steps from code point to code point. Something like this (note that "ch" is a string, not a char!):

Code: Pascal

uses
  LazUnicode;
 
procedure TForm1.Button1Click(Sender: TObject);
var
  s: String;
  ch: String;
  n: Integer;
begin
  n := 0;
  s := 'Hätte-hätte-Fahrradkette';
  for ch in s do
    if Length(ch) > 1 then inc(n);
  ShowMessage('The string "' + s + '" contains ' + IntToStr(n) + ' non-ASCII characters.');
end;

Quote from: Hartmut on September 24, 2020, 12:12:58 pm

I want to compare 2 UTF8-strings, character by character (not byte by byte).

There is a UTF8-string compare function in unit LazUTF8: UTF8CompareStr(s1, s2).

Code: Pascal

procedure TForm1.Button2Click(Sender: TObject);
var
  s1, s2: String;
  res: Integer;
  resChar: String;
begin
  s1 := 'Hätte';
  s2 := 'Hütte';
  res := UTF8CompareStr(s1, s2);
  if res < 0 then resChar := '<' else if res > 0 then resChar := '>' else resChar := '=';
  ShowMessage(Format('"%s" %s "%s"', [s1, resChar, s2]));
end;

Quote from: Hartmut on September 24, 2020, 12:12:58 pm

Is there a predefined function, which returns the n'th character (not byte) of an UTF8-string?

You probably mean UTF8Pos() (in unit LazUTF8):

Code: Pascal

procedure TForm1.Button3Click(Sender: TObject);
var
  s: String;
  p1, p2: Integer;
begin
  s := 'Hütte';
  p1 := UTF8Pos('ü', s);
  p2 := UTF8Pos('t', s);
  ShowMessage(Format('In string "%s", the character "%s" is at position %d, and "%s" is at position %d',
    [s, 'ü', p1, string('t'), p2]));
end; 

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Martin_fr on September 24, 2020, 01:41:22 pm

Quote from: Hartmut on September 24, 2020, 12:12:58 pm

I want to compare 2 UTF8-strings, character by character (not byte by byte). Therefore I want to iterate a loop to walk through the 2 UTF8-strings, cutting each character and then compare them.

First of all, you need to be aware of the difference between a codepoint and a character, because some of the replies are about codepoints.

In many cases the 2 are the same. But not always.

Let's look at "ä".
In UTF-8 you will find the single codepoint U+00E4, encoded as the bytes 0xC3 0xA4.
But the following 2 codepoints describe the same letter: U+0061 U+0308, encoded as 0x61 0xCC 0x88.
The 2nd is the decomposed form.

Some chars only exist as a combination of several codepoints.
And it is possible to construct single characters that consist of hundreds of codepoints (though they do not exist in real languages).

There are also surrogates. They belong to UTF-16 and can only exist in pairs: two 16-bit code units that together encode one codepoint above U+FFFF.

Note that combining codepoints exist in UTF-8, UTF-16 (widestring) and UTF-32 alike.
Surrogates only occur in UTF-16; well-formed UTF-8 and UTF-32 encode those codepoints directly.

And last of all, a character is not necessarily equal to a glyph. Sometimes several chars are printed as one glyph (e.g. ligatures).

In most cases you will be fine working with codepoints. But you need to decide for yourself.
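
To make the codepoint/character distinction concrete, here is a minimal console sketch in plain FPC (no LazUTF8). The function name CountCodepoints is made up for this illustration. The two forms of "ä" are written as explicit byte escapes so that the result does not depend on the encoding of the source file; this assumes the source has no {$codepage} directive.

```pascal
program CodepointsVsCharacters;
{$mode objfpc}{$H+}

// Counts codepoints by counting every byte that is NOT a UTF-8
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes s holds well-formed UTF-8.
function CountCodepoints(const s: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Precomposed, Decomposed: AnsiString;
begin
  Precomposed := #$C3#$A4;       // U+00E4
  Decomposed  := 'a'#$CC#$88;    // U+0061 followed by combining U+0308

  WriteLn('precomposed: ', Length(Precomposed), ' bytes, ',
          CountCodepoints(Precomposed), ' codepoint(s)');
  WriteLn('decomposed:  ', Length(Decomposed), ' bytes, ',
          CountCodepoints(Decomposed), ' codepoint(s)');
end.
```

The precomposed form is reported as 2 bytes and 1 codepoint, the decomposed form as 3 bytes and 2 codepoints, although both display as the same letter. A codepoint-by-codepoint comparison therefore treats them as different unless the strings are normalized first.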

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Aidex on September 24, 2020, 02:33:22 pm

UCS4Char is a 32 bit char, UCS4String is an array of UCS4Char.

function UnicodeStringToUCS4String(const s: UnicodeString): UCS4String;

The function converts a Unicode string to a UCS-4 encoded string.

https://www.freepascal.org/docs-html/rtl/system/unicodestringtoucs4string.html

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Martin_fr on September 24, 2020, 02:38:37 pm

Quote from: Aidex on September 24, 2020, 02:33:22 pm

UCS4Char is a 32 bit char, UCS4String is an array of UCS4Char.

Still has combining codepoints.

It is called char, but it is a codepoint. And an actual char may be several codepoints

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Blaazen on September 24, 2020, 03:26:17 pm

Here: https://wiki.freepascal.org/UTF8_Tools is the wiki page for the UTF8Tools mentioned by Bart. There's a download link at the bottom.

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Martin_fr on September 24, 2020, 05:57:03 pm

Quote from: Blaazen on September 24, 2020, 03:26:17 pm

Here: https://wiki.freepascal.org/UTF8_Tools (https://wiki.freepascal.org/UTF8_Tools) is wiki for UTF8Tools mentioned by Bart. There's download link at the bottom.

How up to date is it?

It seems to have a hardcoded datafile, and the file date is from 2008?
There are new chars added to the unicode standard every year.

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Blaazen on September 24, 2020, 06:48:28 pm

@ There are new chars added to the unicode standard every year.

Honestly, I doubt they added something useful. They add mostly "Emoji symbols and symbol modifiers for implementing skin tone diversity" (from their web from 2015).

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Thaddy on September 24, 2020, 06:55:56 pm

Quote from: Blaazen on September 24, 2020, 06:48:28 pm

Honestly, I doubt they added something useful. They add mostly "Emoji symbols and symbol modifiers for implementing skin tone diversity" (from their web from 2015).

Well, you gave the answer yourself: "skin tone diversity"

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Blaazen on September 24, 2020, 07:10:19 pm

And AFAIK font designers do not implement it all, see https://stackoverflow.com/questions/34732718/why-isnt-there-a-font-that-contains-all-unicode-glyphs

Quote

So: "Why isn't there a font that contains all Unicode glyphs?", because that's been technically impossible since 2001.

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on September 24, 2020, 09:56:48 pm

Thanks a lot to all for your many replies. I will answer them one by one.

@Thaddy:
Yes, variables of type AnsiString have split functionality via TStringHelper, but variables of your type 'UTF8String' unfortunately seem not to have it (FPC 3.0.4).
And all these split functions need a 'separator'. But I want to split a UTF8-string into all its characters, so how could this work?

@Bart:
I tried to use Unit MaskEdit, but the compiler showed me many errors like:

Code:

/usr/share/lazarus/1.8.4/lcl/units/x86_64-linux/wsmenus.o: In function »REGISTERMENUITEM«:
/home/mattias/tmp/lazarus-project1.8.4/lazarus-project_build/usr/share/lazarus/1.8.4/lcl//widgetset/wsmenus.pp:221: warning: undefined reference to »WSRegisterMenuItem«
/usr/share/lazarus/1.8.4/lcl/units/x86_64-linux/wsmenus.o: In function »REGISTERMENU«:
/home/mattias/tmp/lazarus-project1.8.4/lazarus-project_build/usr/share/lazarus/1.8.4/lcl//widgetset/wsmenus.pp:232: warning: undefined reference to »WSRegisterMenu«

Strangely, I don't have a folder /home/mattias/ and never had.

But then I looked into Unit MaskEdit and saw that it is a GUI unit, while I want a solution for a console program. Sorry that I did not mention this (I thought it would make no difference).

In Unit MaskEdit I found your recommended function GetCodePoint() and first thought I could make a copy of it, but it needs Unit LazUTF8, which I want to avoid, because:

With Unit LazUTF8 I faced a lot of problems and disadvantages in the past. Some examples I remember immediately:
- on Windows, in console programs, adding Unit LazUTF8 changes the charset of the filenames reported by sysutils.FindFirst() and FindNext(), of the results of ParamStr() and of the results of readln(). Without Unit LazUTF8 they return the Windows charset (Ansi 1252?); with Unit LazUTF8 they return UTF8. This difference does not make life easy.
- over decades I have written a couple of common libraries for console programs, which partly have problems with these differing charsets, so I get wrong results with Unit LazUTF8 because of the changed charset
- for a couple of programs I still use FPC 2.6.4 (with the same common libraries), but in FPC 2.6.4 the above functions never return UTF8 on Windows
- the Windows charset generally is much easier than UTF8 (as we see now), because each char is only 1 byte long
- in some cases (but not always) I had the problem that writeln() for an 'ansistring' showed damaged characters for Ä Ö Ü ä ö ü ß after I added Unit LazUTF8.
And there were more issues which I just don't remember right now. So I want to avoid Unit LazUTF8, especially in console programs, wherever possible.

Question:
Is Unit LazUTF8 the only way in FPC to get the primitive UTF8-functions which I want now?

@wp: Thank you for your demos.
Your 1st demo = TForm1.Button1Click() works in a GUI program, but not in a console program (FPC 3.0.4). There Length(ch) is always = 1. Do you have an idea why? Currently I want this in a console program. Sorry that I did not mention this (I thought it would make no difference).
Correction: now it works in a console program too. My fault. Sorry for the confusion.

Your 2nd demo = TForm1.Button2Click() also works in a console program. But I want to compare 2 UTF8-strings in a loop, character by character, so that I can do some action for every character, depending on whether both are equal or not. So I need a function which returns the n'th character of a UTF8-string. What I found is UTF8Copy(), but it needs Unit LazUTF8, which I want to avoid if possible (see above) for such a primitive usage.

In your 3rd demo = TForm1.Button3Click() we misunderstood each other: what I was looking for was something similar to LazUTF8.UTF8Copy().
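
A minimal sketch of such a lockstep comparison without LazUTF8 could look like the following. NextCodepoint is a hypothetical helper written for this illustration, not an RTL or LazUtils function. It assumes well-formed UTF-8 and compares codepoints, not user-perceived characters. The test strings are given as byte escapes so the sketch does not depend on the source file encoding (no {$codepage} directive assumed).

```pascal
program CompareUtf8;
{$mode objfpc}{$H+}

// Returns the codepoint that starts at byte index Idx and moves Idx
// to the first byte of the following codepoint.
// Hypothetical helper; assumes s is well-formed UTF-8.
function NextCodepoint(const s: AnsiString; var Idx: Integer): AnsiString;
var
  Start: Integer;
begin
  Start := Idx;
  Inc(Idx);
  // continuation bytes (10xxxxxx) still belong to the current codepoint
  while (Idx <= Length(s)) and ((Ord(s[Idx]) and $C0) = $80) do
    Inc(Idx);
  Result := Copy(s, Start, Idx - Start);
end;

var
  s1, s2, c1, c2: AnsiString;
  p1, p2, n: Integer;
begin
  s1 := 'H'#$C3#$A4'tte';   // "Hätte" as UTF-8 bytes
  s2 := 'H'#$C3#$BC'tte';   // "Hütte" as UTF-8 bytes
  p1 := 1;
  p2 := 1;
  n := 0;
  while (p1 <= Length(s1)) and (p2 <= Length(s2)) do
  begin
    c1 := NextCodepoint(s1, p1);
    c2 := NextCodepoint(s2, p2);
    Inc(n);
    if c1 = c2 then
      WriteLn('codepoint ', n, ': equal')
    else
      WriteLn('codepoint ', n, ': different');
  end;
  if (p1 <= Length(s1)) or (p2 <= Length(s2)) then
    WriteLn('the strings have a different number of codepoints');
end.
```

For the two sample strings only the second codepoint is reported as different. Each string is scanned exactly once, so the cost grows linearly with the input length.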

@all:
For tonight I must stop. I will check the other replies tomorrow and answer then. Have a good night.

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: PascalDragon on September 24, 2020, 10:42:41 pm

Quote from: Blaazen on September 24, 2020, 06:48:28 pm

@ There are new chars added to the unicode standard every year.

Honestly, I doubt they added something useful. They add mostly "Emoji symbols and symbol modifiers for implementing skin tone diversity" (from their web from 2015).

The additions of the most recent Unicode version (namely 13.0 from March 2020) are listed here (https://unicode.org/versions/Unicode13.0.0/).

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Martin_fr on September 24, 2020, 11:13:12 pm

Quote

Strange is, that I don't have a folder /home/mattias/ and never had.

But Mattias has....

When the installer is built, the ppu will contain the path to the unit source as it is on the machine that was used to build the installer.

So unless you recompile those units, the compiler thinks that is the path where the unit is (or used to be).

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: winni on September 24, 2020, 11:29:10 pm

Quote from: PascalDragon on September 24, 2020, 10:42:41 pm

The additions of the most recent Unicode version (namely 13.0 from March 2020) are listed here (https://unicode.org/versions/Unicode13.0.0/).

Yeah - Emojis are fighting for women's liberation!

Now we got MRS SANTA CLAUS !!!

Instead of abolishing the Christmas trash they create new stuff

Winni

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: BeniBela on September 25, 2020, 12:31:20 am

I have built my own enumerator for that: http://hg.benibela.de/bbutils/file/a94b6026f7d0/bbutils.pas#l509

Quote from: Martin_fr on September 24, 2020, 05:57:03 pm

Quote from: Blaazen on September 24, 2020, 03:26:17 pm
Here: https://wiki.freepascal.org/UTF8_Tools (https://wiki.freepascal.org/UTF8_Tools) is wiki for UTF8Tools mentioned by Bart. There's download link at the bottom.
How up to date is it?

It seems to have a hardcoded datafile, and the file date is from 2008?
There are new chars added to the unicode standard every year.

Then it is from 2008.

I took that file and updated it in 2016. Guess it is time to update it again: http://hg.benibela.de/internettools/file/default/data/bbunicodeinfo.pas

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Thaddy on September 25, 2020, 08:41:37 am

Why this? Just curious...

Code: Pascal

const
  UTF8PROC_NULLTERM = 1 shl 0;

It has no impact on code generation, but it looks silly.

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: PascalDragon on September 25, 2020, 09:08:33 am

Quote from: winni on September 24, 2020, 11:29:10 pm

Quote from: PascalDragon on September 24, 2020, 10:42:41 pm

The additions of the most recent Unicode version (namely 13.0 from March 2020) are listed here (https://unicode.org/versions/Unicode13.0.0/).

Yeah - Emojis are fighting for women's liberation!

It's about diversity. The Unicode consortium is working on getting the gender specific emojis done in both a variant of the other gender and a gender-neutral one. I personally - as a genderfluid person - appreciate that very much.

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on September 25, 2020, 10:43:07 am

Thanks again to all for your many replies. I will continue to answer them one by one.

@Martin_fr: (from reply #4 and #6)
I had never heard of Codepoints before. Thanks for clarification. I will keep it in mind.

Quote from: Aidex on September 24, 2020, 02:33:22 pm

UCS4Char is a 32 bit char, UCS4String is an array of UCS4Char.
function UnicodeStringToUCS4String(const s: UnicodeString): UCS4String;
The function converts a Unicode string to a UCS-4 encoded string.
https://www.freepascal.org/docs-html/rtl/system/unicodestringtoucs4string.html

This sounds interesting, because it seems not to need Unit LazUTF8. But I could not get it to work. Your link says that a widestring manager is required. After searching with Google I found that on Linux "uses cwstring" should do the job. For Windows I only found, within a reasonable time, that "uses LazUTF8" would automatically include a widestring manager. But it did not work on either OS (FPC 3.0.4). Here is my demo:

Code: Pascal

{$mode objfpc}{$H+}
 
{$IFDEF LINUX}
   uses cwstring; // install widestring manager
{$ENDIF}
{$IFDEF WINDOWS}
   uses LazUTF8;  // should include the widestring manager (?)
{$ENDIF}  
 
procedure test1; 
   var s: UnicodeString;
       z: UCS4String;
       i: integer;
   begin
   s:='AB äöüß ÄÖÜ 12';
   writeln('len(s)=', length(s));
   for i:=1 to length(s) do  write(ord(s[i]), ' ');
   writeln;
 
   z:= UnicodeStringToUCS4String(s);
   writeln('len(z)=', Length(z));
   for i:=0 to High(z) do  write(ord(z[i]), ' ');
   writeln;
   end;

The output shows (both for Windows and Linux):
len(s)=21
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50
len(z)=22
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50 0
Do you have an idea why it does not work?

Quote from: Blaazen on September 24, 2020, 03:26:17 pm

Here: https://wiki.freepascal.org/UTF8_Tools (https://wiki.freepascal.org/UTF8_Tools) is wiki for UTF8Tools mentioned by Bart. There's download link at the bottom.

Thank you for that link. I had a look into the sources and found that they all require Unit LazUTF8, which I want to avoid if possible (see reply #12) for such a primitive usage as I have now. But I made a note of this link for possible future needs.

Quote from: Martin_fr on September 24, 2020, 11:13:12 pm

When the installer is build, the ppu will contain the path to the unit source, as it is on the machine that was used to build the installer.
So unless you recompile those units, the compiler thinks that is the path where the unit is (or used to be).

Thanks for this info, I did not know that before.

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: winni on September 25, 2020, 12:31:12 pm

@Hartmut

Hi!

Don't get confused by all the infos about UTF8.

I will show you a very simple way to separate the UTF8chars of a string into a
StringList.

Code: Pascal

uses ....LazUTF8, lclType;     
 
 
procedure TForm1.Button4Click(Sender: TObject);
const   MyUTF8 = 'Wir müssen uns nicht ärgern über UTF8! 👀 ';
var St : TStringList;
    UChar : TUTF8Char;
    i : integer;
begin
St := TStringList.Create;
for i := 1 to Utf8Length(MyUTF8) do
   begin
     UChar := UTF8Copy (MyUTF8,i,1);
     St.add(UChar);
   end;
showMessage (St.Text);
St.Free;
end;
 
 
 

Winni

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Martin_fr on September 25, 2020, 01:05:43 pm

Quote from: winni on September 25, 2020, 12:31:12 pm

I show you a very simple way to separe the UTF8chars of a string into a
StringList.

Code: Pascal
for i := 1 to Utf8Length(MyUTF8) do
begin
UChar := UTF8Copy (MyUTF8,i,1);

Fine for shorter strings...

But try that on longer strings. Let's say 100,000 bytes long. That takes 12 seconds on a I9 8600K @4.7Ghz (no debugging / O3). And that is without adding to the stringlist, only doing 100,000 Utf8Copy calls.

Try it with 200,000 => 50 seconds.

It's O(n^2): each Utf8Copy has to scan from the start of the string to find codepoint i, so doubling the input size roughly quadruples the run time.

Something like this should do the work (getting codepoints, as Utf8Copy also gets codepoints):

Code: Pascal

 CurCharStart := 1;
 while CurCharStart <= Length(MyUTF8) do begin
   NextCharStart := CurCharStart + 1;
   // skip continuation bytes (10xxxxxx); "<=" so that the last byte is not dropped
   while (NextCharStart <= Length(MyUTF8)) and ((ord(MyUTF8[NextCharStart]) and $C0) = $80) do
     inc(NextCharStart);
   UChar := copy(MyUTF8, CurCharStart, NextCharStart - CurCharStart);
 
   // Process the codepoint in UChar
 
   CurCharStart := NextCharStart;
 end;
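
Wrapped into a complete console program, the single-pass loop above might look like this sketch. SplitCodepoints is a name invented for this example. Both loop conditions use "<=", which is needed so that the last byte of the string is processed. The input is written as byte escapes and assumed to be well-formed UTF-8 (no {$codepage} directive in the source).

```pascal
program SplitUtf8;
{$mode objfpc}{$H+}

uses
  Classes;

// Appends every codepoint of s to List as a separate string.
// Single pass over the bytes, so the cost is O(n).
// Illustrative helper; no validation of the input is done.
procedure SplitCodepoints(const s: AnsiString; List: TStrings);
var
  CurStart, NextStart, Len: Integer;
begin
  Len := Length(s);
  CurStart := 1;
  while CurStart <= Len do
  begin
    NextStart := CurStart + 1;
    // continuation bytes (10xxxxxx) belong to the current codepoint
    while (NextStart <= Len) and ((Ord(s[NextStart]) and $C0) = $80) do
      Inc(NextStart);
    List.Add(Copy(s, CurStart, NextStart - CurStart));
    CurStart := NextStart;
  end;
end;

var
  Parts: TStringList;
  i: Integer;
begin
  Parts := TStringList.Create;
  try
    SplitCodepoints('H'#$C3#$BC'tte', Parts);   // "Hütte" as UTF-8 bytes
    for i := 0 to Parts.Count - 1 do
      WriteLn('codepoint ', i + 1, ': ', Length(Parts[i]), ' byte(s)');
  finally
    Parts.Free;
  end;
end.
```

For "Hütte" this yields five entries, of which only the second is 2 bytes long. Only the byte lengths are printed here, because how the bytes themselves appear on a console depends on the terminal encoding.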
 

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on September 25, 2020, 01:30:17 pm

Thanks Winni for that demo. It's easy to understand, but it also requires Unit LazUTF8, which I want to avoid (see reply #12) for such a primitive usage as I have now.
If nobody gets function system.UnicodeStringToUCS4String() from Aidex to work (see problem in reply #19), then I will write the 3 very simple UTF8-functions myself, which are all I need now for my primitive usage.

Thanks Martin_fr for that improvement. If I write my own functions, I was thinking of something like that. But I still hope that someone finds out why function system.UnicodeStringToUCS4String() from Aidex does not work in my case (see reply #19).

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: rvk on September 25, 2020, 01:51:39 pm

Quote from: Hartmut on September 25, 2020, 10:43:07 am

The output shows (both for Windows and Linux):
len(s)=21
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50
len(z)=22
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50 0
Do you have an idea why it does not work?

UCS4String always has a 'hard' #0 termination (explicit terminating #0).
This is seen in UnicodeStringToUCS4String() and the called UCS4Encode().
(reslen is reset before filling the string and there is a hard #0 placed at the end)

You can see that UCS4StringToUnicodeString and UCS4StringToWideString both strip that #0. If there were no #0, all these functions would fail.
You can also see it in UCS4Decode():

Code: Pascal

procedure UCS4Decode(const s: UCS4String; dest: PWideChar);
var
  i: sizeint;
  nc: UCS4Char;
begin
  for i:=0 to length(s)-2 do  { -2 because s contains explicit terminating #0 }
 

BTW. This is also the case in Delphi.
https://en.delphipraxis.net/topic/1820-ucs4strings/

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Martin_fr on September 25, 2020, 02:06:36 pm

Quote from: Hartmut on September 25, 2020, 10:43:07 am

Code: Pascal
{$mode objfpc}{$H+}
var s: UnicodeString;
z: UCS4String;
i: integer;
begin
s:='AB äöüß ÄÖÜ 12';
z:= UnicodeStringToUCS4String(s);

The output shows (both for Windows and Linux):
len(s)=21
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50
len(z)=22
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50 0
Do you have an idea why it does not work?

https://wiki.freepascal.org/FPC_Unicode_support#UnicodeString.2FWideString

UnicodeString is not Utf8String

UnicodeString is a string with 16bit codeunits. (words)
Utf8String is a string with 8 bit codeunits (bytes)

Assigning
s:='AB äöüß ÄÖÜ 12';

AFAIK converts the string 'AB äöüß ÄÖÜ 12' from the source codepage to UTF-16.

I am not sure why the ä (195 164) is not converted to a single codepoint.
Probably you need to include
{$codepage utf8}
on top of your source.

UnicodeStringToUCS4String then eliminates surrogates, fitting them into a single ucs4 codepoint.

Combining codepoints are left in place.
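
One way to sidestep the string-literal and codepage questions is to convert explicitly. The sketch below keeps the UTF-8 bytes in a plain AnsiString, decodes them with System.UTF8Decode (which treats its argument as UTF-8 regardless of the string's codepage tag) and only then calls UnicodeStringToUCS4String. It assumes FPC 3.0 or later and a source file without a {$codepage} directive; the input is written as byte escapes for that reason. Whether a widestring manager is needed for these two calls was not verified here, so treat that as an open assumption.

```pascal
program Utf8ToCodepoints;
{$mode objfpc}{$H+}

var
  u: AnsiString;       // raw UTF-8 bytes
  s: UnicodeString;    // UTF-16 code units
  z: UCS4String;       // UTF-32 codepoints plus a terminating 0 element
  i: Integer;
begin
  u := 'A'#$C3#$A4'B';                 // "AäB" as UTF-8 bytes
  s := UTF8Decode(u);                  // explicit UTF-8 -> UTF-16
  z := UnicodeStringToUCS4String(s);   // UTF-16 -> UTF-32

  WriteLn('bytes in u:      ', Length(u));
  WriteLn('code units in s: ', Length(s));

  // High(z) - 1 skips the terminating 0 that a UCS4String always carries.
  for i := 0 to High(z) - 1 do
    Write(HexStr(Ord(z[i]), 4), ' ');
  WriteLn;
end.
```

With a correct conversion, "ä" ends up as the single value 00E4 instead of the two values C3 and A4. Note that 00E4 is also the byte value of "ä" in Latin-1 and Windows-1252, because the first 256 Unicode codepoints coincide with Latin-1; seeing E4 in a UnicodeString or UCS4String therefore does not mean the data is in a Windows charset.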

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: winni on September 25, 2020, 03:13:55 pm

Hi!

Martin_fr's remark that this is bloody slow for very long strings in Lazarus is right.
I noticed that when I wanted to read a JSON multipolygon with the borders of Europe:

one line with > 1 million chars

But if you know the internals of UTF8 you can write a very short solution without needing any UTF8 units.

Code: Pascal

procedure TForm1.Button1Click(Sender: TObject);
const MyUTF : string = 'ÄÖÜ ²³¼½µ Test æſðđŋħ🂡test';
 
var s: string;
    i : integer= 0;
    len : integer;
    p : pchar;
begin
 
p := pchar(MyUTF);
while i < length(myUTF) do
 begin
   // the start byte determines how many bytes the codepoint occupies
   case ord(p^) of
            0..127  : len := 1;
            192..223: len := 2;
            224..239: len := 3;
            240..244: len := 4;
            else      len := 1; // stray continuation or invalid byte: step over it
   end; // case
 setLength(s,len);
 move (p^,s[1],len);
 showMessage (s+' / '+IntToStr(len));
 inc(p,len);
 inc(i,len);
 end; //while
end;
 

As you can see the length of a UTF8char or codepoint is defined through the start byte.

There is no error checking done, but I hope the era of broken UTF8chars is over.
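
If some error checking is wanted after all, the start-byte table can be wrapped into two small functions that also answer the original questions (n'th character and its length in bytes). The names Utf8LeadLength and Utf8Nth are invented for this sketch. It only checks the start byte and the remaining string length, not the continuation bytes, so it is not a full UTF-8 validator.

```pascal
program Utf8NthChar;
{$mode objfpc}{$H+}

// Byte length of a codepoint whose start byte is b,
// or 0 if b cannot start a codepoint.
function Utf8LeadLength(b: Byte): Integer;
begin
  case b of
    $00..$7F: Result := 1;
    $C2..$DF: Result := 2;
    $E0..$EF: Result := 3;
    $F0..$F4: Result := 4;
  else
    Result := 0;   // continuation byte or invalid start byte
  end;
end;

// Returns the n'th codepoint (1-based) of s, or '' if s has fewer
// codepoints or is malformed/truncated before that position.
function Utf8Nth(const s: AnsiString; n: Integer): AnsiString;
var
  i, len: Integer;
begin
  Result := '';
  i := 1;
  while i <= Length(s) do
  begin
    len := Utf8LeadLength(Ord(s[i]));
    if (len = 0) or (i + len - 1 > Length(s)) then
      Exit;                       // malformed or truncated input
    Dec(n);
    if n = 0 then
    begin
      Result := Copy(s, i, len);
      Exit;
    end;
    Inc(i, len);
  end;
end;

var
  s: AnsiString;
begin
  s := 'H'#$C3#$BC'tte';          // "Hütte" as UTF-8 bytes
  WriteLn('2nd codepoint: ', Length(Utf8Nth(s, 2)), ' byte(s)');
  WriteLn('3rd codepoint: ', Length(Utf8Nth(s, 3)), ' byte(s)');
  WriteLn('9th codepoint: ', Length(Utf8Nth(s, 9)), ' byte(s)');
end.
```

The three lines report 2, 1 and 0 bytes. Calling Utf8Nth for every index in a loop is again O(n^2), so for walking a whole string a single-pass loop is the better fit; Utf8Nth is meant for occasional lookups.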

Winni

PS 🂡 is a product of this editor.
It is the Ace of Spades: RIP Lemmy Kilmister

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on September 25, 2020, 03:40:51 pm

Thanks rvk for your post, but we misunderstood each other: the problem is not the additional "0" at the end. The result is wrong, because function UnicodeStringToUCS4String() should split the input into its characters (not bytes). That means an input of e.g. character "ä" = 2 bytes = "195 164" should be converted into 1 value, not 2 values.

Thanks a lot Martin_fr for your reply. You are right, UnicodeString is not Utf8String, I did not pay attention to it. As you can see from the above output, string 's' obviously is in UTF8, not Unicode.

As recommended I added {$codepage utf8} to the top of my source. But after this, string 's' contained Windows-charset (Ansi 1252?), which is very strange (currently I'm on Linux):
len(s)=14
41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32
len(z)=15
41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32 00
Info: my sourcefile was already in UTF8.

Then I changed my code to:

Code: Pascal

...
var s0: UTF8String;
    s: UnicodeString;
begin
s0:='AB äöüß ÄÖÜ 12'; // should store as UTF8
s:=UnicodeString(s0); // should convert to Unicode to make UnicodeStringToUCS4String() happy
...

but the result again was:
len(s)=21
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32
len(z)=22
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32 00

Does anybody know, why above
s:=UnicodeString(s0);
does not convert from UTF8 to Unicode?

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: rvk on September 25, 2020, 04:09:21 pm

Quote from: Hartmut on September 25, 2020, 03:40:51 pm

Does anybody know, why above
s:=UnicodeString(s0);
does not convert from UTF8 to Unicode?

String literals have always been a bit confusing to me in FPC.

Take a look at this:
https://wiki.freepascal.org/Unicode_Support_in_Lazarus#String_Literals
According to that page, assigning your literal string to a unicodestring will fail (without the {$codepage utf8}).

What version of FPC are you using??

Using const s : String = 'AB äöüß ÄÖÜ 12'; it might work better.
or even var s : String = 'AB äöüß ÄÖÜ 12';

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: PascalDragon on September 25, 2020, 04:36:09 pm

Quote from: Hartmut on September 25, 2020, 03:40:51 pm

Then I changed my code to:
Code: Pascal
...
var s0: UTF8String;
s: UnicodeString;
begin
s0:='AB äöüß ÄÖÜ 12'; // should store as UTF8
s:=UnicodeString(s0); // should convert to Unicode to make UnicodeStringToUCS4String() happy
...
but the result again was:
len(s)=21
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32
len(z)=22
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32 00

Does anybody know, why above
s:=UnicodeString(s0);
does not convert from UTF8 to Unicode?

The following command line utility prints the correct data (I have stored it as UTF-8 with BOM):

Code: Pascal

program tunicode;
 
{$mode objfpc}{$H+}
{$codepage utf8}
 
var s: UnicodeString;
    u: UTF8String;
    z: UCS4String;
    i: integer;
begin
  u:='AB äöüß ÄÖÜ 12';
  s:=u;
  z:= UnicodeStringToUCS4String(s);
 
  Writeln(Length(s));
  Writeln(Length(z));
 
  for i := 1 to Length(s) do
    Write(HexStr(Ord(s[i]), 2), ' ');
  Writeln;
 
  for i := 0 to High(z) do
    Write(HexStr(Ord(z[i]), 8), ' ');
  Writeln;
end.

Output:

Code:

14
15
41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32
00000041 00000042 00000020 000000E4 000000F6 000000FC 000000DF 00000020 000000C4 000000D6 000000DC 00000020 00000031 00000032 00000000

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: rvk on September 25, 2020, 04:41:29 pm

Quote from: PascalDragon on September 25, 2020, 04:36:09 pm

u:='AB äöüß ÄÖÜ 12';

41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32

Which is strange, because the middle 4 characters (the first after the space) are E4 F6 FC DF.
And those are not UTF-8 characters, are they?
Every character above hex $80 should have multiple bytes, shouldn't it?

https://en.wikipedia.org/wiki/UTF-8

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: winni on September 25, 2020, 04:48:16 pm

Hi!

Definitely wrong.

The German äöüß ÄÖÜ are all in the Latin-1 Supplement block and all start with $C2 or $C3 in UTF8.
Something is wrong.

Winni

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: wp on September 25, 2020, 05:45:01 pm

No, Martin is right:

Code: Pascal

procedure TForm1.Button1Click(Sender: TObject);
const
  s1 = #$C3#$a4;
  s2 = 'a' + #$CC#$88;
begin
  ShowMessage('Is ' + s1 + ' the same character as ' + s2 + '?');  
end;  

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Martin_fr on September 25, 2020, 06:49:52 pm

Quote from: wp on September 25, 2020, 05:45:01 pm

No, Martin is right:

Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
const
s1 = #$C3#$a4;
s2 = 'a' + #$CC#$88;
begin
ShowMessage('Is ' + s1 + ' the same character as ' + s2 + '?');
end;

But that is about composition.

That has nothing to do with the fact that, for some reason, the unicodestring (utf16) does not have the same codepoints as the utf8 string. It has the bytes (codeunits) of the utf8 string, each extended to a word. But in utf16 those have a completely different meaning.

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on September 25, 2020, 07:22:09 pm

Quote from: rvk on September 25, 2020, 04:09:21 pm

Take a look at this:
https://wiki.freepascal.org/Unicode_Support_in_Lazarus#String_Literals
According to that page, assigning your literal string to a unicodestring will fail (without the {$codepage utf8}).

What version of FPC are you using??

Using const s : String = 'AB äöüß ÄÖÜ 12'; it might work better.
or even var s : String = 'AB äöüß ÄÖÜ 12';

From your link, using {$codepage utf8} should be correct, as Martin_fr recommended too. I had tried it already (see reply #26), but after this string 's' contained a Windows-charset (Ansi 1252?), although I was currently on Linux.

I made all tests with FPC 3.0.4 (on Windows 7 32-bit or Linux Ubuntu 18.04 64-bit).

I tried "const s : String = 'AB äöüß ÄÖÜ 12';" with and without {$codepage utf8}, and both times 's' had UTF8-charset and 'z' had Windows-charset (Ansi 1252?):
len(s)=21
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32
len(z)=15
41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32 00
Using "var s : String = 'AB äöüß ÄÖÜ 12';" made no difference.

Hello PascalDragon, I tried your demo (reply #28) and got the same results as you. But to be honest, I don't understand what you want to show me with that. Both 's' and 'z' show bytes, not characters, which are in Windows-charset (Ansi 1252?), although I am currently on Linux. What I want is to split a UTF8-string into its UTF8-characters.

@rvk (reply #29) and @Winni (reply #30):
What you see is not UTF8, this is Windows-charset (Ansi 1252?), as I wrote multiple times.

@wp (reply #31) and @Martin_fr (reply #32):
From my understanding this has nothing to do with the problem, that system.UnicodeStringToUCS4String() does not work :-)

I think we have all spent (more than) enough time on this problem now. For me it was about 2 days.
As noted before, I will now write the 3 very simple UTF8-functions, which are all I need for my primitive usage. It should not take more than 1 hour (including testing). Tomorrow I will report whether I was successful.

Thanks a lot to all of you for trying to help me and for your information. Again I learned a lot.

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Thaddy on September 25, 2020, 08:45:26 pm

Why did you not try with a supported version, like 3.2.0 instead of the unsupported 3.0.4?

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: BeniBela on September 25, 2020, 10:07:10 pm

Quote from: Hartmut on September 25, 2020, 10:43:07 am

Thank you for that link. I had a look into the sources and found, that they all require Unit LazUTF8, which I want to avoid if possible (see reply #12) for such a primitive usage as I have now. But I wrote me a notice for this link for possible future needs.

My enumerator should not require LazUTF8

Quote from: Thaddy on September 25, 2020, 08:41:37 am

Why this? Just curious...
Code: Pascal [Select][+][-]
const
UTF8PROC_NULLTERM = 1 shl 0;
It has no impact on code generation, but it looks silly.

That is copied directly from there: https://github.com/JuliaStrings/utf8proc/blob/master/utf8proc.h#L146-L167

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on September 26, 2020, 09:10:36 am

Quote from: Thaddy on September 25, 2020, 08:45:26 pm

Why did you not try with a supported version, like 3.2.0 instead of the unsupported 3.0.4?

I had tried with a 3.2.0 beta, but it made no difference, so I didn't mention it.

Quote from: BeniBela on September 25, 2020, 10:07:10 pm

My enumerator should not require LazUTF8

I had a look at both of your links (reply #16), but I did not understand much of what I saw and did not find anything that looked like it could help me (I don't know what an "enumerator" is or how it could solve my problem). I thought both links were about updating old data files.

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on September 26, 2020, 09:23:37 am

Because we could not find out why system.UnicodeStringToUCS4String() did not work, and because I did not want to accept the many problems and disadvantages of Unit LazUTF8 (see reply #12) for such a primitive usage as I have now, I wrote the 3 very simple UTF8-functions, which are all I needed:
- they work without Unit LazUTF8
- they work on Windows and Linux (32-bit and 64-bit)
- for me they are fast enough (my UTF8-strings are not longer than about 200 characters).
With these functions I have created my little compare utility and it works perfectly.

Code: Pascal [Select][+]

// only unit system is needed
type StrUTF8 = ansistring; {a separate type allows easy changes for experiments}
     CharUTF8 = string[4]; {space for 1 UTF8-character}
 
function charLen_UTF8(c: char): integer;
   {returns the length in bytes of an UTF8-character which starts with 'c'}
   begin
   case ord(c) shr 4 of
      $C,$D: exit(2);
      $E:    exit(3);
      $F:    exit(4);
      else   exit(1);
   end;
   end;
 
function length_UTF8(s: StrUTF8): PtrInt;
   {returns the number of UTF8-characters, which UTF8-String 's' has}
   var len,i: PtrInt;
       a: integer;
   begin
   len:=0; i:=1;
   while i <= length(s) do
      begin
      a:=charLen_UTF8(s[i]); {length in bytes of current UTF8-character}
      if i+a-1 > length(s) then exit(len); {don't count incomplete characters}
      inc(len);
      inc(i,a);
      end;
   exit(len);
   end;
 
function getChar_UTF8(s: StrUTF8; p: PtrInt): CharUTF8;
   {returns UTF8-character with number 'p' out of UTF8-String 's' or empty
    string, if p < 1 or p is too big}
   var i,n: PtrInt;
       a: integer;
   begin
   if p < 1 then exit('');
 
   i:=1; n:=0; {'n' counts already found UTF8-characters}
   while i <= length(s) do
      begin
      a:=charLen_UTF8(s[i]); {length in bytes of current UTF8-character}
      if i+a-1 > length(s) then exit(''); {without incomplete characters}
      inc(n); {next valid UTF8-character was found}
      if n=p then exit(copy(s,i,a));
      inc(i,a);
      end;
 
   exit(''); {if 'p' was too big}
   end;

Thanks again a lot to all who tried to help me.
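
For the character-by-character comparison that was the original goal, here is a minimal sketch that walks both strings in a single pass instead of calling getChar_UTF8 once per position. It assumes the same lead-byte rule as charLen_UTF8 above and, like it, does not validate continuation bytes. LeadLen and FirstDiffUTF8 are hypothetical names chosen for this illustration; only unit system is needed:

```pascal
program CompareUtf8;
{$mode objfpc}{$H+}

// Same lead-byte rule as charLen_UTF8 above (no validation of the input).
function LeadLen(c: AnsiChar): Integer;
begin
  case Ord(c) shr 4 of
    $C, $D: Result := 2;
    $E:     Result := 3;
    $F:     Result := 4;
  else
    Result := 1;
  end;
end;

// Hypothetical helper: returns the 1-based index (in characters, not bytes)
// of the first difference between a and b, or 0 if both are equal.
function FirstDiffUTF8(const a, b: AnsiString): PtrInt;
var
  ia, ib, la, lb: PtrInt;
begin
  Result := 0;
  ia := 1;
  ib := 1;
  while (ia <= Length(a)) and (ib <= Length(b)) do
  begin
    Inc(Result);                         // number of the character being compared
    la := LeadLen(a[ia]);
    lb := LeadLen(b[ib]);
    if Copy(a, ia, la) <> Copy(b, ib, lb) then
      Exit;                              // Result holds the differing position
    Inc(ia, la);
    Inc(ib, lb);
  end;
  if (ia <= Length(a)) or (ib <= Length(b)) then
    Inc(Result)                          // one string has extra characters
  else
    Result := 0;                         // both exhausted: equal
end;

begin
  // The umlauts are written as raw UTF-8 bytes: C3 A4 and C3 B6.
  WriteLn(FirstDiffUTF8('x'#$C3#$A4'y', 'x'#$C3#$A4'y'));  // 0
  WriteLn(FirstDiffUTF8('x'#$C3#$A4'y', 'x'#$C3#$B6'y'));  // 2
  WriteLn(FirstDiffUTF8('x'#$C3#$A4, 'x'#$C3#$A4'y'));     // 3
end.
```

For strings of about 200 characters the difference in speed is unlikely to matter; the single pass mainly avoids rescanning from the start for every character.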

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: PascalDragon on September 26, 2020, 09:53:06 am

Quote from: Hartmut on September 25, 2020, 07:22:09 pm

Hello PascalDragon, I tried your demo (reply #28) and got the same results as you. But to be honest, I don't understand what you want to show me with that. Both 's' and 'z' show Bytes, not characters, which are in Windows-charset (Ansi 1252?), although I am currently on Linux. What I want is to split an UTF8-string into it's UTF8-characters.

If you look at the assembly code you'll see that the string itself is encoded as UTF-8 and it's already correctly converted to UTF-16 upon the assignment. In the case of your specific example nothing obvious happens, because all characters you used are code points <= $FF.

Quote from: Hartmut on September 26, 2020, 09:23:37 am

Because we could not find out, why system.UnicodeStringToUCS4String() did not work and because I did not want to accept so many problems and disadvantages of Unit LazUTF8 (see reply #12) - for such a primitive usage which I have now - I wrote the 3 very simple UTF8-functions which I only needed:

No, it's not UnicodeStringToUCS4String that did not work; it worked absolutely correctly with the input it got. What did not work is your input string. It was not correctly converted to a valid UTF-16 string, and that is the problem.
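
A minimal sketch of that point, assuming FPC 3.x: if the UTF-8 bytes are first decoded explicitly with UTF8Decode, UnicodeStringToUCS4String receives valid UTF-16 and returns the expected code points. The sample string and the program name are made up for this illustration; the loop stops at the terminating zero element rather than relying on the exact array length:

```pascal
program Utf8ToUcs4;
{$mode objfpc}{$H+}

var
  utf8: RawByteString;
  u4: UCS4String;
  i: Integer;
begin
  // "A", U+00E4 and U+20AC written as raw UTF-8 bytes.
  utf8 := 'A'#$C3#$A4#$E2#$82#$AC;

  // Explicit decoding step: UTF-8 bytes -> UTF-16 -> UCS-4.
  u4 := UnicodeStringToUCS4String(UTF8Decode(utf8));

  i := 0;
  while (i < Length(u4)) and (u4[i] <> 0) do
  begin
    WriteLn('U+', HexStr(LongInt(u4[i]), 4));  // U+0041, U+00E4, U+20AC
    Inc(i);
  end;
end.
```

This only covers code points; combining sequences such as 'a' + #$CC#$88 still come out as two separate code points.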

Title: Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on September 26, 2020, 10:42:09 am

Quote from: PascalDragon on September 26, 2020, 09:53:06 am

Quote from: Hartmut on September 25, 2020, 07:22:09 pm
Hello PascalDragon, I tried your demo (reply #28) and got the same results as you. But to be honest, I don't understand what you want to show me with that. Both 's' and 'z' show Bytes, not characters, which are in Windows-charset (Ansi 1252?), although I am currently on Linux. What I want is to split an UTF8-string into it's UTF8-characters.

If you look at the assembly code you'll see that the string itself is encoded as UTF-8 and it's already correctly converted to UTF-16 upon the assignment. In the case of your specific example nothing obvious happens, because all characters you used are code points <= $FF.

I'm not familiar with reading assembly code. But you wrote in reply #28 "I have stored it as UTF-8 with BOM". That makes me guess that you wanted to add an attachment, but there is none.

Quote from: PascalDragon on September 26, 2020, 09:53:06 am

Quote from: Hartmut on September 26, 2020, 09:23:37 am
Because we could not find out, why system.UnicodeStringToUCS4String() did not work and because I did not want to accept so many problems and disadvantages of Unit LazUTF8 (see reply #12) - for such a primitive usage which I have now - I wrote the 3 very simple UTF8-functions which I only needed:

No, it's not UnicodeSTringToUCS4String that did not work, it worked absolutely correctly with the input it got. What did not work is your input string. It was not correctly converted to a valid UTF-16 string and that is the problem.

I believe you are right that UnicodeStringToUCS4String() "only" got the wrong input. But you can see how many experiments we tried to solve the problem, and nobody found a solution that worked.

Title: Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
Post by: PascalDragon on September 26, 2020, 12:06:49 pm

Quote from: Hartmut on September 26, 2020, 10:42:09 am

Quote from: PascalDragon on September 26, 2020, 09:53:06 am
Quote from: Hartmut on September 25, 2020, 07:22:09 pm
Hello PascalDragon, I tried your demo (reply #28) and got the same results as you. But to be honest, I don't understand what you want to show me with that. Both 's' and 'z' show Bytes, not characters, which are in Windows-charset (Ansi 1252?), although I am currently on Linux. What I want is to split an UTF8-string into it's UTF8-characters.

If you look at the assembly code you'll see that the string itself is encoded as UTF-8 and it's already correctly converted to UTF-16 upon the assignment. In the case of your specific example nothing obvious happens, because all characters you used are code points <= $FF.

I'm not familiar reading assembly code. But you wrote in reply #28 "I have stored it as UTF-8 with BOM". That lets me guess, that you wanted to add an attachment, but there is no.

No, I had not intended to attach a project. And as you wrote it worked for you, so I don't need to attach one anyway.

Quote from: Hartmut on September 26, 2020, 10:42:09 am

Quote from: PascalDragon on September 26, 2020, 09:53:06 am
Quote from: Hartmut on September 26, 2020, 09:23:37 am
Because we could not find out, why system.UnicodeStringToUCS4String() did not work and because I did not want to accept so many problems and disadvantages of Unit LazUTF8 (see reply #12) - for such a primitive usage which I have now - I wrote the 3 very simple UTF8-functions which I only needed:

No, it's not UnicodeSTringToUCS4String that did not work, it worked absolutely correctly with the input it got. What did not work is your input string. It was not correctly converted to a valid UTF-16 string and that is the problem.

I believe that you are right, that UnicodeSTringToUCS4String() "only" got the wrong input. But you see, how many experiments we have tried, to solve the problem, but nobody found a solution which worked.

I have attached an example LCL project that shows the important point:
- either your file needs to be stored as “UTF-8 with BOM” (you need to do a right click in the editor, go to “File Settings” (or similar, I'm using German) and then “Character Encoding”, confirm the dialog to change the file)
- or you need to store it as “UTF-8” and add {$codepage utf8}

In both cases the constant string data will be stored as UTF-16 if you simply assign it to a UnicodeString (or as UTF-8 data if you assign it to a String or UTF8String).

Title: Re: How to split UTF8-strings into it's characters (not bytes)?
Post by: wp on September 26, 2020, 12:13:28 pm

Quote from: Hartmut on September 24, 2020, 09:56:48 pm

With Unit LazUTF8 I faced a lot of problems and disadvantages in the past. Some Examples I remember immediately:
- on Windows in console programs adding Unit LazUTF8 changes the charset of the filenames reported by sysutils.FindFirst() and FindNext() and the charset of the results of ParamStr() and the results of readln(). Without Unit LazUTF8 they return Windows-Charset (Ansi 1252?) / with Unit LazUTF8 they return UTF8. This difference does not make life easy.
- during decades I have written a couple of common libraries for console programs, which partly have problems with these differing charsets, so I get wrong results with Unit LazUTF8 because of the changed charset
- for a couple of programs I still use FPC 2.6.4 (with the same common libraries), but in FPC 2.6.4 the above functions never return UTF8 on Windows
- Windows-charset generally is much easier than UTF8 (as we see now), because each char is only 1 byte long
- in some cases (but not always) I had problems, that writeln() for an 'ansistring' showed damaged characters for Ä Ö Ü ä ö ü ß after I added Unit LazUTF8.
And there were more issues, which I just can't remember offhand. So I want to avoid Unit LazUTF8, especially in console programs, wherever possible.

To be honest I do not fully understand your description, but you seem to have issues with usage of LazUTF8. I cannot imagine that LazUTF8 is the source of such errors; I use it in many projects and don't have any problems with it. But the one thing that I learned while trying to understand many issues with UTF8 is that one conversion at the wrong place can cause unrecoverable errors. So, please check your code and make sure that every string conversion is in place and needed. And don't expect programs written for fpc 3.x to work correctly with 2.6.4 or older, because v3 introduces a massive change in string handling.

I am attaching a demo for FindFirst which finds a file "testäöü.txt" which should cause issues according to your description. For me, it does not. The filename is displayed in the console correctly with fpc 3.2 as well as 2.6.4 (after conversion). (NOTE: my windows cp is 1252. If yours is different the filename may appear differently than shown here). And it does not make a difference whether LazUTF8 is linked in or not (activate/deactivate the define USE_LAZUTF8).

Title: Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on September 26, 2020, 02:09:16 pm

Quote from: PascalDragon on September 26, 2020, 12:06:49 pm

I have attached an example LCL project that shows the important point:
- either your file needs to be stored as “UTF-8 with BOM” (you need to do a right click in the editor, go to “File Settings” (or similar, I'm using German) and then “Character Encoding”, confirm the dialog to change the file)
- or you need to store it as “UTF-8” and add {$codepage utf8}

Thanks PascalDragon for your demo. I attached its output as screenshot1. I see exactly the same result as in your reply #28, and again the output is not UTF8, which is what I need.
BTW: my source had already been stored as UTF8 and I had used {$codepage utf8} already before as mentioned in reply #26. And in my real world the UTF8-strings which I want to compare are no constants, they are returned in UTF8 from a function.

Quote

In both cases the constant string data will be stored as UTF-16 if you simply assign it to a UnicodeString (or as UTF-8 data if you assign it to a String or UTF8String).

Then I made 2 more tests after changing var 'u' to String and to UTF8String. Their outputs were identical; I attached them as screenshot2. Now we see that the input shows the single bytes of UTF8, but the output is identical to screenshot1: nothing has changed. That's why I think that I can't use UnicodeStringToUCS4String() in my case.

What I needed was to split a UTF8-string into its UTF8-characters, not into something else. Therefore I wrote my own functions yesterday (see reply #37), and the problem is solved.

@wp: Thanks for your reply, I will check it and report later.

Title: Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on September 26, 2020, 06:09:48 pm

Quote from: wp on September 26, 2020, 12:13:28 pm

And don't expect programs written for fpc 3.x to work correctly with 2.6.4 or older, because v3 introduces a massive change in string handling.

Oh, there we misunderstood: I do not try to run programs with FPC 2.6.4, which have been written for FPC 3.x. What I have is a couple of "common libraries", which exist only once, and I use these both for all new programs with FPC 3.x and for a couple of older programs, which I still compile with FPC 2.6.4, because they have originally been written for 2.6.4, and (until now) I did not update them to 3.x.

Quote

I am attaching a demo for FindFirst which finds a file "testäöü.txt" which should cause issues according to your description. Me me, it does not. The filename is displayed in the console correctly with fpc 3.2 as well as 2.6.4 (after conversion). (NOTE: my windows cp is 1252. If yours is different the filename may appear differently than shown here). And it does not make a difference whether LazUTF8 is linked in or not (activate/deactivate the define USE_LAZUTF8).

Thank you for your demo and the time you invested to help me. I found that activating/deactivating the define USE_LAZUTF8 made no difference. The reason is that Unit LConvEncoding is used, which contains:

Code: Pascal [Select][+]

uses SysUtils, Classes, dos, LazUTF8

so LazUTF8 was always included and you saw no difference.

After deactivating units LazUTF8 and LConvEncoding I had exactly the difference, which I was talking about. I added some lines to make the difference clearer:

Code: Pascal [Select][+]

begin
...
  if FindFirst('test*.*', faAnyFile, SR) = 0 then // to catch file 'testäöü.txt'
  begin
    repeat
      WriteLn('Filename as returned by FindFirst: ', SR.Name);
      for i:=1 to length(SR.Name) do  write(HexStr(ord(SR.Name[i]),2), ' ');
      writeln('len=', length(SR.Name)); 
...      
     until FindNext(SR) <> 0;
    FindClose(SR);
  end;
...  
end.

Here is the output, depending on FPC version and whether LazUTF8 (including LConvEncoding) was enabled or not (all on Windows 7 32-bit):
FPC    LazUTF8  Output
----------------------
3.0.4  yes      74 65 73 74 5F C3 A4 C3 B6 C3 BC 2E 74 78 74 len=15
3.0.4  no       74 65 73 74 5F E4 F6 FC 2E 74 78 74 len=12
2.6.4  yes      74 65 73 74 5F E4 F6 FC 2E 74 78 74 len=12
2.6.4  no       74 65 73 74 5F E4 F6 FC 2E 74 78 74 len=12

We see that in FPC 3.x the same FindFirst() function changes the charset of the reported filenames between UTF8-charset and Windows-charset, depending on whether you include Unit LazUTF8 or not. All procedures and functions which deal with filenames switch in the same way. And as I wrote, the charsets of the results of ParamStr() and the results of readln() play the same game. And there might be more (which I just don't remember now).

I hope that you now will believe that my list of problems and disadvantages with Unit LazUTF8 is real.

Title: Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
Post by: nanobit on September 26, 2020, 06:34:44 pm

You should update to FPC 3.2.

Title: Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
Post by: wp on September 26, 2020, 07:13:06 pm

Quote from: Hartmut on September 26, 2020, 06:09:48 pm

Quote from: wp on September 26, 2020, 12:13:28 pm
And don't expect programs written for fpc 3.x to work correctly with 2.6.4 or older, because v3 introduces a massive change in string handling.

Oh, there we misunderstood: I do not try to run programs with FPC 2.6.4, which have been written for FPC 3.x. What I have is a couple of "common libraries", which exist only once, and I use these both for all new programs with FPC 3.x and for a couple of older programs, which I still compile with FPC 2.6.4, because they have originally been written for 2.6.4, and (until now) I did not update them to 3.x.

FPC changed string handling massively with version 3.0.x. Sticking to a library which is not updated to fpc 3 calls for trouble with strings. You must take the time to update your shared units. Otherwise the only real recommendation is to stick to 2.6.4.

Quote from: Hartmut on September 26, 2020, 06:09:48 pm

I found the reason in, that Unit LConvEncoding is used, which contains:
Code: Pascal [Select][+][-]
uses SysUtils, Classes, dos, LazUTF8
so LazUTF8 was always included and you saw no difference.

After deactivating units LazUTF8 and LConvEncoding I had exactly the difference, which I was talking about. I added some lines to make the difference clearer:
[...]

Yes thanks, this shows me the difference. But these changes are not introduced by LazUTF8 but by unit FPCAdds in order to adapt to the new FPC strings. A problem could be for you that this unit is "used" by some other units, independently of LazUTF8: Do a "Find in files" over the lcl directory of your Lazarus installation and you'll find it in graphics, intfgraphics, imglist, lresources. OK - your interest is in console programs, so you probably will not use them.

What FPCAdds does is it sets codepage conversion code in its initialization section - see also https://wiki.freepascal.org/Unicode_Support_in_Lazarus#Technical_implementation. And that gives me an idea: Why not call them at the beginning of your project with the default Windows ANSI codepage, and you'll avoid UTF8-conversion even if LazUTF8 is in "uses". As shown below this works in my console program (tested Win 10):

Code: Pascal [Select][+]

program Project1;
 
{$DEFINE USE_LazUTF8}
 
uses
  Windows,   // for: GetConsoleCP, GetACP
  Sysutils //, FPCAdds
  {$IFDEF USE_LAZUTF8}
  , LazUTF8, LConvEncoding
  {$ENDIF}
  ;
 
var
  consCP: Integer;
  winCP: Integer;
  SR: TSearchRec;
  i: Integer;
 
begin
  consCP := GetConsoleCP;
  winCP := GetACP;
 
  SetMultiByteConversionCodePage(winCP);
  SetMultiByteFileSystemCodePage(winCP);
  SetMultiByteRTLFileSystemCodePage(winCP);
 
  WriteLn('The codepage of the console is ', consCP);
  WriteLn('System codepage is ', winCP);
  WriteLn;
 
  if FindFirst('test*.*', faAnyFile, SR) = 0 then    // to catch file 'testäöü.txt'
  begin
    repeat
      WriteLn('Filename as returned by FindFirst: ', SR.Name);
      for i:=1 to length(SR.Name) do  write(HexStr(ord(SR.Name[i]),2), ' ');
      writeln('len=', length(SR.Name));
      {$IF FPC_FullVersion < 30000}
      WriteLn('Filename after codepage conversion: ', ConvertEncoding(SR.Name, 'cp'+IntToStr(winCP), 'cp'+IntToStr(consCP)));
      {$IFEND}
     until FindNext(SR) <> 0;
    FindClose(SR);
  end;
 
  {$IFDEF USE_LAZUTF8}
  WriteLn(UTF8Length('äöü'));
  {$ENDIF}
 
  ReadLn;
end.       

Title: Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on September 27, 2020, 11:00:18 am

Quote from: nanobit on September 26, 2020, 06:34:44 pm

You should update to FPC 3.2.

What information do you have that this will improve any of the problems in this Topic?
Have you updated to FPC 3.2? If yes, did you try even 1 example from this Topic with it? Did it work better?
And did you read what I wrote:

Quote from: Hartmut on September 26, 2020, 09:10:36 am

I had tried with a 3.2.0 beta, but it made no difference...

Quote from: wp on September 26, 2020, 07:13:06 pm

Yes thanks, this shows me the difference. But these changes are not introduced by LazUTF8 but by unit FPCAdds in order to adapt to the new FPC strings. A problem could be for you that this unit is "used" by some other units, independently of LazUTF8: Do a "Find in files" over the lcl directory of your Lazarus installation and you'll find it in graphics, intfgraphics, imglist, lresources. OK - your interest is in console programs, so you probably will not use them.

You are right, I wrote only a handful of GUI programs, all the rest are console programs.

Quote

What FPCAdds does is it sets codepage conversion code in its initialization section - see also https://wiki.freepascal.org/Unicode_Support_in_Lazarus#Technical_implementation. And that gives me an idea: Why not call them at the beginning of your project with the default Windows ANSI codepage, and you'll avoid UTF8-conversion even if LazUTF8 is in "uses". As shown below this works in my console program (tested Win 10)

This sounds very interesting! If that works it would be great! Thanks a lot for that idea.
Please give me some time for research, because you use a couple of functions I have never heard of. I need to find out what they do and run a couple of tests to see which problems with Unit LazUTF8 then disappear and which maybe do not. This might take some days. I will report afterwards.

Title: Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
Post by: nanobit on September 27, 2020, 12:37:16 pm

I did not say FPC 3.2 alone would solve your specific problem,
but FPC 3.2 has some unicode related RTL improvements over FPC 3.0.4.
Generally, testers prefer to start with newer (bug-fixed) versions.

And if you had read https://wiki.freepascal.org/FPC_Unicode_support,
you would be less surprised about unicode-settings:
system.defaultSystemCodePage = cp_utf8
system.defaultFileSystemCodePage = cp_utf8
system.defaultRTLFileSystemCodePage = cp_utf8
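
To see which values are actually in effect in a given program, the three variables can simply be printed. This is a minimal sketch assuming FPC 3.0 or newer (the variables do not exist in 2.6.4); the program name is made up, and the values shown will differ depending on the platform and on which units have run their initialization:

```pascal
program ShowCodePages;
{$mode objfpc}{$H+}

begin
  // 65001 is CP_UTF8; any other number is an ANSI or OEM code page.
  WriteLn('CP_UTF8                      = ', CP_UTF8);
  WriteLn('DefaultSystemCodePage        = ', DefaultSystemCodePage);
  WriteLn('DefaultFileSystemCodePage    = ', DefaultFileSystemCodePage);
  WriteLn('DefaultRTLFileSystemCodePage = ', DefaultRTLFileSystemCodePage);
end.
```

Running it once with and once without LazUTF8 in the uses clause should make any change in these settings directly visible.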

Title: Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on September 28, 2020, 12:28:27 pm

Quote from: nanobit on September 27, 2020, 12:37:16 pm

And if you had read https://wiki.freepascal.org/FPC_Unicode_support,
you would be less surprised about unicode-settings:
system.defaultSystemCodePage = cp_utf8
system.defaultFileSystemCodePage = cp_utf8
system.defaultRTLFileSystemCodePage = cp_utf8

During the years I have read a lot of Wikis and other documentation about UTF8 / Unicode / codepages etc. in FPC and LCL. But several parts I have forgotten again over the time (because I need all this very seldom) and for several parts I did not really understand all what I read, because all this stuff about codepages / Unicode / UTF8 / codepoints / collations etc. etc. is not my world.

So I am very happy that wp found out and explained, where my problems with Unit LazUTF8 originally come from (the Initialization-part of Unit FPCAdds) and his idea, how this could be avoided very easily :-) I'm testing his suggestion and until now it looks promising.

Title: Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
Post by: nanobit on September 28, 2020, 12:53:32 pm

Ok, but don't forget: If your folder contains a mixture of German (umlaut)
and Cyrillic filenames, you still need cp_utf8 instead of winCP.

Title: Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on October 02, 2020, 04:11:32 pm

Quote from: wp on September 26, 2020, 07:13:06 pm

What FPCAdds does is it sets codepage conversion code in its initialization section - see also https://wiki.freepascal.org/Unicode_Support_in_Lazarus#Technical_implementation. And that gives me an idea: Why not call them at the beginning of your project with the default Windows ANSI codepage, and you'll avoid UTF8-conversion even if LazUTF8 is in "uses". As shown below this works in my console program (tested Win 10)

Hello wp,
now I have made extensive tests with your suggestion (reply #45) and I have good news!

Quote

With Unit LazUTF8 I faced a lot of problems and disadvantages in the past:
a) on Windows in console programs adding Unit LazUTF8 changes the charset of the filenames reported by sysutils.FindFirst() and FindNext(). Without Unit LazUTF8 they return Windows-Charset (ANSI 1252) / with Unit LazUTF8 they return UTF8. This difference does not make life easy.
b) ditto for all other procedures and functions which deal with filenames or folders
c) ditto for the charset of the results of ParamStr()
d) ditto for the results of readln()
e) during decades I have written a couple of common libraries for console programs, which partly have problems with these differing charsets, so then I get wrong results with Unit LazUTF8 because of the changed charset
f) for a couple of older programs I still use FPC 2.6.4 (with the same common libraries), but in FPC 2.6.4 the above functions never return UTF8 on Windows
g) in some cases (but not always) I had problems, that writeln() for an 'ansistring' showed damaged characters for Ä Ö Ü ä ö ü ß when I added Unit LazUTF8.

I tested all above problems with FPC 3.0.4 and 3.3.1 beta on Windows 7 and with 1 exception all of them are solved by your suggestion. Many many thanks to you for that great and very-easy-to-use idea!

The exception is c) concerning the results of ParamStr(). I created a small demo for that (see attached as compilable project):

Code: Pascal [Select][+]

procedure set_charset_WIN;
   {switches 3 codepages on Windows to ANSI-1252, which have been changed before
    to UTF8, if Unit 'LazUTF8' is included}
   var winCP: UINT; {dword}
   begin
   winCP:=windows.GetACP; {gets System codepage}
   SetMultiByteConversionCodePage(winCP);
   SetMultiByteFileSystemCodePage(winCP);
   SetMultiByteRTLFileSystemCodePage(winCP);
   end;
 
procedure Test_ParamStr;
   {shows the charset returned by system.ParamStr() OR objpas.ParamStr() 
    depending of current Compiler "$mode".
    Usage: start the program in a Windows 7 Console with a command line
    parameter like "äöü".
    If then result = "E4 F6 FC len=3" => WINDOWS-charset (ANSI 1252) /
    if then result = "C3 A4 C3 B6 C3 BC len=6" => UTF8-charset}
   type ansi_1252 = type AnsiString(1252); {Windows-charset}
   var sa: ansistring;
       sw: ansi_1252;
       ss: string[255];
       i: integer;
   begin
   writeln('Results of ParamStr():');
   ss:=ParamStr(1);                  // type shortstring:
   write(' - string[s255] => ');
   for i:=1 to length(ss) do  write(HexStr(ord(ss[i]),2), ' ');
   writeln('len=', length(ss));
 
   sa:=ParamStr(1);                  // type ansistring:
   write(' - ansistring   => ');
   for i:=1 to length(sa) do  write(HexStr(ord(sa[i]),2), ' ');
   writeln('len=', length(sa));
 
   sw:=ParamStr(1);                  // type AnsiString(1252):
   write(' - ansi(1252)   => ');
   for i:=1 to length(sw) do  write(HexStr(ord(sw[i]),2), ' ');
   writeln('len=', length(sw));
   end; 

Info: {$mode objfpc} causes that ParamStr() of Unit 'objpas' is used / {$mode TP} causes that ParamStr() of Unit 'system' is used.
The results are (both in FPC 3.0.4 and 3.3.1 beta):
         Unit      call of             charset   charset
{$mode}  LazUTF8   set_charset_WIN()   'ss+sa'   'sw'
---------------------------------------------------------
objfpc   without   no                  WIN       WIN
  ""     without   yes                 WIN       WIN
  ""     with      no                  UTF8      WIN
  ""     with      yes                 UTF8      UTF8
TP       without   no                  WIN       WIN
  ""     without   yes                 WIN       WIN
  ""     with      no                  UTF8      UTF8
  ""     with      yes                 UTF8      UTF8

We see that the call of set_charset_WIN() unfortunately never changes the result from UTF8 to WIN. It makes a difference only in 1 rare case with type 'AnsiString(1252)', but 1) there it changes the result into UTF8, which doesn't help me, and 2) I never used type 'AnsiString(1252)' in combination with ParamStr(), so this case is not of interest.

Do you (or someone else) have an idea how the returned charset of ParamStr() can be switched from UTF8 to WIN if Unit LazUTF8 is included (without damaging all the other solved cases above)? I am looking for a "global" solution like the above procedure set_charset_WIN(), which only has to be called once at the start of an affected program. Of course I'm not keen on adapting every single usage of ParamStr() in my programs and libraries individually (more than 200).

Thanks to all for your help!

Title: Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
Post by: wp on October 04, 2020, 12:45:06 am

I had the idea to compile your demo program with today's FPC 3.3.1 -- and here is the output (mode ObjFPC):

Code: [Select]

Usage: start this program in a Windows 7 Console with a command line parameter like "├ñ├Â├╝"
FPC-Version:  3.3.1
Unit LazUTF8: YES

1) WITHOUT call of set_charset_WIN() => Results of ParamStr():
 - string[s255] => C3 A4 C3 B6 C3 BC len=6
 - ansistring   => C3 A4 C3 B6 C3 BC len=6
 - ansi(1252)   => E4 F6 FC len=3

2) WITH call of set_charset_WIN() => Results of ParamStr():
 - string[s255] => E4 F6 FC len=3
 - ansistring   => E4 F6 FC len=3
 - ansi(1252)   => E4 F6 FC len=3

--> Working! (but not for mode TP)

Then I also tried FPC-fixes, but I get the same result as with FPC 3.2.0 (6 bytes in case (2))

Title: Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
Post by: Hartmut on October 04, 2020, 03:38:59 pm

Hello wp, thanks a lot for your continuing and valuable help. It is good news that with a current FPC 3.3.1 the problems with ParamStr() and Unit LazUTF8 can also be switched off by your solution, at least in mode ObjFPC. I implemented a call to set_charset_WIN() at the start of my console programs which use Unit LazUTF8 (there were not very many).

Unfortunately, during my tests with FPC 3.3.1 beta I stumbled over one more new UTF8-problem, which cost me a lot of time to dive into: when I read data from a SQLite-DB, for many years (at least since FPC 2.6.4) this data was always in UTF8, for console and GUI programs, regardless of whether Unit LazUTF8 was used or not. But now (if Unit LazUTF8 and set_charset_WIN() are not used) this data is instead returned in Windows-charset (ANSI 1252)!! But the Select-Statements, if they include characters like Ä Ö Ü ä ö ü ß, still have to be in UTF8!! How crazy!

Reading the release notes for FPC 3.2 one more time, I found https://wiki.freepascal.org/User_Changes_3.2#CodePage_aware_TStringField_and_TMemoField. I do not really understand much of what is written there, and what I tried, inspired by this information, e.g. setting SQLite3Connection1.CharSet:='UTF8' (directly after dynamic creation of that var), did not help. But I want to continue researching this at a later moment and, if necessary, will open a new Topic for this new problem.