[SOLVED] How to split UTF8-strings into it's characters (not bytes)?

winni

Hero Member
Posts: 3197

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #15 on: September 24, 2020, 11:29:10 pm »

Quote from: PascalDragon on September 24, 2020, 10:42:41 pm

The additions of the most recent Unicode version (namely 13.0 from March 2020) are listed here.

Yeah - Emojis are fighting for women's liberation!

Now we got MRS SANTA CLAUS !!!

Instead of abolishing the Christmas trash, they create new stuff.

Winni


Logged

BeniBela

Hero Member
Posts: 906

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #16 on: September 25, 2020, 12:31:20 am »

I have built my own enumerator for that: http://hg.benibela.de/bbutils/file/a94b6026f7d0/bbutils.pas#l509

Quote from: Martin_fr on September 24, 2020, 05:57:03 pm

Quote from: Blaazen on September 24, 2020, 03:26:17 pm
Here: https://wiki.freepascal.org/UTF8_Tools is wiki for UTF8Tools mentioned by Bart. There's download link at the bottom.
How up to date is it?

It seems to have a hardcoded datafile, and the file date is from 2008?
There are new chars added to the unicode standard every year.

Then it is from 2008.

I took that file and updated it in 2016. Guess it is time to update it again: http://hg.benibela.de/internettools/file/default/data/bbunicodeinfo.pas

Logged

https://www.benibela.de/index_en.html
https://github.com/benibela

Thaddy

Hero Member
Posts: 14357
Sensorship about opinions does not belong here.

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #17 on: September 25, 2020, 08:41:37 am »

Why this? Just curious...

Code: Pascal [Select][+]

const
  UTF8PROC_NULLTERM = 1 shl 0;

It has no impact on code generation, but it looks silly.

« Last Edit: September 25, 2020, 08:49:06 am by Thaddy »

Logged

Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

PascalDragon

Hero Member
Posts: 5462
Compiler Developer

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #18 on: September 25, 2020, 09:08:33 am »

Quote from: winni on September 24, 2020, 11:29:10 pm

Quote from: PascalDragon on September 24, 2020, 10:42:41 pm

The additions of the most recent Unicode version (namely 13.0 from March 2020) are listed here.

Yeah - Emojis are fighting for women's liberation!

It's about diversity. The Unicode consortium is working on getting the gender specific emojis done in both a variant of the other gender and a gender-neutral one. I personally - as a genderfluid person - appreciate that very much.

Logged

Hartmut

Hero Member
Posts: 749

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #19 on: September 25, 2020, 10:43:07 am »

Thanks again to all for your many replies. I will continue to answer them one by one.

@Martin_fr: (from reply #4 and #6)
I had never heard of Codepoints before. Thanks for clarification. I will keep it in mind.

Quote from: Aidex on September 24, 2020, 02:33:22 pm

UCS4Char is a 32 bit char, UCS4String is an array of UCS4Char.
function UnicodeStringToUCS4String(const s: UnicodeString): UCS4String;
The function converts a Unicode string to a UCS-4 encoded string.
https://www.freepascal.org/docs-html/rtl/system/unicodestringtoucs4string.html

This sounds interesting, because it does not seem to need unit LazUTF8. But I could not get it to work. Your link says that a widestring manager is required. After searching on Google I found that on Linux "uses cwstring" should do the job. For Windows, all I could find in a reasonable time was that "uses LazUTF8" automatically includes a widestring manager. But it did not work on either OS (FPC 3.0.4). Here is my demo:

Code: Pascal [Select][+]

{$mode objfpc}{$H+}
 
{$IFDEF LINUX}
   uses cwstring; // install widestring manager
{$ENDIF}
{$IFDEF WINDOWS}
   uses LazUTF8;  // should include the widestring manager (?)
{$ENDIF}  
 
procedure test1; 
   var s: UnicodeString;
       z: UCS4String;
       i: integer;
   begin
   s:='AB äöüß ÄÖÜ 12';
   writeln('len(s)=', length(s));
   for i:=1 to length(s) do  write(ord(s[i]), ' ');
   writeln;
 
   z:= UnicodeStringToUCS4String(s);
   writeln('len(z)=', Length(z));
   for i:=0 to High(z) do  write(ord(z[i]), ' ');
   writeln;
   end;

The output shows (both for Windows and Linux):
len(s)=21
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50
len(z)=22
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50 0
Do you have an idea why it does not work?

Quote from: Blaazen on September 24, 2020, 03:26:17 pm

Here: https://wiki.freepascal.org/UTF8_Tools is wiki for UTF8Tools mentioned by Bart. There's download link at the bottom.

Thank you for that link. I had a look at the sources and found that they all require unit LazUTF8, which I want to avoid if possible (see reply #12) for such a primitive usage as I have now. But I made a note of this link for possible future needs.

Quote from: Martin_fr on September 24, 2020, 11:13:12 pm

When the installer is built, the ppu will contain the path to the unit source, as it is on the machine that was used to build the installer.
So unless you recompile those units, the compiler thinks that is the path where the unit is (or used to be).

Thanks for this info, I did not know that before.

Logged

winni

Hero Member
Posts: 3197

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #20 on: September 25, 2020, 12:31:12 pm »

@Hartmut

Hi!

Don't get confused by all the information about UTF8.

I'll show you a very simple way to separate the UTF8 chars of a string into a
StringList.

Code: Pascal [Select][+]

uses ....LazUTF8, lclType;     
 
 
procedure TForm1.Button4Click(Sender: TObject);
const   MyUTF8 = 'Wir müssen uns nicht ärgern über UTF8! &#128064; ';
var St : TStringList;
    UChar : TUTF8Char;
    i : integer;
begin
St := TStringList.Create;
for i := 1 to Utf8Length(MyUTF8) do
   begin
     UChar := UTF8Copy (MyUTF8,i,1);
     St.add(UChar);
   end;
showMessage (St.Text);
St.Free;
end;
 
 
 

Winni

Logged

Martin_fr

Administrator
Hero Member
Posts: 9855
Debugger - SynEdit - and more

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #21 on: September 25, 2020, 01:05:43 pm »

Quote from: winni on September 25, 2020, 12:31:12 pm

I show you a very simple way to separe the UTF8chars of a string into a
StringList.

Code: Pascal [Select][+][-]
for i := 1 to Utf8Length(MyUTF8) do
begin
UChar := UTF8Copy (MyUTF8,i,1);

Fine for shorter strings...

But try that on longer strings, say 100,000 bytes long. It takes 12 seconds on an I9 8600K @ 4.7 GHz (no debugging / O3). And that is without adding to the string list, only doing the 100,000 Utf8Copy calls.

Try it with 200,000 => 50 seconds.

It's O(n^2): every Utf8Copy(MyUTF8, i, 1) has to scan from the start of the string to find the i-th codepoint, so doubling the input size roughly quadruples the run time.

Something like this should do the job (it gets codepoints, just as Utf8Copy does):

Code: Pascal [Select][+]

 CurCharStart := 1;
 // "<=" so that the last byte of the string is processed as well
 while CurCharStart <= Length(MyUTF8) do begin
   NextCharStart := CurCharStart + 1;
   // skip continuation bytes (bit pattern 10xxxxxx)
   while (NextCharStart <= Length(MyUTF8)) and ((ord(MyUTF8[NextCharStart]) and $C0) = $80) do
     inc(NextCharStart);
   UChar := copy(MyUTF8, CurCharStart, NextCharStart - CurCharStart);
 
   // Process the codepoint in UChar
 
   CurCharStart := NextCharStart;
 end;
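
For anyone who wants to try this without LazUTF8, here is a minimal self-contained sketch of the same linear scan. SplitUtf8 and the program name are made up for illustration and are not part of any library. It assumes well-formed UTF-8 input and uses "<=" in both loop conditions so that the final byte is included. The sample text is written as explicit bytes so the result does not depend on the encoding of the source file.

```pascal
program SplitUtf8Demo;

{$mode objfpc}{$H+}

uses
  Classes;

// Appends each codepoint of the UTF-8 text in S to Dest as its own string.
// Assumes well-formed UTF-8; a stray continuation byte is simply
// attached to the preceding codepoint.
procedure SplitUtf8(const S: string; Dest: TStrings);
var
  CurStart, NextStart: SizeInt;
begin
  CurStart := 1;
  while CurStart <= Length(S) do
  begin
    NextStart := CurStart + 1;
    // Continuation bytes have the bit pattern 10xxxxxx.
    while (NextStart <= Length(S)) and
          ((Ord(S[NextStart]) and $C0) = $80) do
      Inc(NextStart);
    Dest.Add(Copy(S, CurStart, NextStart - CurStart));
    CurStart := NextStart;
  end;
end;

var
  Parts: TStringList;
  i: Integer;
begin
  Parts := TStringList.Create;
  try
    // 'A', U+00E4 (C3 A4) and U+20AC (E2 82 AC) as explicit bytes.
    SplitUtf8('A'#$C3#$A4#$E2#$82#$AC, Parts);
    for i := 0 to Parts.Count - 1 do
      WriteLn(i, ': ', Length(Parts[i]), ' byte(s)');
  finally
    Parts.Free;
  end;
end.
```

The demo prints the byte length of each codepoint (1, 2 and 3 bytes) rather than the characters themselves, because console output of non-ASCII text depends on the terminal setup. Each byte of the input is visited once, so the run time grows linearly with the input size.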
 

Logged

From the wiki: Ide Tools, Code completion and more / IDE cool features / Debugger Status

Hartmut

Hero Member
Posts: 749

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #22 on: September 25, 2020, 01:30:17 pm »

Thanks Winni for that demo. It's easy to understand, but also requires Unit LazUTF8, which I want to avoid (see reply #12) for such a primitive usage as I have now.
If nobody gets function system.UnicodeStringToUCS4String() from Aidex to work (see problem in reply #19), then I will create the 3 very simple UTF8-functions, which I only need now for my primitive usage.

Thanks Martin_fr for that improvement. If I will create my own functions, I thought about something like that. But I still hope, that someone finds out, why function system.UnicodeStringToUCS4String() from Aidex does not work in my case (see reply #19).

Logged

rvk

Hero Member
Posts: 6162

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #23 on: September 25, 2020, 01:51:39 pm »

Quote from: Hartmut on September 25, 2020, 10:43:07 am

The output shows (both for Windows and Linux):
len(s)=21
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50
len(z)=22
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50 0
Do you have an idea why it does not work?

UCS4String always has a 'hard' #0 termination (an explicit terminating #0).
You can see this in UnicodeStringToUCS4String() and in UCS4Encode(), which it calls
(reslen is reset before the string is filled, and a hard #0 is placed at the end).

You can also see that UCS4StringToUnicodeString and UCS4StringToWideString both strip that #0. If there were no #0, all these functions would fail.
The same shows in UCS4Decode():

Code: Pascal [Select][+]

procedure UCS4Decode(const s: UCS4String; dest: PWideChar);
var
  i: sizeint;
  nc: UCS4Char;
begin
  for i:=0 to length(s)-2 do  { -2 because s contains explicit terminating #0 }
 

BTW. This is also the case in Delphi.
https://en.delphipraxis.net/topic/1820-ucs4strings/
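
A small sketch to make the terminator visible. The program name is made up, and the input is ASCII only so that no codepage question gets in the way. It assumes UnicodeStringToUCS4String behaves as described above, i.e. Length(z) counts the trailing 0.

```pascal
program Ucs4TerminatorDemo;

{$mode objfpc}{$H+}

var
  s: UnicodeString;
  z: UCS4String;
  i: Integer;
begin
  s := 'AB';                           // ASCII only
  z := UnicodeStringToUCS4String(s);

  WriteLn('Length(s) = ', Length(s));  // 2 code units
  WriteLn('Length(z) = ', Length(z));  // 3: two codepoints plus the terminating 0

  // Stop at Length(z) - 2 to skip the terminator, as UCS4Decode does.
  for i := 0 to Length(z) - 2 do
    Write(HexStr(Ord(z[i]), 8), ' ');
  WriteLn;
end.
```

So the number of codepoints is Length(z) - 1, and a loop over the actual content runs from 0 to Length(z) - 2.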

« Last Edit: September 25, 2020, 01:58:33 pm by rvk »

Logged

Martin_fr

Administrator
Hero Member
Posts: 9855
Debugger - SynEdit - and more

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #24 on: September 25, 2020, 02:06:36 pm »

Quote from: Hartmut on September 25, 2020, 10:43:07 am

Code: Pascal [Select][+][-]
{$mode objfpc}{$H+}
var s: UnicodeString;
z: UCS4String;
i: integer;
begin
s:='AB äöüß ÄÖÜ 12';
z:= UnicodeStringToUCS4String(s);

The output shows (both for Windows and Linux):
len(s)=21
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50
len(z)=22
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50 0
Do you have an idea why it does not work?

https://wiki.freepascal.org/FPC_Unicode_support#UnicodeString.2FWideString

UnicodeString is not Utf8String

UnicodeString is a string with 16bit codeunits. (words)
Utf8String is a string with 8 bit codeunits (bytes)

Assigning
s:='AB äöüß ÄÖÜ 12';

AFAIK converts the string 'AB äöüß ÄÖÜ 12' from the source codepage to UTF-16.

I am not sure why the ä (195 164) is not converted to a single codepoint.
Probably you need to include
{$codepage utf8}
at the top of your source.

UnicodeStringToUCS4String then eliminates surrogates, fitting them into a single ucs4 codepoint.

Combining codepoints are left in place.
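
To illustrate the surrogate handling, here is a small sketch (the program name is made up). It builds the surrogate pair for U+1F0A1 by hand from its two 16-bit code units, so nothing depends on the source codepage. The values in the comments are what I would expect from the behaviour described above, not verified output.

```pascal
program SurrogateDemo;

{$mode objfpc}{$H+}

var
  s: UnicodeString;
  z: UCS4String;
begin
  // U+1F0A1 as a UTF-16 surrogate pair: high $D83C, low $DCA1.
  SetLength(s, 2);
  s[1] := WideChar($D83C);
  s[2] := WideChar($DCA1);

  z := UnicodeStringToUCS4String(s);

  WriteLn('UTF-16 code units: ', Length(s));      // 2
  WriteLn('UCS-4 codepoints:  ', Length(z) - 1);  // expected 1 (terminator not counted)
  WriteLn('First codepoint:   ', HexStr(Ord(z[0]), 8));
end.
```

A base letter followed by a combining accent would still come out as two UCS-4 values, because those are two separate codepoints.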

Logged

From the wiki: Ide Tools, Code completion and more / IDE cool features / Debugger Status

winni

Hero Member
Posts: 3197

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #25 on: September 25, 2020, 03:13:55 pm »

Hi!

Martin_fr's remark that very long strings in Lazarus are bloody slow is right.
I noticed that when I wanted to read a JSON multipolygon with the borders of Europe:

one line with > 1 million chars

But if you know the internals of UTF8 you can write a very short solution without needing any UTF8 units.

Code: Pascal [Select][+]

procedure TForm1.Button1Click(Sender: TObject);
const MyUTF : string = 'ÄÖÜ ²³¼½µ Test æſðđŋħ&#127137;test';
 
var s: string;
    i : integer= 0;
    len : integer;
    p : pchar;
begin
 
p := pchar(MyUTF);
while i < length(myUTF) do
 begin
   // the start byte tells how many bytes the codepoint occupies
   case ord(p^) of
            0..127  : len := 1;
            192..223: len := 2;
            224..239: len := 3;
            240..244: len := 4;
            else      len := 1; // not a start byte: step over one byte so len is always set
   end; // case
 setLength(s,len);
 move (p^,s[1],len);   // copy the bytes of this codepoint into s
 showMessage (s+' / '+IntToStr(len));
 inc(p,len);
 inc(i,len);
 end; //while
end;
 
 

As you can see, the length of a UTF8 char or codepoint is defined by the start byte.

There is no error checking done, but I hope the era of broken UTF8 chars is over.
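
If broken input is a concern after all, the start-byte table can be combined with a check of the following bytes. This is only a sketch: Utf8CodepointLen and the program name are made up for illustration, and it does not reject overlong encodings or encoded surrogates. Anything that is not a valid sequence is reported as length 1 so that the caller always moves forward.

```pascal
program LeadByteDemo;

{$mode objfpc}{$H+}

// Returns the byte length of the codepoint starting at S[Index].
// Invalid start bytes, stray continuation bytes and truncated
// sequences are reported as length 1.
function Utf8CodepointLen(const S: string; Index: SizeInt): Integer;
var
  Len, k: Integer;
begin
  case Ord(S[Index]) of
    $00..$7F: Len := 1;
    $C2..$DF: Len := 2;
    $E0..$EF: Len := 3;
    $F0..$F4: Len := 4;
  else
    Len := 1;  // $80..$C1 and $F5..$FF cannot start a codepoint
  end;
  // Every following byte must exist and be a continuation byte (10xxxxxx).
  for k := 1 to Len - 1 do
    if (Index + k > Length(S)) or
       ((Ord(S[Index + k]) and $C0) <> $80) then
      Exit(1);
  Result := Len;
end;

var
  Sample: string;
  i: SizeInt;
  Len: Integer;
begin
  // 'A', U+00E4, U+1F0A1, then a lone $C3 whose second byte is missing.
  Sample := 'A'#$C3#$A4#$F0#$9F#$82#$A1#$C3;
  i := 1;
  while i <= Length(Sample) do
  begin
    Len := Utf8CodepointLen(Sample, i);
    WriteLn('offset ', i, ': ', Len, ' byte(s)');
    Inc(i, Len);
  end;
end.
```

For the sample this reports lengths 1, 2, 4 and 1 at offsets 1, 2, 4 and 8. The last entry is the truncated sequence, which is stepped over as a single byte instead of reading past the end of the string.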

Winni

PS: The &#127137; in the code is a product of this editor.
It stands for 🂡, the Ace of Spades: RIP Lemmy Kilmister

« Last Edit: September 25, 2020, 03:19:14 pm by winni »

Logged

Hartmut

Hero Member
Posts: 749

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #26 on: September 25, 2020, 03:40:51 pm »

Thanks rvk for your post, but we misunderstood each other: the problem is not the additional "0" at the end. The result is wrong, because function UnicodeStringToUCS4String() should split the input into its characters (not bytes). That means an input of e.g. the character "ä" = 2 bytes = "195 164" should be converted into 1 value, not 2 values.

Thanks a lot Martin_fr for your reply. You are right, UnicodeString is not Utf8String, I did not pay attention to it. As you can see from the above output, string 's' obviously is in UTF8, not Unicode.

As recommended I added {$codepage utf8} to the top of my source. But after this string 's' contained Windows-charset (Ansi 1252?) - very strange (currently I'm on Linux):
len(s)=14
41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32
len(z)=15
41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32 00
Info: my sourcefile was already in UTF8.

Then I changed my code to:

Code: Pascal [Select][+]

...
var s0: UTF8String;
    s: UnicodeString;
begin
s0:='AB äöüß ÄÖÜ 12'; // should store as UTF8
s:=UnicodeString(s0); // should convert to Unicode to make UnicodeStringToUCS4String() happy
...

but the result again was:
len(s)=21
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32
len(z)=22
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32 00

Does anybody know, why above
s:=UnicodeString(s0);
does not convert from UTF8 to Unicode?

Logged

rvk

Hero Member
Posts: 6162

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #27 on: September 25, 2020, 04:09:21 pm »

Quote from: Hartmut on September 25, 2020, 03:40:51 pm

Does anybody know, why above
s:=UnicodeString(s0);
does not convert from UTF8 to Unicode?

String literals have always been a bit confusing to me in FPC.

Take a look at this:
https://wiki.freepascal.org/Unicode_Support_in_Lazarus#String_Literals
According to that page, assigning your literal string to a unicodestring will fail (without the {$codepage utf8}).

What version of FPC are you using??

Using const s : String = 'AB äöüß ÄÖÜ 12'; it might work better.
or even var s : String = 'AB äöüß ÄÖÜ 12';

« Last Edit: September 25, 2020, 04:13:28 pm by rvk »

Logged

PascalDragon

Hero Member
Posts: 5462
Compiler Developer

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #28 on: September 25, 2020, 04:36:09 pm »

Quote from: Hartmut on September 25, 2020, 03:40:51 pm

Then I changed my code to:
Code: Pascal [Select][+][-]
...
var s0: UTF8String;
s: UnicodeString;
begin
s0:='AB äöüß ÄÖÜ 12'; // should store as UTF8
s:=UnicodeString(s0); // should convert to Unicode to make UnicodeStringToUCS4String() happy
...
but the result again was:
len(s)=21
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32
len(z)=22
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32 00

Does anybody know, why above
s:=UnicodeString(s0);
does not convert from UTF8 to Unicode?

The following command line utility prints the correct data (I have stored it as UTF-8 with BOM):

Code: Pascal [Select][+]

program tunicode;
 
{$mode objfpc}{$H+}
{$codepage utf8}
 
var s: UnicodeString;
    u: UTF8String;
    z: UCS4String;
    i: integer;
begin
  u:='AB äöüß ÄÖÜ 12';
  s:=u;
  z:= UnicodeStringToUCS4String(s);
 
  Writeln(Length(s));
  Writeln(Length(z));
 
  for i := 1 to Length(s) do
    Write(HexStr(Ord(s[i]), 2), ' ');
  Writeln;
 
  for i := 0 to High(z) do
    Write(HexStr(Ord(z[i]), 8), ' ');
  Writeln;
end.

Output:

Code: [Select]

14
15
41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32
00000041 00000042 00000020 000000E4 000000F6 000000FC 000000DF 00000020 000000C4 000000D6 000000DC 00000020 00000031 00000032 00000000

Logged

rvk

Hero Member
Posts: 6162

Re: How to split UTF8-strings into it's characters (not bytes)?

« Reply #29 on: September 25, 2020, 04:41:29 pm »

Quote from: PascalDragon on September 25, 2020, 04:36:09 pm

u:='AB äöüß ÄÖÜ 12';

41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32

Which is strange, because the middle 4 characters (the first ones after the space) are E4 F6 FC DF.
And those are not UTF-8 characters, are they?
Every character from hex $80 upwards should take multiple bytes in UTF-8, shouldn't it?

https://en.wikipedia.org/wiki/UTF-8

Logged
