Author Topic: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?  (Read 10490 times)

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #15 on: September 24, 2020, 11:29:10 pm »

The additions of the most recent Unicode version (namely 13.0 from March 2020) are listed here.

Yeah - Emojis are fighting for women's liberation!

Now we got MRS SANTA CLAUS  !!!

Instead of abolishing the Christmas trash, they create new stuff.


Winni

BeniBela

  • Hero Member
  • *****
  • Posts: 905
    • homepage
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #16 on: September 25, 2020, 12:31:20 am »
I have built my own enumerator for that: http://hg.benibela.de/bbutils/file/a94b6026f7d0/bbutils.pas#l509

Here: https://wiki.freepascal.org/UTF8_Tools is wiki for UTF8Tools mentioned by Bart. There's download link at the bottom.
How up to date is it?

It seems to have a hardcoded datafile, and the file date is from 2008?
There are new chars added to the unicode standard every year.

Then it is from 2008.

I took that file and updated it in 2016. Guess it is time to update it again: http://hg.benibela.de/internettools/file/default/data/bbunicodeinfo.pas


Thaddy

  • Hero Member
  • *****
  • Posts: 14157
  • Probably until I exterminate Putin.
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #17 on: September 25, 2020, 08:41:37 am »
Why this? Just curious...
Code: Pascal
const
  UTF8PROC_NULLTERM = 1 shl 0;
It has no impact on code generation, but it looks silly.
« Last Edit: September 25, 2020, 08:49:06 am by Thaddy »
Specialize a type, not a var.

PascalDragon

  • Hero Member
  • *****
  • Posts: 5444
  • Compiler Developer
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #18 on: September 25, 2020, 09:08:33 am »

The additions of the most recent Unicode version (namely 13.0 from March 2020) are listed here.

Yeah - Emojis are fighting for women's liberation!

It's about diversity. The Unicode consortium is working on getting the gender specific emojis done in both a variant of the other gender and a gender-neutral one. I personally - as a genderfluid person - appreciate that very much.

Hartmut

  • Hero Member
  • *****
  • Posts: 739
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #19 on: September 25, 2020, 10:43:07 am »
Thanks again to all for your many replies. I will continue to answer them one by one.

@Martin_fr: (from reply #4 and #6)
I had never heard of Codepoints before. Thanks for clarification. I will keep it in mind.


UCS4Char is a 32 bit char, UCS4String is an array of UCS4Char.
function UnicodeStringToUCS4String(const s: UnicodeString): UCS4String;
The function converts a Unicode string to a UCS-4 encoded string.
https://www.freepascal.org/docs-html/rtl/system/unicodestringtoucs4string.html
This sounds interesting, because it does not seem to need unit LazUTF8. But I could not get it to work. Your link says that a widestring manager is required. After searching on Google I found that on Linux "uses cwstring" should do the job. For Windows, all I could find in a reasonable time was that "uses LazUTF8" would automatically include a widestring manager. But it did not work on either OS (FPC 3.0.4). Here is my demo:

Code: Pascal
{$mode objfpc}{$H+}

{$IFDEF LINUX}
   uses cwstring; // install widestring manager
{$ENDIF}
{$IFDEF WINDOWS}
   uses LazUTF8;  // should include the widestring manager (?)
{$ENDIF}

procedure test1;
   var s: UnicodeString;
       z: UCS4String;
       i: integer;
   begin
   s:='AB äöüß ÄÖÜ 12';
   writeln('len(s)=', length(s));
   for i:=1 to length(s) do  write(ord(s[i]), ' ');
   writeln;

   z:= UnicodeStringToUCS4String(s);
   writeln('len(z)=', Length(z));
   for i:=0 to High(z) do  write(ord(z[i]), ' ');
   writeln;
   end;

The output shows (both for Windows and Linux):
len(s)=21
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50
len(z)=22
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50 0

Do you have an idea why it does not work?


Here: https://wiki.freepascal.org/UTF8_Tools is wiki for UTF8Tools mentioned by Bart. There's download link at the bottom.
Thank you for that link. I had a look at the sources and found that they all require unit LazUTF8, which I want to avoid if possible (see reply #12) for such a primitive usage as I have now. But I made a note of this link for possible future needs.


When the installer is built, the ppu will contain the path to the unit source, as it is on the machine that was used to build the installer.
So unless you recompile those units, the compiler thinks that is the path where the unit is (or used to be).
Thanks for this info, I did not know that before.

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #20 on: September 25, 2020, 12:31:12 pm »
@Hartmut

Hi!

Don't get confused by all the info about UTF8.

I'll show you a very simple way to separate the UTF8 chars of a string into a
StringList.

Code: Pascal
uses ....LazUTF8, lclType;


procedure TForm1.Button4Click(Sender: TObject);
const   MyUTF8 = 'Wir müssen uns nicht ärgern über UTF8! 👀 ';
var St : TStringList;
    UChar : TUTF8Char;
    i : integer;
begin
St := TStringList.Create;
try
  // UTF8Copy counts in codepoints, not bytes
  for i := 1 to Utf8Length(MyUTF8) do
     begin
       UChar := UTF8Copy (MyUTF8,i,1);
       St.add(UChar);
     end;
  showMessage (St.Text);
finally
  St.Free; // free the list even if an exception occurs
end;
end;

Winni

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9754
  • Debugger - SynEdit - and more
    • wiki
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #21 on: September 25, 2020, 01:05:43 pm »
I'll show you a very simple way to separate the UTF8 chars of a string into a
StringList.

Code: Pascal
for i := 1 to Utf8Length(MyUTF8) do
   begin
     UChar := UTF8Copy (MyUTF8,i,1);

Fine for shorter strings...

But try that on longer strings. Let's say 100,000 bytes long. That takes 12 seconds on an I9 8600K @4.7GHz (no debugging / O3). And that is without adding to the stringlist, only doing 100,000 Utf8Copy calls.

Try it with 200,000 => 50 seconds.

It's O(n^2): each Utf8Copy(MyUTF8, i, 1) has to scan from the start of the string to find the i-th codepoint, so doubling the input size roughly quadruples the run time.


Something like this should do the job (it extracts codepoints, just as Utf8Copy does):

Code: Pascal
 CurCharStart := 1;
 // "<=" (not "<") so that the last codepoint of the string is processed too
 while CurCharStart <= Length(MyUTF8) do begin
   NextCharStart := CurCharStart + 1;
   // continuation bytes have the bit pattern 10xxxxxx
   while (NextCharStart <= Length(MyUTF8)) and ((ord(MyUTF8[NextCharStart]) and $C0) = $80) do
     inc(NextCharStart);
   UChar := copy(MyUTF8, CurCharStart, NextCharStart - CurCharStart);

   // Process the codepoint in UChar

   CurCharStart := NextCharStart;
 end;

Hartmut

  • Hero Member
  • *****
  • Posts: 739
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #22 on: September 25, 2020, 01:30:17 pm »
Thanks Winni for that demo. It's easy to understand, but it also requires unit LazUTF8, which I want to avoid (see reply #12) for such a primitive usage as I have now.
If nobody gets function system.UnicodeStringToUCS4String() from Aidex to work (see problem in reply #19), then I will write the 3 very simple UTF8 functions that are all I need for my primitive usage.

Thanks Martin_fr for that improvement. If I write my own functions, I was thinking of something like that. But I still hope that someone finds out why function system.UnicodeStringToUCS4String() from Aidex does not work in my case (see reply #19).

rvk

  • Hero Member
  • *****
  • Posts: 6056
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #23 on: September 25, 2020, 01:51:39 pm »
The output shows (both for Windows and Linux):
len(s)=21
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50
len(z)=22
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50 0

Do you have an idea why it does not work?
UCS4String always has a 'hard' #0 termination (explicit terminating #0).
This is seen in UnicodeStringToUCS4String() and the called UCS4Encode().
(reslen is reset before filling the string and there is a hard #0 placed at the end)

You can see that UCS4StringToUnicodeString and UCS4StringToWideString both strip that #0. If there were no #0, all these functions would fail.
You can also see it in UCS4Decode():
Code: Pascal
procedure UCS4Decode(const s: UCS4String; dest: PWideChar);
var
  i: sizeint;
  nc: UCS4Char;
begin
  for i:=0 to length(s)-2 do  { -2 because s contains explicit terminating #0 }

BTW. This is also the case in Delphi.
https://en.delphipraxis.net/topic/1820-ucs4strings/
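A minimal sketch of what the explicit terminator means when you walk a UCS4String yourself: the codepoint count is Length(z) - 1, and a loop over the codepoints runs to Length(z) - 2. Ucs4CodepointCount is a hypothetical helper name, and the input is pure ASCII so that source file encoding plays no role.

```pascal
program Ucs4Terminator;
{$mode objfpc}{$H+}

// Hypothetical helper: number of codepoints in Z,
// not counting the explicit terminating #0.
function Ucs4CodepointCount(const Z: UCS4String): SizeInt;
begin
  Result := Length(Z) - 1;
  if Result < 0 then
    Result := 0;               // empty (nil) array
end;

var
  S: UnicodeString;
  Z: UCS4String;
  I: SizeInt;
begin
  S := 'AB 12';                // 5 ASCII characters
  Z := UnicodeStringToUCS4String(S);
  WriteLn('Length(Z)  = ', Length(Z));
  WriteLn('codepoints = ', Ucs4CodepointCount(Z));
  // Stop at Length(Z) - 2 to leave out the terminating #0.
  for I := 0 to Length(Z) - 2 do
    Write(HexStr(Ord(Z[I]), 8), ' ');
  WriteLn;
end.
```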
« Last Edit: September 25, 2020, 01:58:33 pm by rvk »

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9754
  • Debugger - SynEdit - and more
    • wiki
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #24 on: September 25, 2020, 02:06:36 pm »
Code: Pascal
{$mode objfpc}{$H+}
   var s: UnicodeString;
       z: UCS4String;
       i: integer;
   begin
   s:='AB äöüß ÄÖÜ 12';
   z:= UnicodeStringToUCS4String(s);

The output shows (both for Windows and Linux):
len(s)=21
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50
len(z)=22
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50 0

Do you have an idea why it does not work?

https://wiki.freepascal.org/FPC_Unicode_support#UnicodeString.2FWideString

UnicodeString is not Utf8String

UnicodeString is a string with 16bit codeunits. (words)
Utf8String is a string with 8 bit codeunits (bytes)

Assigning
   s:='AB äöüß ÄÖÜ 12';

AFAIK converts the string  'AB äöüß ÄÖÜ 12'; from the source codepage to Utf16.

I am not sure why the ä (195 164) is not converted to a single codepoint.
Probably you need to include
  {$codepage utf8}
on top of your source.

UnicodeStringToUCS4String then eliminates surrogates, fitting them into a single ucs4 codepoint.

Combining codepoints are left in place.
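A small sketch of these last two points, assuming the RTL behaves as described above. The strings are built from explicit UTF-16 code units so that the source codepage does not matter; CodepointCount is a hypothetical helper name.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: codepoints in S after conversion to UCS-4,
// not counting the terminating #0 of the UCS4String.
function CodepointCount(const S: UnicodeString): SizeInt;
begin
  Result := Length(UnicodeStringToUCS4String(S)) - 1;
end;

var
  Pair, Comb: UnicodeString;
  Z: UCS4String;
begin
  // U+1F440 needs a surrogate pair in UTF-16: D83D DC40
  SetLength(Pair, 2);
  Pair[1] := WideChar($D83D);
  Pair[2] := WideChar($DC40);

  // 'a' followed by U+0308 (combining diaeresis): two codepoints, one glyph
  SetLength(Comb, 2);
  Comb[1] := WideChar($0061);
  Comb[2] := WideChar($0308);

  Z := UnicodeStringToUCS4String(Pair);
  WriteLn('surrogate pair: ', Length(Pair), ' code units, ',
          CodepointCount(Pair), ' codepoint(s), first = $',
          HexStr(Ord(Z[0]), 8));
  WriteLn('combining mark: ', Length(Comb), ' code units, ',
          CodepointCount(Comb), ' codepoint(s)');
end.
```

The surrogate pair collapses into the single value $1F440, while the base letter plus combining mark stays as two codepoints. Splitting into what a user sees as one character would need grapheme cluster handling on top of this.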

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #25 on: September 25, 2020, 03:13:55 pm »
Hi!

Martin_fr's remark that very long strings in Lazarus are bloody slow is right.
I noticed that when I wanted to read a JSON multipolygon with the borders of Europe:

one line with > 1 million chars

But if you know the internals of UTF8 you can write a very short solution without needing some UTF8 units.

Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
const MyUTF : string = 'ÄÖÜ ²³¼½µ Test æſðđŋħ&#127137;test';

var s: string;
    i : integer= 0;
    len : integer;
    p : pchar;
begin

p := pchar(MyUTF);
while i < length(myUTF) do
 begin
   // the start byte determines the length of the UTF8 sequence
   case ord(p^) of
            0..127  : len := 1;
            192..223: len := 2;
            224..239: len := 3;
            240..244: len := 4;
            else      len := 1; // stray continuation or invalid byte: step one byte so len is never undefined
   end; // case
 setLength(s,len);
 move (p^,s[1],len);
 showMessage (s+' / '+IntToStr(len));
 inc(p,len);
 inc(i,len);
 end; //while
end;

As you can see the length of a UTF8char or codepoint is defined through the start byte. 

There is no error checking done but I hope the era of broken UTF8chars is over.
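For anyone who does want some checking, here is a hedged sketch that decodes one codepoint at a time without any UTF8 unit. Utf8SeqLen and Utf8Next are hypothetical names. It rejects invalid start bytes, truncated sequences and bad continuation bytes, but it does not detect every overlong 3- or 4-byte form or encoded surrogates.

```pascal
program Utf8DecodeDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: sequence length implied by a start byte, 0 if invalid.
function Utf8SeqLen(B: Byte): Integer;
begin
  case B of
    $00..$7F: Result := 1;
    $C2..$DF: Result := 2;     // $C0 and $C1 would be overlong encodings
    $E0..$EF: Result := 3;
    $F0..$F4: Result := 4;
  else
    Result := 0;               // continuation byte or invalid start byte
  end;
end;

// Decodes the codepoint starting at byte Idx (1-based, Idx <= Length(S))
// and advances Idx. On error it returns False and advances Idx by one byte.
function Utf8Next(const S: string; var Idx: SizeInt; out CP: Cardinal): Boolean;
var
  N, K: Integer;
begin
  CP := 0;
  N := Utf8SeqLen(Ord(S[Idx]));
  if (N = 0) or (Idx + N - 1 > Length(S)) then
  begin
    Inc(Idx);
    Exit(False);
  end;
  if N = 1 then
    CP := Ord(S[Idx])
  else
    CP := Ord(S[Idx]) and ($FF shr (N + 1));   // payload bits of the start byte
  for K := 1 to N - 1 do
  begin
    if (Ord(S[Idx + K]) and $C0) <> $80 then   // must look like 10xxxxxx
    begin
      Inc(Idx);
      Exit(False);
    end;
    CP := (CP shl 6) or (Ord(S[Idx + K]) and $3F);
  end;
  Inc(Idx, N);
  Result := True;
end;

var
  S: string;
  Idx: SizeInt;
  CP: Cardinal;
begin
  // 'A', U+00E4 and U+1F0A1 (Ace of Spades) written as raw UTF-8 bytes
  S := 'A'#$C3#$A4#$F0#$9F#$82#$A1;
  Idx := 1;
  while Idx <= Length(S) do
    if Utf8Next(S, Idx, CP) then
      WriteLn('U+', HexStr(LongInt(CP), 6))
    else
      WriteLn('invalid byte');
end.
```

Advancing by a single byte after an error keeps the loop moving and lets decoding resynchronise at the next valid start byte.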

Winni

PS &#127137; is a product of this editor.
It is the Ace of Spades: RIP Lemmy Kilmister

 
« Last Edit: September 25, 2020, 03:19:14 pm by winni »

Hartmut

  • Hero Member
  • *****
  • Posts: 739
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #26 on: September 25, 2020, 03:40:51 pm »
Thanks rvk for your post, but we misunderstood each other: the problem is not the additional "0" at the end. The result is wrong, because function UnicodeStringToUCS4String() should split the input into its characters (not bytes). That means an input of e.g. the character "ä" = 2 bytes = "195 164" should be converted into 1 value, not 2 values.

Thanks a lot Martin_fr for your reply. You are right, UnicodeString is not Utf8String, I did not pay attention to it. As you can see from the above output, string 's' obviously is in UTF8, not Unicode.

As recommended I added {$codepage utf8} to the top of my source. But after this string 's' contained Windows-charset (Ansi 1252?) - very strange (currently I'm on Linux):
len(s)=14
41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32
len(z)=15
41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32 00

Info: my sourcefile was already in UTF8.

Then I changed my code to:
Code: Pascal
...
var s0: UTF8String;
    s: UnicodeString;
begin
s0:='AB äöüß ÄÖÜ 12'; // should store as UTF8
s:=UnicodeString(s0); // should convert to Unicode to make UnicodeStringToUCS4String() happy
...
but the result again was:
len(s)=21
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32
len(z)=22
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32 00


Does anybody know, why above
   s:=UnicodeString(s0);
does not convert from UTF8 to Unicode?

rvk

  • Hero Member
  • *****
  • Posts: 6056
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #27 on: September 25, 2020, 04:09:21 pm »
Does anybody know, why above
   s:=UnicodeString(s0);
does not convert from UTF8 to Unicode?
String literals have always been a bit confusing to me in FPC.

Take a look at this:
https://wiki.freepascal.org/Unicode_Support_in_Lazarus#String_Literals
According to that page, assigning your literal string to a unicodestring will fail (without the {$codepage utf8}).

What version of FPC are you using??

Using const s : String = 'AB äöüß ÄÖÜ 12'; it might work better.
or even var s : String = 'AB äöüß ÄÖÜ 12';
« Last Edit: September 25, 2020, 04:13:28 pm by rvk »

PascalDragon

  • Hero Member
  • *****
  • Posts: 5444
  • Compiler Developer
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #28 on: September 25, 2020, 04:36:09 pm »
Then I changed my code to:
Code: Pascal
...
var s0: UTF8String;
    s: UnicodeString;
begin
s0:='AB äöüß ÄÖÜ 12'; // should store as UTF8
s:=UnicodeString(s0); // should convert to Unicode to make UnicodeStringToUCS4String() happy
...
but the result again was:
len(s)=21
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32
len(z)=22
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32 00


Does anybody know, why above
   s:=UnicodeString(s0);
does not convert from UTF8 to Unicode?

The following command line utility prints the correct data (I have stored it as UTF-8 with BOM):

Code: Pascal
program tunicode;

{$mode objfpc}{$H+}
{$codepage utf8}

var s: UnicodeString;
    u: UTF8String;
    z: UCS4String;
    i: integer;
begin
  u:='AB äöüß ÄÖÜ 12';
  s:=u;
  z:= UnicodeStringToUCS4String(s);

  Writeln(Length(s));
  Writeln(Length(z));

  for i := 1 to Length(s) do
    Write(HexStr(Ord(s[i]), 2), ' ');
  Writeln;

  for i := 0 to High(z) do
    Write(HexStr(Ord(z[i]), 8), ' ');
  Writeln;
end.

Output:

Code:
14
15
41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32
00000041 00000042 00000020 000000E4 000000F6 000000FC 000000DF 00000020 000000C4 000000D6 000000DC 00000020 00000031 00000032 00000000

rvk

  • Hero Member
  • *****
  • Posts: 6056
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #29 on: September 25, 2020, 04:41:29 pm »
  u:='AB äöüß ÄÖÜ 12';

41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32
Which is strange because the middle 4 characters (the first ones after the space) are E4 F6 FC DF.
And those are not UTF-8 characters, are they?
Every character at or above hex $80 should take multiple bytes in UTF-8, shouldn't it?

https://en.wikipedia.org/wiki/UTF-8
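A short sketch that may help with this question. The loop in the program above prints the elements of the UnicodeString s, so E4 F6 FC DF are UTF-16 code units (shown with two hex digits), not UTF-8 bytes. For these characters the code unit equals the codepoint, for example U+00E4 for "ä", while the UTF-8 form of the same codepoint is the two bytes C3 A4. Utf8HexOf is a hypothetical helper that encodes a codepoint by hand for illustration.

```pascal
program EncodeDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: UTF-8 bytes of codepoint CP as a hex string.
// Assumes CP <= $10FFFF and is not a surrogate; no validation is done.
function Utf8HexOf(CP: Cardinal): string;
begin
  if CP < $80 then
    Result := HexStr(LongInt(CP), 2)
  else if CP < $800 then
    Result := HexStr(LongInt($C0 or (CP shr 6)), 2) + ' ' +
              HexStr(LongInt($80 or (CP and $3F)), 2)
  else if CP < $10000 then
    Result := HexStr(LongInt($E0 or (CP shr 12)), 2) + ' ' +
              HexStr(LongInt($80 or ((CP shr 6) and $3F)), 2) + ' ' +
              HexStr(LongInt($80 or (CP and $3F)), 2)
  else
    Result := HexStr(LongInt($F0 or (CP shr 18)), 2) + ' ' +
              HexStr(LongInt($80 or ((CP shr 12) and $3F)), 2) + ' ' +
              HexStr(LongInt($80 or ((CP shr 6) and $3F)), 2) + ' ' +
              HexStr(LongInt($80 or (CP and $3F)), 2);
end;

var
  S: UnicodeString;
begin
  SetLength(S, 1);
  S[1] := WideChar($E4);       // U+00E4 as one UTF-16 code unit
  WriteLn('UTF-16 code unit: ', HexStr(Ord(S[1]), 4));
  WriteLn('UTF-8 bytes:      ', Utf8HexOf($E4));
end.
```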

 
