
Author Topic: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?  (Read 2975 times)

winni

  • Hero Member
  • *****
  • Posts: 1909
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #30 on: September 25, 2020, 04:48:16 pm »
Hi!

Definitely wrong.

German äöüß ÄÖÜ are all in Latin-1 Supplement and all start with $C2 or $C3.
Something is wrong.

Winni

wp

  • Hero Member
  • *****
  • Posts: 7642
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #31 on: September 25, 2020, 05:45:01 pm »
No, Martin is right:

Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
const
  s1 = #$C3#$a4;
  s2 = 'a' + #$CC#$88;
begin
  ShowMessage('Is ' + s1 + ' the same character as ' + s2 + '?');
end;
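For readers without the LCL at hand, the same point can be sketched as a console program. This is only an illustration: BytesToStr is a hypothetical helper (not from this thread) that builds the strings from raw bytes, so that no source-file code page can interfere. It assumes FPC 3.x, where UTF8Decode is available in unit system.

```pascal
program ComposedVsDecomposed;
{$mode objfpc}{$H+}

{ Hypothetical helper: builds a byte string without using a string literal. }
function BytesToStr(const b: array of Byte): RawByteString;
begin
  Result := '';
  SetLength(Result, Length(b));
  if Length(b) > 0 then
    Move(b[0], Result[1], Length(b));
end;

var
  precomposed, decomposed: RawByteString;
begin
  precomposed := BytesToStr([$C3, $A4]);       // U+00E4, a with diaeresis
  decomposed  := BytesToStr([$61, $CC, $88]);  // U+0061 'a' + U+0308 combining diaeresis

  WriteLn('bytes:       ', Length(precomposed), ' vs ', Length(decomposed));  // 2 vs 3
  WriteLn('code points: ', Length(UTF8Decode(precomposed)), ' vs ',
          Length(UTF8Decode(decomposed)));                                    // 1 vs 2
  WriteLn('equal byte-wise: ', precomposed = decomposed);                     // FALSE
end.
```

Both strings are rendered as the same glyph, yet they differ in byte count and in code point count, so any "split into characters" routine has to decide which of these units it means.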
« Last Edit: September 25, 2020, 05:47:24 pm by wp »
Mainly Lazarus trunk / fpc 3.2.0 / all 32-bit on Win-10, but many more...

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 6677
  • Debugger - SynEdit - and more
    • wiki
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #32 on: September 25, 2020, 06:49:52 pm »
Quote from: wp
No, Martin is right:

Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
const
  s1 = #$C3#$a4;
  s2 = 'a' + #$CC#$88;
begin
  ShowMessage('Is ' + s1 + ' the same character as ' + s2 + '?');
end;

But that is about composition.

It has nothing to do with the fact that, for some reason, the UnicodeString (UTF-16) does not contain the same code points as the UTF-8 string. It contains the bytes (code units) of the UTF-8 string, each extended to a word. But in UTF-16 those values have a completely different meaning.
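The difference between "bytes extended to words" and a real decode can be sketched like this. WidenBytes, BytesToStr and DumpWords are hypothetical helpers written only for this illustration; WidenBytes deliberately reproduces the faulty result, and UTF8Decode (unit system, FPC 3.x assumed) shows the intended one.

```pascal
program WidenVsDecode;
{$mode objfpc}{$H+}

{ Hypothetical helper: builds a byte string without using a string literal. }
function BytesToStr(const b: array of Byte): RawByteString;
begin
  Result := '';
  SetLength(Result, Length(b));
  if Length(b) > 0 then
    Move(b[0], Result[1], Length(b));
end;

{ Hypothetical helper: copies every byte unchanged into one 16-bit code unit.
  This is NOT a conversion; it mimics the broken input discussed above. }
function WidenBytes(const s: RawByteString): UnicodeString;
var
  i: SizeInt;
begin
  Result := '';
  SetLength(Result, Length(s));
  for i := 1 to Length(s) do
    Result[i] := WideChar(Ord(s[i]));
end;

{ Prints the UTF-16 code units of u as hex words. }
procedure DumpWords(const title: string; const u: UnicodeString);
var
  i: SizeInt;
begin
  Write(title, ':');
  for i := 1 to Length(u) do
    Write(' ', HexStr(Ord(u[i]), 4));
  WriteLn;
end;

var
  utf8: RawByteString;
begin
  utf8 := BytesToStr([$C3, $A4]);          // UTF-8 encoding of U+00E4
  DumpWords('widened', WidenBytes(utf8));  // widened: 00C3 00A4
  DumpWords('decoded', UTF8Decode(utf8));  // decoded: 00E4
end.
```

The widened string holds U+00C3 and U+00A4, two unrelated Latin-1 characters, while the decoded string holds the single code point U+00E4.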

Hartmut

  • Sr. Member
  • ****
  • Posts: 439
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #33 on: September 25, 2020, 07:22:09 pm »
Quote
Take a look at this:
https://wiki.freepascal.org/Unicode_Support_in_Lazarus#String_Literals
According to that page, assigning your literal string to a unicodestring will fail (without the {$codepage utf8}).

What version of FPC are you using?

Using const s : String = 'AB äöüß ÄÖÜ 12'; it might work better.
or even var s : String = 'AB äöüß ÄÖÜ 12';

From your link, using {$codepage utf8} should be correct, as Martin_fr recommended too. I had already tried it (see reply #26), but afterwards string 's' contained Windows-charset (Ansi 1252?), although I was on Linux at the time.

I ran all tests with FPC 3.0.4 (on Windows 7 32-bit or Linux Ubuntu 18.04 64-bit).

I tried "const s : String = 'AB äöüß ÄÖÜ 12';" with and without {$codepage utf8}; both times 's' was in UTF8-charset and 'z' was in Windows-charset (Ansi 1252?):
len(s)=21
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32
len(z)=15
41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32 00

Using "var s : String = 'AB äöüß ÄÖÜ 12';" made no difference.

Hello PascalDragon, I tried your demo (reply #28) and got the same results as you. But to be honest, I don't understand what you want to show me with that. Both 's' and 'z' show bytes, not characters, and they are in Windows-charset (Ansi 1252?), although I am currently on Linux. What I want is to split a UTF8-string into its UTF8-characters.

@rvk (reply #29) and @Winni (reply #30):
What you see is not UTF8, this is Windows-charset (Ansi 1252?), as I wrote multiple times.

@wp (reply #31) and @Martin_fr (reply #32):
From my understanding, this has nothing to do with the problem that system.UnicodeStringToUCS4String() does not work :-)

I think we have all spent (more than) enough time on this problem now; about 2 days in my case.
As noted before, I will now write the 3 very simple UTF8-functions that are all I need for my primitive usage. It should not take more than 1 hour (including testing). Tomorrow I will report whether I was successful.

Thanks a lot to everyone who tried to help me, and for all the information. Once again I learned a lot.

Thaddy

  • Hero Member
  • *****
  • Posts: 10527
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #34 on: September 25, 2020, 08:45:26 pm »
Why did you not try with a supported version, like 3.2.0 instead of the unsupported 3.0.4?

BeniBela

  • Hero Member
  • *****
  • Posts: 761
    • homepage
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #35 on: September 25, 2020, 10:07:10 pm »
Quote
Thank you for that link. I had a look at the sources and found that they all require unit LazUTF8, which I want to avoid if possible (see reply #12) for such a primitive usage as I have now. But I made a note of this link for possible future needs.

My enumerator should not require LazUTF8.

Quote
Why this? Just curious...
Code: Pascal
const
  UTF8PROC_NULLTERM = 1 shl 0;
It has no impact on code generation, but it looks silly.

That is copied directly from there: https://github.com/JuliaStrings/utf8proc/blob/master/utf8proc.h#L146-L167

Hartmut

  • Sr. Member
  • ****
  • Posts: 439
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #36 on: September 26, 2020, 09:10:36 am »
Quote from: Thaddy
Why did you not try with a supported version, like 3.2.0 instead of the unsupported 3.0.4?

I had tried with a 3.2.0 beta, but it made no difference, so I didn't mention it.

Quote from: BeniBela
My enumerator should not require LazUTF8

I had a look at both of your links (reply #16), but I did not understand much of what I saw, and I found nothing that looked like it could help me (I don't know what an "enumerator" is or how it could solve my problem). I thought both links were about updating old data files.

Hartmut

  • Sr. Member
  • ****
  • Posts: 439
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #37 on: September 26, 2020, 09:23:37 am »
Because we could not find out why system.UnicodeStringToUCS4String() did not work, and because I did not want to accept the many problems and disadvantages of unit LazUTF8 (see reply #12) for such a primitive usage as I have now, I wrote the 3 very simple UTF8-functions that I needed:
 - they work without unit LazUTF8
 - they work on Windows and Linux (32-bit and 64-bit)
 - for me they are fast enough (my UTF8-strings are no longer than about 200 characters).
With these functions I have created my little compare utility and it works perfectly.
 
Code: Pascal
// only unit system is needed
type StrUTF8 = ansistring; {a separate type allows easy changes for experiments}
     CharUTF8 = string[4]; {space for 1 UTF8-character}

function charLen_UTF8(c: char): integer;
   {returns the length in bytes of an UTF8-character which starts with 'c'}
   begin
   case ord(c) shr 4 of
      $C,$D: exit(2);
      $E:    exit(3);
      $F:    exit(4);
      else   exit(1);
   end;
   end;

function length_UTF8(s: StrUTF8): PtrInt;
   {returns the number of UTF8-characters, which UTF8-String 's' has}
   var len,i: PtrInt;
       a: integer;
   begin
   len:=0; i:=1;
   while i <= length(s) do
      begin
      a:=charLen_UTF8(s[i]); {length in bytes of current UTF8-character}
      if i+a-1 > length(s) then exit(len); {don't count incomplete characters}
      inc(len);
      inc(i,a);
      end;
   exit(len);
   end;

function getChar_UTF8(s: StrUTF8; p: PtrInt): CharUTF8;
   {returns UTF8-character with number 'p' out of UTF8-String 's' or empty
    string, if p < 1 or p is too big}
   var i,n: PtrInt;
       a: integer;
   begin
   if p < 1 then exit('');

   i:=1; n:=0; {'n' counts already found UTF8-characters}
   while i <= length(s) do
      begin
      a:=charLen_UTF8(s[i]); {length in bytes of current UTF8-character}
      if i+a-1 > length(s) then exit(''); {without incomplete characters}
      inc(n); {next valid UTF8-character was found}
      if n=p then exit(copy(s,i,a));
      inc(i,a);
      end;

   exit(''); {if 'p' was too big}
   end;
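Calling getChar_UTF8() in a loop rescans the string from the start for every character. For longer strings, a single-pass variant might look like the sketch below. It is not the code from this reply: SplitUtf8, SeqLen, TUtf8Chars and BytesToStr are hypothetical names, and the lead-byte ranges are slightly stricter than above ($C0, $C1 and $F5..$FF count as single bytes). Like the original, it does not validate continuation bytes.

```pascal
program SplitUtf8Demo;
{$mode objfpc}{$H+}

type
  TUtf8Chars = array of RawByteString;

{ Hypothetical helper: builds a byte string without using a string literal. }
function BytesToStr(const b: array of Byte): RawByteString;
begin
  Result := '';
  SetLength(Result, Length(b));
  if Length(b) > 0 then
    Move(b[0], Result[1], Length(b));
end;

{ Length in bytes of the UTF-8 sequence announced by lead byte b.
  Continuation bytes and invalid lead bytes count as 1,
  so malformed input still makes progress. }
function SeqLen(b: Byte): Integer;
begin
  case b of
    $C2..$DF: Result := 2;
    $E0..$EF: Result := 3;
    $F0..$F4: Result := 4;
  else
    Result := 1;
  end;
end;

{ Splits s into its UTF-8 encoded code points in one pass. }
function SplitUtf8(const s: RawByteString): TUtf8Chars;
var
  i, n: SizeInt;
  len: Integer;
begin
  Result := nil;
  SetLength(Result, Length(s));   // upper bound: one entry per byte
  i := 1;
  n := 0;
  while i <= Length(s) do
  begin
    len := SeqLen(Ord(s[i]));
    if i + len - 1 > Length(s) then
      Break;                      // drop a truncated trailing sequence
    Result[n] := Copy(s, i, len);
    Inc(n);
    Inc(i, len);
  end;
  SetLength(Result, n);           // shrink to the number actually found
end;

var
  parts: TUtf8Chars;
  k: SizeInt;
  j: Integer;
begin
  // 'a', U+00E4, U+20AC (euro sign), '1' as UTF-8 bytes
  parts := SplitUtf8(BytesToStr([$61, $C3, $A4, $E2, $82, $AC, $31]));
  WriteLn('count=', Length(parts));   // count=4
  for k := 0 to High(parts) do
  begin
    for j := 1 to Length(parts[k]) do
      Write(HexStr(Ord(parts[k][j]), 2), ' ');
    WriteLn;
  end;
end.
```

Note that this splits into code points, so a decomposed 'a' + U+0308 pair still comes back as two entries.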
         
Thanks again a lot to all who tried to help me.
« Last Edit: September 26, 2020, 09:27:59 am by Hartmut »

PascalDragon

  • Hero Member
  • *****
  • Posts: 2278
  • Compiler Developer
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #38 on: September 26, 2020, 09:53:06 am »
Quote from: Hartmut
Hello PascalDragon, I tried your demo (reply #28) and got the same results as you. But to be honest, I don't understand what you want to show me with that. Both 's' and 'z' show bytes, not characters, and they are in Windows-charset (Ansi 1252?), although I am currently on Linux. What I want is to split a UTF8-string into its UTF8-characters.

If you look at the assembly code you'll see that the string itself is encoded as UTF-8 and it's already correctly converted to UTF-16 upon the assignment. In the case of your specific example nothing obvious happens, because all characters you used are code points <= $FF.

Quote from: Hartmut
Because we could not find out why system.UnicodeStringToUCS4String() did not work, and because I did not want to accept the many problems and disadvantages of unit LazUTF8 (see reply #12) for such a primitive usage as I have now, I wrote the 3 very simple UTF8-functions that I needed:

No, it's not UnicodeStringToUCS4String that did not work; it worked absolutely correctly with the input it got. What did not work is your input string. It was not correctly converted to a valid UTF-16 string, and that is the problem.
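For UTF-8 data that only exists at run time, one way to hand UnicodeStringToUCS4String() a valid UTF-16 string is to decode explicitly first. The following is a sketch under assumptions, not the demo from this thread: it assumes FPC 3.x, uses UTF8Decode from unit system, and BytesToStr is a hypothetical helper standing in for whatever function delivers the UTF-8 bytes.

```pascal
program Utf8ToCodePoints;
{$mode objfpc}{$H+}

{ Hypothetical helper: stands in for a function that returns UTF-8 bytes. }
function BytesToStr(const b: array of Byte): RawByteString;
begin
  Result := '';
  SetLength(Result, Length(b));
  if Length(b) > 0 then
    Move(b[0], Result[1], Length(b));
end;

var
  utf8: RawByteString;
  u: UnicodeString;
  z: UCS4String;
  i: SizeInt;
begin
  // 'A', U+00E4, U+00DF, U+20AC (euro sign) as UTF-8 bytes
  utf8 := BytesToStr([$41, $C3, $A4, $C3, $9F, $E2, $82, $AC]);

  u := UTF8Decode(utf8);               // explicit UTF-8 -> UTF-16 conversion
  z := UnicodeStringToUCS4String(u);   // UTF-16 -> one element per code point

  WriteLn('len(utf8)=', Length(utf8), ' len(u)=', Length(u));
  // The UCS4String carries a trailing 0 element, which matches the
  // extra "00" at the end of the len(z)=15 dump earlier in the thread.
  for i := 0 to High(z) do
    Write(HexStr(LongInt(z[i]), 4), ' ');
  WriteLn;
end.
```

Each element of 'z' before the terminator is then one code point, regardless of how many bytes it occupied in the UTF-8 input.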

Hartmut

  • Sr. Member
  • ****
  • Posts: 439
Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
« Reply #39 on: September 26, 2020, 10:42:09 am »
Quote from: Hartmut
Hello PascalDragon, I tried your demo (reply #28) and got the same results as you. But to be honest, I don't understand what you want to show me with that. Both 's' and 'z' show bytes, not characters, and they are in Windows-charset (Ansi 1252?), although I am currently on Linux. What I want is to split a UTF8-string into its UTF8-characters.

Quote from: PascalDragon
If you look at the assembly code you'll see that the string itself is encoded as UTF-8 and it's already correctly converted to UTF-16 upon the assignment. In the case of your specific example nothing obvious happens, because all characters you used are code points <= $FF.

I'm not familiar with reading assembly code. But you wrote in reply #28 "I have stored it as UTF-8 with BOM". That made me guess that you wanted to add an attachment, but there is none.

Quote from: Hartmut
Because we could not find out why system.UnicodeStringToUCS4String() did not work, and because I did not want to accept the many problems and disadvantages of unit LazUTF8 (see reply #12) for such a primitive usage as I have now, I wrote the 3 very simple UTF8-functions that I needed:

Quote from: PascalDragon
No, it's not UnicodeStringToUCS4String that did not work; it worked absolutely correctly with the input it got. What did not work is your input string. It was not correctly converted to a valid UTF-16 string, and that is the problem.

I believe you are right that UnicodeStringToUCS4String() "only" got the wrong input. But you see how many experiments we tried in order to solve the problem, and nobody found a solution that worked.

PascalDragon

  • Hero Member
  • *****
  • Posts: 2278
  • Compiler Developer
Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
« Reply #40 on: September 26, 2020, 12:06:49 pm »
Quote from: Hartmut
Hello PascalDragon, I tried your demo (reply #28) and got the same results as you. But to be honest, I don't understand what you want to show me with that. Both 's' and 'z' show bytes, not characters, and they are in Windows-charset (Ansi 1252?), although I am currently on Linux. What I want is to split a UTF8-string into its UTF8-characters.

Quote from: PascalDragon
If you look at the assembly code you'll see that the string itself is encoded as UTF-8 and it's already correctly converted to UTF-16 upon the assignment. In the case of your specific example nothing obvious happens, because all characters you used are code points <= $FF.

Quote from: Hartmut
I'm not familiar with reading assembly code. But you wrote in reply #28 "I have stored it as UTF-8 with BOM". That made me guess that you wanted to add an attachment, but there is none.

No, I had not intended to attach a project. And as you wrote, it worked for you, so I don't need to attach one anyway.

Quote from: Hartmut
Because we could not find out why system.UnicodeStringToUCS4String() did not work, and because I did not want to accept the many problems and disadvantages of unit LazUTF8 (see reply #12) for such a primitive usage as I have now, I wrote the 3 very simple UTF8-functions that I needed:

Quote from: PascalDragon
No, it's not UnicodeStringToUCS4String that did not work; it worked absolutely correctly with the input it got. What did not work is your input string. It was not correctly converted to a valid UTF-16 string, and that is the problem.

Quote from: Hartmut
I believe you are right that UnicodeStringToUCS4String() "only" got the wrong input. But you see how many experiments we tried in order to solve the problem, and nobody found a solution that worked.

I have attached an example LCL project that shows the important point:
- either your file needs to be stored as “UTF-8 with BOM” (you need to do a right click in the editor, go to “File Settings” (or similar, I'm using German) and then “Character Encoding”, confirm the dialog to change the file)
- or you need to store it as “UTF-8” and add {$codepage utf8}

In both cases the constant string data will be stored as UTF-16 if you simply assign it to a UnicodeString (or as UTF-8 data if you assign it to a String or UTF8String).

wp

  • Hero Member
  • *****
  • Posts: 7642
Re: How to split UTF8-strings into it's characters (not bytes)?
« Reply #41 on: September 26, 2020, 12:13:28 pm »
Quote from: Hartmut
With unit LazUTF8 I faced a lot of problems and disadvantages in the past. Some examples I remember immediately:
 - on Windows, in console programs, adding unit LazUTF8 changes the charset of the filenames reported by sysutils.FindFirst() and FindNext(), and the charset of the results of ParamStr() and readln(). Without unit LazUTF8 they return Windows-charset (Ansi 1252?); with unit LazUTF8 they return UTF8. This difference does not make life easy.
 - over decades I have written a couple of common libraries for console programs, some of which have problems with these differing charsets, so I get wrong results with unit LazUTF8 because of the changed charset
 - for a couple of programs I still use FPC 2.6.4 (with the same common libraries), but in FPC 2.6.4 the above functions never return UTF8 on Windows
 - Windows-charset generally is much easier than UTF8 (as we see now), because each char is only 1 byte long
 - in some cases (but not always) I had the problem that writeln() for an 'ansistring' showed damaged characters for Ä Ö Ü ä ö ü ß after I added unit LazUTF8.
And there were more issues, which I just don't remember right now. So I want to avoid unit LazUTF8, especially in console programs, wherever possible.

To be honest, I do not fully understand your description, but you seem to have issues with the usage of LazUTF8. I cannot imagine that LazUTF8 is the source of such errors; I use it in many projects and don't have any problems with it. But the one thing that I learned while trying to understand many issues with UTF8 is that a conversion in the wrong place can cause unrecoverable errors. So please check your code and make sure that every string conversion is in the right place and is actually needed. And don't expect programs written for fpc 3.x to work correctly with 2.6.4 or older, because v3 introduces a massive change in string handling.

I am attaching a demo for FindFirst which finds a file "testäöü.txt", which should cause issues according to your description. For me, it does not. The filename is displayed in the console correctly with fpc 3.2 as well as 2.6.4 (after conversion). (NOTE: my Windows code page is 1252. If yours is different, the filename may appear differently than shown here.) And it does not make a difference whether LazUTF8 is linked in or not (activate/deactivate the define USE_LAZUTF8).
« Last Edit: September 26, 2020, 12:15:29 pm by wp »

Hartmut

  • Sr. Member
  • ****
  • Posts: 439
Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
« Reply #42 on: September 26, 2020, 02:09:16 pm »
Quote from: PascalDragon
I have attached an example LCL project that shows the important point:
- either your file needs to be stored as “UTF-8 with BOM” (you need to do a right click in the editor, go to “File Settings” (or similar, I'm using German) and then “Character Encoding”, confirm the dialog to change the file)
- or you need to store it as “UTF-8” and add {$codepage utf8}

Thanks, PascalDragon, for your demo. I attached its output as screenshot1. I see exactly the same result as in your reply #28, and the output again is not UTF8, which is what I need.
BTW: my source had already been stored as UTF8 and I had already used {$codepage utf8} before, as mentioned in reply #26. And in my real world the UTF8-strings which I want to compare are not constants; they are returned in UTF8 from a function.

Quote
In both cases the constant string data will be stored as UTF-16 if you simply assign it to a UnicodeString (or as UTF-8 data if you assign it to a String or UTF8String).

Then I made 2 more tests after changing var 'u' to String and to UTF8String. Their outputs were identical; I attached them as screenshot2. Now we see that the input shows the single bytes of UTF8, but the output is identical to screenshot1: nothing has changed. That's why I think that I can't use UnicodeStringToUCS4String() in my case.

What I needed was to split a UTF8-string into its UTF8-characters, not into something else. Therefore I wrote my own functions yesterday (see reply #37), and the problem is solved.

@wp: Thanks for your reply, I will check it and report later.

Hartmut

  • Sr. Member
  • ****
  • Posts: 439
Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
« Reply #43 on: September 26, 2020, 06:09:48 pm »
Quote from: wp
And don't expect programs written for fpc 3.x to work correctly with 2.6.4 or older, because v3 introduces a massive change in string handling.

Oh, there we misunderstood each other: I do not try to run programs written for FPC 3.x with FPC 2.6.4. What I have is a couple of "common libraries", which exist only once, and I use them both for all new programs with FPC 3.x and for a couple of older programs which I still compile with FPC 2.6.4, because they were originally written for 2.6.4 and (until now) I have not updated them to 3.x.

Quote
I am attaching a demo for FindFirst which finds a file "testäöü.txt", which should cause issues according to your description. For me, it does not. The filename is displayed in the console correctly with fpc 3.2 as well as 2.6.4 (after conversion). (NOTE: my Windows code page is 1252. If yours is different, the filename may appear differently than shown here.) And it does not make a difference whether LazUTF8 is linked in or not (activate/deactivate the define USE_LAZUTF8).

Thank you for your demo and the time you invested to help me. I found out that activating/deactivating the define USE_LAZUTF8 made no difference. The reason is that unit LConvEncoding is used, which contains:
Code: Pascal
uses SysUtils, Classes, dos, LazUTF8
so LazUTF8 was always included and you saw no difference.

After deactivating units LazUTF8 and LConvEncoding I had exactly the difference, which I was talking about. I added some lines to make the difference clearer:
Code: Pascal
begin
...
  if FindFirst('test*.*', faAnyFile, SR) = 0 then // to catch file 'testäöü.txt'
  begin
    repeat
      WriteLn('Filename as returned by FindFirst: ', SR.Name);
      for i:=1 to length(SR.Name) do  write(HexStr(ord(SR.Name[i]),2), ' ');
      writeln('len=', length(SR.Name));
...
    until FindNext(SR) <> 0;
    FindClose(SR);
  end;
...
end.

Here is the output, depending on FPC version and whether LazUTF8 (including LConvEncoding) was enabled or not (all on Windows 7 32-bit):
FPC   LazUTF8  Output
---------------------
3.0.4   yes    74 65 73 74 5F C3 A4 C3 B6 C3 BC 2E 74 78 74 len=15
3.0.4   no     74 65 73 74 5F E4 F6 FC 2E 74 78 74 len=12
2.6.4   yes    74 65 73 74 5F E4 F6 FC 2E 74 78 74 len=12
2.6.4   no     74 65 73 74 5F E4 F6 FC 2E 74 78 74 len=12 


We see that in FPC 3.x the same FindFirst() function switches the charset of the reported filenames between UTF8 and Windows-charset, depending on whether you include unit LazUTF8 or not. All procedures and functions which deal with filenames switch in the same way. And as I wrote, the charsets of the results of ParamStr() and readln() play the same game. There might be more (which I just don't remember right now).

I hope that you will now believe that my list of problems and disadvantages with unit LazUTF8 is real.

nanobit

  • Jr. Member
  • **
  • Posts: 86
Re: [SOLVED] How to split UTF8-strings into it's characters (not bytes)?
« Reply #44 on: September 26, 2020, 06:34:44 pm »
You should update to FPC 3.2.

 
