AFAIK UTF8-characters can have a length of 1 to 4 bytes. Correct?
Correct.
Is there a predefined type to store 1 UTF8-character? Or must I use something like string[4]? The only thing I found is 'system.WideChar', but this can hold only 2 bytes.
Yes. CP_UTF8 type for AnsiString. It is something like this:
I want to split a UTF8-string into its characters (not bytes). Is there a predefined function for that?
The simple AnsiString type has split functionality provided by sysutils (syshelph.inc). I am not quite sure if that is supposed to work for the above CP_UTF8.
I want to split a UTF8-string into its characters (not bytes). Is there a predefined function for that?
Use the enumerator in unit LazUnicode, which steps from code point to code point. Something like this (note that "ch" is a string, not a char!):
I want to compare 2 UTF8-strings, character by character (not byte by byte).
There is a UTF8-string compare function in unit LazUTF8: UTF8CompareStr(s1, s2).
Is there a predefined function which returns the n'th character (not byte) of a UTF8-string?
You probably mean UTF8Pos() (in unit LazUTF8):
I want to compare 2 UTF8-strings, character by character (not byte by byte). Therefore I want to iterate a loop to walk through the 2 UTF8-strings, cutting each character and then compare them.
UCS4Char is a 32 bit char, UCS4String is an array of UCS4Char.
Honestly, I doubt they added something useful. They add mostly "Emoji symbols and symbol modifiers for implementing skin tone diversity" (from their web from 2015).
Well, you gave the answer yourself: "skin tone diversity"
So: "Why isn't there a font that contains all Unicode glyphs?", because that's been technically impossible since 2001.
/usr/share/lazarus/1.8.4/lcl/units/x86_64-linux/wsmenus.o: In function »REGISTERMENUITEM«:
/home/mattias/tmp/lazarus-project1.8.4/lazarus-project_build/usr/share/lazarus/1.8.4/lcl//widgetset/wsmenus.pp:221: warning: undefined reference to »WSRegisterMenuItem«
/usr/share/lazarus/1.8.4/lcl/units/x86_64-linux/wsmenus.o: In function »REGISTERMENU«:
/home/mattias/tmp/lazarus-project1.8.4/lazarus-project_build/usr/share/lazarus/1.8.4/lcl//widgetset/wsmenus.pp:232: warning: undefined reference to »WSRegisterMenu«
The strange thing is that I don't have a folder /home/mattias/ and never had one.
@ There are new chars added to the unicode standard every year.
Honestly, I doubt they added something useful. They add mostly "Emoji symbols and symbol modifiers for implementing skin tone diversity" (from their web from 2015).
Strange is, that I don't have a folder /home/mattias/ and never had.
But Mattias has....
Here: https://wiki.freepascal.org/UTF8_Tools is the wiki page for UTF8Tools mentioned by Bart. There's a download link at the bottom.
How up to date is it?
It seems to have a hardcoded datafile, and the file date is from 2008?
There are new chars added to the unicode standard every year.
The additions of the most recent Unicode version (namely 13.0 from March 2020) are listed here (https://unicode.org/versions/Unicode13.0.0/).
Yeah - Emojis are fighting for women's liberation!
UCS4Char is a 32 bit char, UCS4String is an array of UCS4Char.
This sounds interesting, because it seems not to need Unit LazUTF8. But I could not get it to work. Your link says that a widestring manager is required. After searching on Google I found that on Linux "uses cwstring" should do the job. For Windows, all I found in a reasonable time was that "uses LazUTF8" would automatically include a widestring manager. But it did not work on either OS (FPC 3.0.4). Here is my demo:
function UnicodeStringToUCS4String(const s: UnicodeString): UCS4String;
The function converts a Unicode string to a UCS-4 encoded string.
https://www.freepascal.org/docs-html/rtl/system/unicodestringtoucs4string.html
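To take the source file encoding out of the picture, here is a minimal sketch that builds the UnicodeString from explicit code unit values instead of a literal with umlauts. The program name and the chosen test characters (U+00E4 and U+20AC) are my own assumptions, not taken from the thread. It only illustrates that UnicodeStringToUCS4String() yields one 32-bit element per code point plus a terminating 0.

```pascal
program ucs4demo;
{$mode objfpc}{$H+}

var
  s: UnicodeString;
  z: UCS4String;
  i: Integer;
begin
  // Build the string from explicit UTF-16 code units, so the result
  // does not depend on how this source file is encoded.
  s := 'AB ';
  s := s + WideChar($00E4) + WideChar($20AC);   // a-umlaut and euro sign

  z := UnicodeStringToUCS4String(s);

  WriteLn('len(s)=', Length(s));
  WriteLn('len(z)=', Length(z));   // one more than len(s): trailing 0

  // UCS4String is a zero-based dynamic array of UCS4Char.
  for i := 0 to High(z) do
    Write(HexStr(LongInt(z[i]), 8), ' ');
  WriteLn;
end.
```

If the conversion behaves as described in this thread, every element holds a whole code point (for example 000000E4 and 000020AC) and the last element is 00000000.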
Here: https://wiki.freepascal.org/UTF8_Tools is the wiki page for UTF8Tools mentioned by Bart. There's a download link at the bottom.
Thank you for that link. I had a look at the sources and found that they all require Unit LazUTF8, which I want to avoid if possible (see reply #12) for such a primitive usage as I have now. But I made a note of this link for possible future needs.
When the installer is built, the ppu will contain the path to the unit source, as it is on the machine that was used to build the installer. So unless you recompile those units, the compiler thinks that is the path where the unit is (or used to be).
Thanks for this info, I did not know that before.
I'll show you a very simple way to separate the UTF8 chars of a string into a StringList.
for i := 1 to Utf8Length(MyUTF8) do begin
  UChar := UTF8Copy(MyUTF8, i, 1);
The output shows (both for Windows and Linux):
len(s)=21
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50
len(z)=22
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50 0
Do you have an idea why it does not work?
UCS4String always has a 'hard' #0 termination (explicit terminating #0).
{$mode objfpc}{$H+}
var
  s: UnicodeString;
  z: UCS4String;
  i: integer;
begin
  s:='AB äöüß ÄÖÜ 12';
  z:= UnicodeStringToUCS4String(s);
The output shows (both for Windows and Linux):
len(s)=21
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50
len(z)=22
65 66 32 195 164 195 182 195 188 195 159 32 195 132 195 150 195 156 32 49 50 0
Do you have an idea why it does not work?
Does anybody know, why above
s:=UnicodeString(s0);
does not convert from UTF8 to Unicode?
String literals have always been a bit confusing to me in FPC.
Then I changed my code to:
...
var
  s0: UTF8String;
  s: UnicodeString;
begin
  s0:='AB äöüß ÄÖÜ 12';   // should store as UTF8
  s:=UnicodeString(s0);   // should convert to Unicode to make UnicodeStringToUCS4String() happy
...
but the result again was:
len(s)=21
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32
len(z)=22
41 42 20 C3 A4 C3 B6 C3 BC C3 9F 20 C3 84 C3 96 C3 9C 20 31 32 00
Does anybody know, why above
s:=UnicodeString(s0);
does not convert from UTF8 to Unicode?
14
15
41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32
00000041 00000042 00000020 000000E4 000000F6 000000FC 000000DF 00000020 000000C4 000000D6 000000DC 00000020 00000031 00000032 00000000
u:='AB äöüß ÄÖÜ 12';
41 42 20 E4 F6 FC DF 20 C4 D6 DC 20 31 32
Which is strange because the middle 4 characters (the first after the space) are E4 F6 FC DF.
No, Martin is right:
procedure TForm1.Button1Click(Sender: TObject);
const
  s1 = #$C3#$a4;
  s2 = 'a' + #$CC#$88;
begin
  ShowMessage('Is ' + s1 + ' the same character as ' + s2 + '?');
end;
But that is about composition.
Take a look at this:
https://wiki.freepascal.org/Unicode_Support_in_Lazarus#String_Literals
According to that page, assigning your literal string to a unicodestring will fail (without the {$codepage utf8}).
From your link, using {$codepage utf8} should be correct, as Martin_fr recommended too. I had already tried it (see reply #26), but afterwards string 's' contained a Windows charset (ANSI 1252?), although I was on Linux at the time.
What version of FPC are you using??
Using const s : String = 'AB äöüß ÄÖÜ 12'; it might work better.
or even var s : String = 'AB äöüß ÄÖÜ 12';
Why this? Just curious...
const UTF8PROC_NULLTERM = 1 shl 0;
It has no impact on code generation, but it looks silly.
Why did you not try with a supported version, like 3.2.0 instead of the unsupported 3.0.4?
I had tried with a 3.2.0 beta, but it made no difference, so I didn't mention it.
My enumerator should not require LazUTF8
I have had a look at both of your links (reply #16), but didn't understand much of what I saw and didn't find anything that looked like it could help me (I don't know what an "enumerator" is and how it could solve my problem). I thought both links were about updating old datafiles.
Hello PascalDragon, I tried your demo (reply #28) and got the same results as you. But to be honest, I don't understand what you want to show me with that. Both 's' and 'z' show bytes, not characters, and they are in a Windows charset (ANSI 1252?), although I am currently on Linux. What I want is to split a UTF8-string into its UTF8-characters.
Because we could not find out, why system.UnicodeStringToUCS4String() did not work and because I did not want to accept so many problems and disadvantages of Unit LazUTF8 (see reply #12) - for such a primitive usage which I have now - I wrote the 3 very simple UTF8-functions which I only needed:
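The poster's three functions are not reproduced in this section. As a rough idea of what such helpers might look like, here is a minimal sketch that does not depend on LazUTF8. All names (Utf8SeqLen, Utf8CodePointCount, Utf8CodePointAt) are hypothetical, and the code assumes well-formed UTF-8: it does not validate continuation bytes and treats any unexpected byte as a 1-byte sequence.

```pascal
program utf8split;
{$mode objfpc}{$H+}

// Byte length of a UTF-8 sequence, derived from its lead byte.
// Invalid lead bytes count as length 1 so that callers always advance.
function Utf8SeqLen(Lead: Byte): Integer;
begin
  if Lead < $80 then Result := 1                    // 0xxxxxxx (ASCII)
  else if (Lead and $E0) = $C0 then Result := 2     // 110xxxxx
  else if (Lead and $F0) = $E0 then Result := 3     // 1110xxxx
  else if (Lead and $F8) = $F0 then Result := 4     // 11110xxx
  else Result := 1;                                 // stray continuation byte
end;

// Number of code points (not bytes) in S.
function Utf8CodePointCount(const S: RawByteString): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(S) do
  begin
    Inc(i, Utf8SeqLen(Ord(S[i])));
    Inc(Result);
  end;
end;

// The N-th code point (1-based) as a byte string; '' if N is out of range.
function Utf8CodePointAt(const S: RawByteString; N: Integer): RawByteString;
var
  i, k: Integer;
begin
  Result := '';
  i := 1;
  k := 1;
  while i <= Length(S) do
  begin
    if k = N then
    begin
      Result := Copy(S, i, Utf8SeqLen(Ord(S[i])));
      Exit;
    end;
    Inc(i, Utf8SeqLen(Ord(S[i])));
    Inc(k);
  end;
end;

var
  s: String;
  n: Integer;
begin
  // 'AB', space, a-umlaut, o-umlaut, space, '12', written as raw UTF-8
  // bytes so the example does not depend on the source file encoding.
  s := 'AB '#$C3#$A4#$C3#$B6' 12';
  WriteLn('bytes=', Length(s), ' code points=', Utf8CodePointCount(s));
  for n := 1 to Utf8CodePointCount(s) do
    WriteLn(n, ': ', Length(Utf8CodePointAt(s, n)), ' byte(s)');
end.
```

Comparing two strings "character by character" can then be done by walking both with Utf8CodePointAt, although a single pass with two byte indexes would avoid rescanning from the start on every call. Note that this works on code points, so a precomposed character and its decomposed form still compare as different.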
Hello PascalDragon, I tried your demo (reply #28) and got the same results as you. But to be honest, I don't understand what you want to show me with that. Both 's' and 'z' show Bytes, not characters, which are in Windows-charset (Ansi 1252?), although I am currently on Linux. What I want is to split an UTF8-string into it's UTF8-characters.
If you look at the assembly code you'll see that the string itself is encoded as UTF-8 and it's already correctly converted to UTF-16 upon the assignment. In the case of your specific example nothing obvious happens, because all characters you used are code points <= $FF.
Because we could not find out, why system.UnicodeStringToUCS4String() did not work and because I did not want to accept so many problems and disadvantages of Unit LazUTF8 (see reply #12) - for such a primitive usage which I have now - I wrote the 3 very simple UTF8-functions which I only needed:
No, it's not UnicodeSTringToUCS4String that did not work, it worked absolutely correctly with the input it got. What did not work is your input string. It was not correctly converted to a valid UTF-16 string and that is the problem.
I'm not familiar with reading assembly code. But you wrote in reply #28 "I have stored it as UTF-8 with BOM". That makes me guess that you wanted to add an attachment, but there is none.
I believe that you are right, that UnicodeSTringToUCS4String() "only" got the wrong input. But you see, how many experiments we have tried, to solve the problem, but nobody found a solution which worked.
With Unit LazUTF8 I faced a lot of problems and disadvantages in the past. Some examples I remember immediately:
- on Windows in console programs adding Unit LazUTF8 changes the charset of the filenames reported by sysutils.FindFirst() and FindNext() and the charset of the results of ParamStr() and the results of readln(). Without Unit LazUTF8 they return Windows-Charset (Ansi 1252?) / with Unit LazUTF8 they return UTF8. This difference does not make life easy.
- over decades I have written a couple of common libraries for console programs, which partly have problems with these differing charsets, so I get wrong results with Unit LazUTF8 because of the changed charset
- for a couple of programs I still use FPC 2.6.4 (with the same common libraries), but in FPC 2.6.4 the above functions never return UTF8 on Windows
- Windows-charset generally is much easier than UTF8 (as we see now), because each char is only 1 byte long
- in some cases (but not always) I had problems, that writeln() for an 'ansistring' showed damaged characters for Ä Ö Ü ä ö ü ß after I added Unit LazUTF8.
And there were more issues, which I just don't remember right now. So I want to avoid Unit LazUTF8, especially in console programs, wherever possible.
To be honest I do not fully understand your description, but you seem to have issues with usage of LazUTF8. I cannot imagine that LazUTF8 is the source of such errors; I use it in many projects and don't have any problems with it. But the one thing that I learned while trying to understand many issues with UTF8 is that one conversion at the wrong place can cause unrecoverable errors. So, please check your code and make sure that every string conversion is in place and needed. And don't expect programs written for fpc 3.x to work correctly with 2.6.4 or older, because v3 introduces a massive change in string handling.
I have attached an example LCL project that shows the important point:
- either your file needs to be stored as “UTF-8 with BOM” (you need to do a right click in the editor, go to “File Settings” (or similar, I'm using German) and then “Character Encoding”, confirm the dialog to change the file)
- or you need to store it as “UTF-8” and add {$codepage utf8}
In both cases the constant string data will be stored as UTF-16 if you simply assign it to a UnicodeString (or as UTF-8 data if you assign it to a String or UTF8String).
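The second option can be sketched as follows. This is only an illustration and assumes the file is really saved as UTF-8 (without BOM); the program name is my own. With the directive in place, the literal should arrive in the UnicodeString as UTF-16 code units, which for this particular text means 14 elements, in line with the "14" reported earlier in the thread.

```pascal
program literal_demo;
{$mode objfpc}{$H+}
{$codepage utf8}   // tells the compiler that this source file is UTF-8

var
  u: UnicodeString;
  i: Integer;
begin
  // The literal contains non-ASCII characters; the compiler converts it
  // from the declared source code page to UTF-16 at compile time.
  u := 'AB äöüß ÄÖÜ 12';

  WriteLn(Length(u));              // number of UTF-16 code units

  // Print each code unit in hex; all characters used here are <= $FF.
  for i := 1 to Length(u) do
    Write(HexStr(Ord(u[i]), 2), ' ');
  WriteLn;
end.
```

If the output instead shows 21 elements with pairs such as C3 A4, the compiler did not treat the literal as UTF-8, which is the situation described in the earlier posts.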
I am attaching a demo for FindFirst which finds a file "testäöü.txt", which should cause issues according to your description. For me, it does not. The filename is displayed in the console correctly with fpc 3.2 as well as 2.6.4 (after conversion). (NOTE: my Windows cp is 1252. If yours is different the filename may appear differently than shown here.) And it does not make a difference whether LazUTF8 is linked in or not (activate/deactivate the define USE_LAZUTF8).
And don't expect programs written for fpc 3.x to work correctly with 2.6.4 or older, because v3 introduces a massive change in string handling.
Oh, there we misunderstood: I do not try to run programs with FPC 2.6.4, which have been written for FPC 3.x. What I have is a couple of "common libraries", which exist only once, and I use these both for all new programs with FPC 3.x and for a couple of older programs, which I still compile with FPC 2.6.4, because they have originally been written for 2.6.4, and (until now) I did not update them to 3.x.
I found the reason: Unit LConvEncoding is used, which contains:
uses SysUtils, Classes, dos, LazUTF8
so LazUTF8 was always included and you saw no difference.
After deactivating units LazUTF8 and LConvEncoding I had exactly the difference which I was talking about. I added some lines to make the difference clearer:
[...]
Yes thanks, this shows me the difference. But these changes are not introduced by LazUTF8 but by unit FPCAdds in order to adapt to the new FPC strings. A problem for you could be that this unit is "used" by some other units, independently of LazUTF8: Do a "Find in files" over the lcl directory of your Lazarus installation and you'll find it in graphics, intfgraphics, imglist, lresources. OK - your interest is in console programs, so you probably will not use them.
You should update to FPC 3.2.
What information do you have that tells you this will improve any of the problems in this topic?
I had tried with a 3.2.0 beta, but it made no difference...
Yes thanks, this shows me the difference. But these changes are not introduced by LazUTF8 but by unit FPCAdds in order to adapt to the new FPC strings. A problem could be for you that this unit is "used" by some other units, independently of LazUTF8: Do a "Find in files" over the lcl directory of your Lazarus installation and you'll find it in graphics, intfgraphics, imglist, lresources. OK - your interest is in console programs, so you probably will not use them.
You are right, I wrote only a handful of GUI programs, all the rest are console programs.
What FPCAdds does is it sets codepage conversion code in its initialization section - see also https://wiki.freepascal.org/Unicode_Support_in_Lazarus#Technical_implementation. And that gives me an idea: Why not call them at the beginning of your project with the default Windows ANSI codepage, and you'll avoid UTF8-conversion even if LazUTF8 is in "uses". As shown below this works in my console program (tested Win 10)
This sounds very interesting! If that works it would be great! Thanks a lot for that idea.
And if you had read https://wiki.freepascal.org/FPC_Unicode_support, you would be less surprised about unicode-settings:
system.defaultSystemCodePage = cp_utf8
system.defaultFileSystemCodePage = cp_utf8
system.defaultRTLFileSystemCodePage = cp_utf8
Over the years I have read a lot of wikis and other documentation about UTF8 / Unicode / codepages etc. in FPC and LCL. But I have forgotten several parts again over time (because I need all this very seldom), and several parts I did not really understand, because all this stuff about codepages / Unicode / UTF8 / codepoints / collations etc. is not my world.
What FPCAdds does is it sets codepage conversion code in its initialization section - see also https://wiki.freepascal.org/Unicode_Support_in_Lazarus#Technical_implementation. And that gives me an idea: Why not call them at the beginning of your project with the default Windows ANSI codepage, and you'll avoid UTF8-conversion even if LazUTF8 is in "uses". As shown below this works in my console program (tested Win 10)
With Unit LazUTF8 I faced a lot of problems and disadvantages in the past:
a) on Windows in console programs adding Unit LazUTF8 changes the charset of the filenames reported by sysutils.FindFirst() and FindNext(). Without Unit LazUTF8 they return Windows-Charset (ANSI 1252) / with Unit LazUTF8 they return UTF8. This difference does not make life easy.
b) ditto for all other procedures and functions which deal with filenames or folders
c) ditto for the charset of the results of ParamStr()
d) ditto for the results of readln()
e) over decades I have written a couple of common libraries for console programs, which partly have problems with these differing charsets, so then I get wrong results with Unit LazUTF8 because of the changed charset
f) for a couple of older programs I still use FPC 2.6.4 (with the same common libraries), but in FPC 2.6.4 the above functions never return UTF8 on Windows
g) in some cases (but not always) I had problems, that writeln() for an 'ansistring' showed damaged characters for Ä Ö Ü ä ö ü ß when I added Unit LazUTF8.
Usage: start this program in a Windows 7 Console with a command line parameter like "äöü"
FPC-Version: 3.3.1
Unit LazUTF8: YES
1) WITHOUT call of set_charset_WIN() => Results of ParamStr():
- string[s255] => C3 A4 C3 B6 C3 BC len=6
- ansistring => C3 A4 C3 B6 C3 BC len=6
- ansi(1252) => E4 F6 FC len=3
2) WITH call of set_charset_WIN() => Results of ParamStr():
- string[s255] => E4 F6 FC len=3
- ansistring => E4 F6 FC len=3
- ansi(1252) => E4 F6 FC len=3
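The body of set_charset_WIN() is not shown in this section. A minimal sketch of what such a helper might do, based on the idea of overriding the code page settings at program start, could look like this. The procedure names SetCharsetWin and ShowCodePages are hypothetical, and the fixed value 1252 is an assumption taken from the results above; the three SetMultiByte... procedures and the Default...CodePage variables are from the System unit of FPC 3.x.

```pascal
program cp_demo;
{$mode objfpc}{$H+}

// Hypothetical helper: point the RTL's default code pages at a given
// ANSI code page instead of CP_UTF8 (assumption: 1252 is wanted).
procedure SetCharsetWin(CodePage: TSystemCodePage);
begin
  SetMultiByteConversionCodePage(CodePage);      // DefaultSystemCodePage
  SetMultiByteFileSystemCodePage(CodePage);      // DefaultFileSystemCodePage
  SetMultiByteRTLFileSystemCodePage(CodePage);   // DefaultRTLFileSystemCodePage
end;

// Print the three settings discussed in the thread.
procedure ShowCodePages(const Title: String);
begin
  WriteLn(Title);
  WriteLn('  DefaultSystemCodePage        = ', DefaultSystemCodePage);
  WriteLn('  DefaultFileSystemCodePage    = ', DefaultFileSystemCodePage);
  WriteLn('  DefaultRTLFileSystemCodePage = ', DefaultRTLFileSystemCodePage);
end;

begin
  ShowCodePages('before:');
  SetCharsetWin(1252);
  ShowCodePages('after:');
end.
```

In a real program the call would have to come after the initialization sections of units like FPCAdds have run, for example as the first statement of the main block, otherwise those units would set the code pages back to UTF-8.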