[SOLVED] How to split UTF8-strings into its characters (not bytes)?

wp


« Reply #45 on: September 26, 2020, 07:13:06 pm »

Quote from: Hartmut on September 26, 2020, 06:09:48 pm

Quote from: wp on September 26, 2020, 12:13:28 pm
And don't expect programs written for fpc 3.x to work correctly with 2.6.4 or older, because v3 introduces a massive change in string handling.

Oh, there was a misunderstanding: I am not trying to run programs written for FPC 3.x with FPC 2.6.4. What I have is a set of "common libraries" that exist only once. I use them both for all new programs with FPC 3.x and for a few older programs that I still compile with FPC 2.6.4, because they were originally written for 2.6.4 and I have not (yet) updated them to 3.x.

FPC changed string handling massively with version 3.0.x. Sticking with a library that has not been updated for FPC 3 invites trouble with strings. You need to take the time to update your shared units; otherwise the only real recommendation is to stay on 2.6.4.

Quote from: Hartmut on September 26, 2020, 06:09:48 pm

I found the reason: unit LConvEncoding is used, and it contains:
Code: Pascal
uses SysUtils, Classes, dos, LazUTF8
so LazUTF8 was always included and you saw no difference.

After deactivating the units LazUTF8 and LConvEncoding I got exactly the difference I was talking about. I added some lines to make the difference clearer:
[...]

Yes thanks, this shows me the difference. But these changes are not introduced by LazUTF8 but by unit FPCAdds in order to adapt to the new FPC strings. A problem could be for you that this unit is "used" by some other units, independently of LazUTF8: Do a "Find in files" over the lcl directory of your Lazarus installation and you'll find it in graphics, intfgraphics, imglist, lresources. OK - your interest is in console programs, so you probably will not use them.

What FPCAdds does is set the codepages used for string conversion in its initialization section - see also https://wiki.freepascal.org/Unicode_Support_in_Lazarus#Technical_implementation. And that gives me an idea: why not call the same routines at the beginning of your project, but with the default Windows ANSI codepage? That way you avoid the UTF-8 conversion even if LazUTF8 is in "uses". As shown below, this works in my console program (tested on Win 10):

Code: Pascal

program Project1;
 
{$DEFINE USE_LazUTF8}
 
uses
  Windows,   // for: GetConsoleCP, GetACP
  Sysutils //, FPCAdds
  {$IFDEF USE_LAZUTF8}
  , LazUTF8, LConvEncoding
  {$ENDIF}
  ;
 
var
  consCP: Integer;
  winCP: Integer;
  SR: TSearchRec;
  i: Integer;
 
begin
  consCP := GetConsoleCP;
  winCP := GetACP;
 
  SetMultiByteConversionCodePage(winCP);
  SetMultiByteFileSystemCodePage(winCP);
  SetMultiByteRTLFileSystemCodePage(winCP);
 
  WriteLn('The codepage of the console is ', consCP);
  WriteLn('System codepage is ', winCP);
  WriteLn;
 
  if FindFirst('test*.*', faAnyFile, SR) = 0 then    // to catch file 'testäöü.txt'
  begin
    repeat
      WriteLn('Filename as returned by FindFirst: ', SR.Name);
      for i:=1 to length(SR.Name) do  write(HexStr(ord(SR.Name[i]),2), ' ');
      writeln('len=', length(SR.Name));
      {$IF FPC_FullVersion < 30000}
      WriteLn('Filename after codepage conversion: ', ConvertEncoding(SR.Name, 'cp'+IntToStr(winCP), 'cp'+IntToStr(consCP)));
      {$IFEND}
    until FindNext(SR) <> 0;
    FindClose(SR);
  end;
 
  {$IFDEF USE_LAZUTF8}
  WriteLn(UTF8Length('äöü'));
  {$ENDIF}
 
  ReadLn;
end.       

« Last Edit: September 26, 2020, 07:15:50 pm by wp »


Hartmut


« Reply #46 on: September 27, 2020, 11:00:18 am »

Quote from: nanobit on September 26, 2020, 06:34:44 pm

You should update to FPC 3.2.

What information do you have that tells you this will improve any of the problems in this topic?
Have you updated to FPC 3.2? If yes, did you try even one example from this topic with it? Did it work better?
And did you read what I wrote:

Quote from: Hartmut on September 26, 2020, 09:10:36 am

I had tried with a 3.2.0 beta, but it made no difference...

Quote from: wp on September 26, 2020, 07:13:06 pm

Yes thanks, this shows me the difference. But these changes are not introduced by LazUTF8 but by unit FPCAdds in order to adapt to the new FPC strings. A problem could be for you that this unit is "used" by some other units, independently of LazUTF8: Do a "Find in files" over the lcl directory of your Lazarus installation and you'll find it in graphics, intfgraphics, imglist, lresources. OK - your interest is in console programs, so you probably will not use them.

You are right, I wrote only a handful of GUI programs, all the rest are console programs.

Quote

What FPCAdds does is set the codepages used for string conversion in its initialization section - see also https://wiki.freepascal.org/Unicode_Support_in_Lazarus#Technical_implementation. And that gives me an idea: why not call the same routines at the beginning of your project, but with the default Windows ANSI codepage? That way you avoid the UTF-8 conversion even if LazUTF8 is in "uses". As shown below, this works in my console program (tested on Win 10)

This sounds very interesting! If that works it would be great! Thanks a lot for that idea.
Please give me some time for research: you use a couple of functions I have never heard of, so I need to find out what they do, and I want to run some tests to see which problems with unit LazUTF8 then disappear and which do not. This may take a few days. I will report back afterwards.


nanobit


« Reply #47 on: September 27, 2020, 12:37:16 pm »

I did not say FPC 3.2 alone would solve your specific problem,
but FPC 3.2 has some Unicode-related RTL improvements over FPC 3.0.4.
Generally, testers prefer to start with newer (bug-fixed) versions.

And if you had read https://wiki.freepascal.org/FPC_Unicode_support,
you would be less surprised about unicode-settings:
system.defaultSystemCodePage = cp_utf8
system.defaultFileSystemCodePage = cp_utf8
system.defaultRTLFileSystemCodePage = cp_utf8
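
To see which values are actually in effect in a given program, the three System variables listed above can simply be printed. This is only a minimal sketch, assuming FPC 3.0 or later (the variables do not exist in 2.6.4); the program name is made up for illustration:

```pascal
program ShowCodePages;

{$mode objfpc}{$H+}

// Uncomment to compare the values once LazUTF8 is part of the program
// (assumes the LazUtils package is available to the project):
// uses LazUTF8;

begin
  // All three are TSystemCodePage variables declared in unit System.
  WriteLn('DefaultSystemCodePage        = ', DefaultSystemCodePage);
  WriteLn('DefaultFileSystemCodePage    = ', DefaultFileSystemCodePage);
  WriteLn('DefaultRTLFileSystemCodePage = ', DefaultRTLFileSystemCodePage);
  // For comparison: the numeric value of the cp_utf8 constant.
  WriteLn('CP_UTF8                      = ', CP_UTF8);
end.
```

Running it once with and once without the "uses" line should make the difference discussed in this topic visible as plain numbers.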


Hartmut


« Reply #48 on: September 28, 2020, 12:28:27 pm »

Quote from: nanobit on September 27, 2020, 12:37:16 pm

And if you had read https://wiki.freepascal.org/FPC_Unicode_support,
you would be less surprised about unicode-settings:
system.defaultSystemCodePage = cp_utf8
system.defaultFileSystemCodePage = cp_utf8
system.defaultRTLFileSystemCodePage = cp_utf8

Over the years I have read a lot of wikis and other documentation about UTF8 / Unicode / codepages etc. in FPC and LCL. But I have forgotten several parts again over time (because I need all this very seldom), and several parts I never really understood, because all this stuff about codepages / Unicode / UTF8 / codepoints / collations etc. is not my world.

So I am very happy that wp found out and explained where my problems with unit LazUTF8 originally come from (the initialization part of unit FPCAdds), and about his idea of how this can be avoided very easily :-) I'm testing his suggestion and so far it looks promising.


nanobit


« Reply #49 on: September 28, 2020, 12:53:32 pm »

OK, but don't forget: if your folder contains a mixture of German (umlaut)
and Cyrillic filenames, you still need cp_utf8 instead of winCP.
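
The reason is that a single ANSI codepage such as 1252 has no slots for Cyrillic letters, so they are lost on conversion. A rough sketch of how this could be checked, assuming FPC 3.x; the helper SurvivesCP1252 and the type name TAnsi1252 are invented for this example:

```pascal
program CodePageCoverage;

{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}cwstring,{$ENDIF}  // widestring manager needed outside Windows
  SysUtils;

type
  TAnsi1252 = type AnsiString(1252);  // string type fixed to Windows-1252

// True if s can be stored in a Windows-1252 string and read back unchanged.
function SurvivesCP1252(const s: UnicodeString): Boolean;
var
  a: TAnsi1252;
  back: UnicodeString;
begin
  a := s;       // implicit conversion UTF-16 -> CP1252 (compiler warns about data loss)
  back := a;    // implicit conversion CP1252 -> UTF-16
  Result := back = s;
end;

var
  german, cyrillic: UnicodeString;
begin
  // Built from code points so the example does not depend on the
  // encoding of the source file.
  german := WideChar($00E4);             // a-umlaut
  german := german + WideChar($00F6);    // o-umlaut
  cyrillic := WideChar($0436);           // Cyrillic letter zhe
  cyrillic := cyrillic + WideChar($044F); // Cyrillic letter ya

  WriteLn('German umlauts survive CP1252:   ', SurvivesCP1252(german));
  WriteLn('Cyrillic letters survive CP1252: ', SurvivesCP1252(cyrillic));
end.
```

The same reasoning applies to filenames: once they are forced into winCP, any character outside that codepage can no longer be told apart.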


Hartmut


« Reply #50 on: October 02, 2020, 04:11:32 pm »

Quote from: wp on September 26, 2020, 07:13:06 pm

What FPCAdds does is set the codepages used for string conversion in its initialization section - see also https://wiki.freepascal.org/Unicode_Support_in_Lazarus#Technical_implementation. And that gives me an idea: why not call the same routines at the beginning of your project, but with the default Windows ANSI codepage? That way you avoid the UTF-8 conversion even if LazUTF8 is in "uses". As shown below, this works in my console program (tested on Win 10)

Hello wp,
now I have made extensive tests with your suggestion (reply #45) and I have good news!

Quote

With Unit LazUTF8 I faced a lot of problems and disadvantages in the past:
a) On Windows, in console programs, adding unit LazUTF8 changes the charset of the filenames reported by sysutils.FindFirst() and FindNext(). Without unit LazUTF8 they return the Windows charset (ANSI 1252); with unit LazUTF8 they return UTF8. This difference makes life difficult.
b) ditto for all other procedures and functions that deal with filenames or folders
c) ditto for the charset of the results of ParamStr()
d) ditto for the results of readln()
e) over the decades I have written a number of common libraries for console programs, some of which have problems with these differing charsets, so I get wrong results with unit LazUTF8 because of the changed charset
f) for a few older programs I still use FPC 2.6.4 (with the same common libraries), but in FPC 2.6.4 the above functions never return UTF8 on Windows
g) in some cases (but not always) writeln() for an 'ansistring' showed damaged characters for Ä Ö Ü ä ö ü ß when I added unit LazUTF8.

I tested all of the above problems with FPC 3.0.4 and 3.3.1 beta on Windows 7, and with one exception all of them are solved by your suggestion. Many, many thanks for that great and very easy-to-use idea!

The exception is c), concerning the results of ParamStr(). I created a small demo for that (attached as a compilable project):

Code: Pascal

procedure set_charset_WIN;
   {switches 3 codepages on Windows to ANSI-1252, which have been changed before
    to UTF8, if Unit 'LazUTF8' is included}
   var winCP: UINT; {dword}
   begin
   winCP:=windows.GetACP; {gets System codepage}
   SetMultiByteConversionCodePage(winCP);
   SetMultiByteFileSystemCodePage(winCP);
   SetMultiByteRTLFileSystemCodePage(winCP);
   end;
 
procedure Test_ParamStr;
   {shows the charset returned by system.ParamStr() OR objpas.ParamStr(),
    depending on the current compiler "$mode".
    Usage: start the program in a Windows 7 Console with a command line
    parameter like "äöü".
    If then result = "E4 F6 FC len=3" => WINDOWS-charset (ANSI 1252) /
    if then result = "C3 A4 C3 B6 C3 BC len=6" => UTF8-charset}
   type ansi_1252 = type AnsiString(1252); {Windows-charset}
   var sa: ansistring;
       sw: ansi_1252;
       ss: string[255];
       i: integer;
   begin
   writeln('Results of ParamStr():');
   ss:=ParamStr(1);                  // type shortstring:
   write(' - string[s255] => ');
   for i:=1 to length(ss) do  write(HexStr(ord(ss[i]),2), ' ');
   writeln('len=', length(ss));
 
   sa:=ParamStr(1);                  // type ansistring:
   write(' - ansistring   => ');
   for i:=1 to length(sa) do  write(HexStr(ord(sa[i]),2), ' ');
   writeln('len=', length(sa));
 
   sw:=ParamStr(1);                  // type AnsiString(1252):
   write(' - ansi(1252)   => ');
   for i:=1 to length(sw) do  write(HexStr(ord(sw[i]),2), ' ');
   writeln('len=', length(sw));
   end; 
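
The hex dump used in Test_ParamStr can also be folded into one helper that additionally reports the code page stored in the string, which may make the WIN and UTF8 cases easier to tell apart. This is only a sketch, assuming FPC 3.x; DumpBytes is a hypothetical helper name and not part of the attached project:

```pascal
program DumpStringBytes;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Returns the bytes of s in hex, followed by its length and the code page
// recorded in the string header. Taking a RawByteString parameter means
// the bytes are passed through without any conversion on the call.
function DumpBytes(const s: RawByteString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
    Result := Result + HexStr(Ord(s[i]), 2) + ' ';
  Result := Result + 'len=' + IntToStr(Length(s)) +
            ' cp=' + IntToStr(StringCodePage(s));
end;

var
  i: Integer;
begin
  // Dump every command line parameter, including the program name (index 0).
  for i := 0 to ParamCount do
    WriteLn('ParamStr(', i, ') => ', DumpBytes(ParamStr(i)));
end.
```

Started with a parameter like "äöü", the byte patterns can then be compared with the two cases described in the comment of Test_ParamStr.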

Info: {$mode objfpc} causes ParamStr() of unit 'objpas' to be used; {$mode TP} causes ParamStr() of unit 'system' to be used.
The results are (both in FPC 3.0.4 and 3.3.1 beta):
          Unit      call of             charset   charset
{$mode}   LazUTF8   set_charset_WIN()   'ss+sa'   'sw'
---------------------------------------------------------
objfpc    without   no                  WIN       WIN
  ""      without   yes                 WIN       WIN
  ""      with      no                  UTF8      WIN
  ""      with      yes                 UTF8      UTF8
TP        without   no                  WIN       WIN
  ""      without   yes                 WIN       WIN
  ""      with      no                  UTF8      UTF8
  ""      with      yes                 UTF8      UTF8

We see that calling set_charset_WIN() unfortunately never changes the result from UTF8 to WIN. It makes a difference in only one rare case, with type 'AnsiString(1252)', but 1) there it changes the result to UTF8, which doesn't help me, and 2) I have never used type 'AnsiString(1252)' in combination with ParamStr(), so this case is not of interest.

Do you (or someone else) have an idea how the charset returned by ParamStr() can be switched from UTF8 to WIN when unit LazUTF8 is included (without breaking all the other solved cases above)? I am looking for a "global" solution like the procedure set_charset_WIN() above, which only has to be called once at the start of an affected program. Of course I'm not keen on adapting every single use of ParamStr() in my programs and libraries individually (more than 200).

Thanks to all for your help!

Test_LazUTF8.zip (1.79 kB)


wp


« Reply #51 on: October 04, 2020, 12:45:06 am »

I had the idea to compile your demo program with today's FPC 3.3.1 -- and here is the output (mode ObjFPC):

Code:

Usage: start this program in a Windows 7 Console with a command line parameter like "├ñ├Â├╝"
FPC-Version:  3.3.1
Unit LazUTF8: YES

1) WITHOUT call of set_charset_WIN() => Results of ParamStr():
 - string[s255] => C3 A4 C3 B6 C3 BC len=6
 - ansistring   => C3 A4 C3 B6 C3 BC len=6
 - ansi(1252)   => E4 F6 FC len=3

2) WITH call of set_charset_WIN() => Results of ParamStr():
 - string[s255] => E4 F6 FC len=3
 - ansistring   => E4 F6 FC len=3
 - ansi(1252)   => E4 F6 FC len=3

--> Working! (but not for mode TP)

Then I also tried the FPC fixes branch, but there I get the same result as with FPC 3.2.0 (6 bytes in case (2)).

« Last Edit: October 04, 2020, 12:48:37 am by wp »


Hartmut


« Reply #52 on: October 04, 2020, 03:38:59 pm »

Hello wp, thanks a lot for your continued and valuable help. It is good news that with a current FPC 3.3.1 the problems with ParamStr() and unit LazUTF8 can also be switched off by your solution, at least in mode ObjFPC. I have added a call to set_charset_WIN() at the start of my console programs that use unit LazUTF8 (there were not many).

Unfortunately, during my tests with FPC 3.3.1 beta I stumbled over one more new UTF8 problem, and digging into it cost a lot of time: when I read data from an SQLite DB, for many years (at least since FPC 2.6.4) this data was always in UTF8, for console and GUI programs, regardless of whether unit LazUTF8 was used or not. But now (if unit LazUTF8 and set_charset_WIN() are not used) this data is returned in the Windows charset (ANSI 1252) instead!! Yet the SELECT statements, if they include characters like Ä Ö Ü ä ö ü ß, still have to be in UTF8!! How crazy!

Reading the release notes for FPC 3.2 once more, I found https://wiki.freepascal.org/User_Changes_3.2#CodePage_aware_TStringField_and_TMemoField. I do not really understand much of what is written there, and what I tried, inspired by this information, e.g. setting SQLite3Connection1.CharSet:='UTF8' (directly after the dynamic creation of that variable), did not help. But I want to look into this later and, if necessary, will open a new topic for this new problem.

« Last Edit: October 04, 2020, 03:42:05 pm by Hartmut »
