
Author Topic: Unicode won't work no matter what I do.  (Read 1022 times)

lucamar

  • Hero Member
  • *****
  • Posts: 734
Re: Unicode won't work no matter what I do.
« Reply #15 on: January 14, 2019, 10:55:43 pm »
Talking of "nasty" file names, I had quite a few problems with Lazarus' serialized backups: "fmain.lfm;10", "fmain.pas;23" and so on. Some routines in the RTL/FCL/LCL treat those as two names separated by a ';'. Baffled me no end until I discovered just why what I was doing didn't work as expected :)
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus 1.8.4/FPC 3.0.4 on:
(K)Ubuntu 12..16, Windows XP SP3 (Home/Prof.) and various DOS incarnations.

engkin

  • Hero Member
  • *****
  • Posts: 2194
Re: Unicode won't work no matter what I do.
« Reply #16 on: January 14, 2019, 11:52:06 pm »
Just curious, as I do not use Linux, do you guys need to add a unit like cwstring to have a WideStringManager?
WideStringManager is only needed on Windows. Why do you mix it here?
According to the docs?:
Quote
It makes no sense to use this unit on a non-POSIX system like Windows, OS/2 or DOS. Therefor it should always be enclosed with an ifdef statement:

program myprogram;

uses
  {$ifdef unix}cwstring,{$endif}
   classes, sysutils;

So, cwstring *is* for Linux, and it *does* change the WideStringManager according to its source code:
Code: Pascal
Procedure SetCWideStringManager;
...
  SetUnicodeStringManager(CWideStringManager);

Just in case:
Code: Pascal
// rtl\inc\ustrings.inc
Procedure SetUnicodeStringManager (Const New : TUnicodeStringManager);
begin
  widestringmanager:=New;
end;

Quote
what do you get for:
DefaultSystemCodePage
DefaultFileSystemCodePage
DefaultRTLFileSystemCodePage?
A user does not need to set such things, especially not on Linux. Why do you mix it here?
I am not asking him to set these variables, the question was what values do they hold.
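
For anyone who wants to answer that question on their own machine, here is a minimal sketch that only prints the three values and sets nothing. It assumes FPC 3.0 or later, where these variables are declared in the System unit. The program name is made up, and cwstring is pulled in under the ifdef as the quoted docs suggest.

```pascal
program ShowCodePages;

{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}
  SysUtils;

begin
  // Read-only report: nothing is changed, the current values are just printed.
  WriteLn('DefaultSystemCodePage:        ', DefaultSystemCodePage);
  WriteLn('DefaultFileSystemCodePage:    ', DefaultFileSystemCodePage);
  WriteLn('DefaultRTLFileSystemCodePage: ', DefaultRTLFileSystemCodePage);
end.
```

A value of 65001 means UTF-8. Anything else on a UTF-8 Linux system would be worth a closer look.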

Quote
Are you really using Lazarus: 1.6.2 and not 1.8.2 or that's just a typo?
Makes no difference. Both use the new UTF-8 system. On Linux it matters even less.
I take your word, but with the FPC version he reported in his first post the possibility of wrong settings being used to compile RTL/LCL is there.

Quote
Your testdir.zip testdir.7z do not look right on my side: Win using 7z.
Exactly, now you are on the right track.
Thank you. I had to stab in every direction. Now to the fun part:
åäö is 6 bytes in UTF8.
In his zip file, if you open it with Lazarus and switch the encoding to UTF8, you see:
Int_filename_Ã¥Ã¤Ã¶.txt  instead of Int_filename_åäö.txt
If you check, using a hex editor, how many bytes between Int_filename_ and .txt you'll find *12* bytes, not 6 bytes.
In his zip file every byte of the original UTF8 åäö was considered ANSI and converted to UTF8.
Are you following me here?

Quote
Some of your LC_* values are between quotation marks and some without, I don't know if that makes a difference?
What LC_* values?
From his first post:
Quote
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME=en_AU.utf8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
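
Since the values shown in a console may differ from what a GUI program inherits, one way to check is to print the variables from inside the program itself. This is only a sketch: the selection of variable names is mine, and it relies on SysUtils.GetEnvironmentVariable returning an empty string for an unset variable.

```pascal
program ShowLocaleEnv;

{$mode objfpc}{$H+}

uses
  SysUtils;

const
  // Hypothetical selection; add more LC_* names as needed.
  Names: array[0..3] of string = ('LC_ALL', 'LC_CTYPE', 'LANG', 'LANGUAGE');
var
  i: Integer;
  v: string;
begin
  for i := Low(Names) to High(Names) do
  begin
    v := GetEnvironmentVariable(Names[i]);
    if v = '' then
      WriteLn(Names[i], ' is unset or empty')
    else
      WriteLn(Names[i], '=', v);
  end;
end.
```

For character handling, LC_ALL takes precedence over LC_CTYPE, which takes precedence over LANG. As far as I know, the quotation marks in the output of the `locale` command only mark values that are implied by LANG or LC_ALL rather than set explicitly, so they should be harmless.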

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 3563
  • I like bugs.
Re: Unicode won't work no matter what I do.
« Reply #17 on: January 15, 2019, 12:07:37 am »
Yes, sorry, cwstring is needed for string comparison and it contains WideStringManager.
I confused things.

engkin

  • Hero Member
  • *****
  • Posts: 2194
Re: Unicode won't work no matter what I do.
« Reply #18 on: January 15, 2019, 12:20:31 am »
åäö is 6 bytes in UTF8.

In his zip file, if you open it with Lazarus and switch the encoding to UTF8, you see:
Int_filename_Ã¥Ã¤Ã¶.txt  instead of Int_filename_åäö.txt

If you check in this same file, using a hex editor, how many bytes between Int_filename_ and .txt you'll find *12* bytes, not the expected 6 bytes.

In his zip file every byte of the original UTF8 åäö was considered ANSI and converted to UTF8, again.

His system is not right
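
The byte counts above can be reproduced with a small sketch that works on raw bytes, so it does not depend on any locale or string manager. It treats every byte as a Latin-1 character and encodes it as UTF-8 again. Latin-1 is an assumption standing in for "ANSI" here; it agrees with code page 1252 for the four byte values involved. All names in the program are made up.

```pascal
program DoubleEncodeDemo;

{$mode objfpc}{$H+}

type
  TByteArr = array of Byte;

// Encode each input byte as if it were one Latin-1 character.
// Bytes >= $80 become two UTF-8 bytes, which doubles their size.
function Latin1BytesToUTF8(const Src: array of Byte): TByteArr;
var
  i, n: Integer;
begin
  SetLength(Result, 2 * Length(Src)); // worst case: every byte expands
  n := 0;
  for i := Low(Src) to High(Src) do
    if Src[i] < $80 then
    begin
      Result[n] := Src[i];
      Inc(n);
    end
    else
    begin
      Result[n] := $C0 or (Src[i] shr 6);
      Result[n + 1] := $80 or (Src[i] and $3F);
      Inc(n, 2);
    end;
  SetLength(Result, n);
end;

const
  // The correct UTF-8 bytes for "åäö".
  Good: array[0..5] of Byte = ($C3, $A5, $C3, $A4, $C3, $B6);
var
  Bad: TByteArr;
  i: Integer;
begin
  Bad := Latin1BytesToUTF8(Good);
  WriteLn('correct:        ', Length(Good), ' bytes'); // 6
  WriteLn('double-encoded: ', Length(Bad), ' bytes');  // 12
  for i := 0 to High(Bad) do
    Write(HexStr(Bad[i], 2), ' ');
  WriteLn;
end.
```

The hex dump it prints is what to look for between Int_filename_ and .txt in a hex editor: pairs starting with C3 83 and C2 instead of the plain C3 xx pairs.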

NonSpillable

  • New member
  • *
  • Posts: 7
Re: Unicode won't work no matter what I do.
« Reply #19 on: January 18, 2019, 12:43:07 pm »
Just curious, as I do not use Linux, do you guys need to add a unit like cwstring to have a WideStringManager?

@NonSpillable,

what do you get for:
DefaultSystemCodePage
DefaultFileSystemCodePage
DefaultRTLFileSystemCodePage?

Are you really using Lazarus: 1.6.2 and not 1.8.2 or that's just a typo?

Your testdir.zip testdir.7z do not look right on my side: Win using 7z.

Edit:
Some of your LC_* values are between quotation marks and some without, I don't know if that makes a difference?

Doing a quick search gave me the impression that the values you see for LC_* in a console could be different than their counterpart for a GUI.
I've tried those, thanks to Internet searches for the problem, but to no avail.

However, the problem seems gone now! :) The only thing I did was to run a "dpkg-reconfigure locales", and it solved the problem! What the problem was, what kind of locale some part of the system thought I had, is probably lost to history. Strange thing that it only affected lazarus programs, not my programs in gcc or any other installed software.

No, I double checked. Version 1.6.2 is correct. I'm using Debian and one well known "issue" with Debian is old packages. A package only makes its way to Debian if it's old and stable. It's both a pain and a blessing.
Edit: Could there be any problems with EXT4, with Linux, or with other things than Lazarus? But Lazarus is the only thing not working (that is, FindFirst/FindNext).
FindFirst/FindNext work perfectly well. Your directory and file names are just plain wrong.
I don't understand why you don't see it in your Caja file manager.
See the attached screenshot of my Dolphin file manager.

Quote
1) Side note, but since early 2000 I always thought that the time was ripe for 64-bit, but no, every single time I try 64-bits linux it let me down, something breaks, and breaks bad. This time it was lazarus. Other times it has been CAD-software, visualization software, media players/codecs, etc, etc, etc.
Nonsense. I have used 64-bit Lazarus on Linux for about 7-8 years. Works well.

Well, the names are correct in Caja, in Thunar and in the console. The reason, I think, is that encoding isn't really a thing with file names. There seems to have been something that forced Lazarus (and perhaps 7z, even if I cannot spot a problem) to use a "locale" or "codepage" that is nonsense. As said, a dpkg-reconfigure solved the Lazarus problem.
PS! I didn't say that Lazarus was consistently broken on x64, only that some shit always happens with x64. This time it happens to be Lazarus, but not other times.

NonSpillable

  • New member
  • *
  • Posts: 7
Re: Unicode won't work no matter what I do.
« Reply #20 on: January 18, 2019, 12:48:12 pm »
(Sorry for multiple posts. A lot to address here. The problem is resolved by rebuilding locale on my Debian system, as I said)

Thank you. I had to stab in every direction. Now to the fun part:
åäö is 6 bytes in UTF8.
In his zip file, if you open it with Lazarus and switch the encoding to UTF8, you see:
Int_filename_Ã¥Ã¤Ã¶.txt  instead of Int_filename_åäö.txt
If you check, using a hex editor, how many bytes between Int_filename_ and .txt you'll find *12* bytes, not 6 bytes.
In his zip file every byte of the original UTF8 åäö was considered ANSI and converted to UTF8.
Are you following me here?
Yes! I think so. It looks like a UTF8 string that was converted to UTF8 again (as if it were ANSI/ASCII)! I really wonder why my system ended up with such problems and why Lazarus and possibly 7z were the only software to fail. Let me upload the exact same dir, compressed exactly the same way, again and see if the name encoding has changed.

NonSpillable

  • New member
  • *
  • Posts: 7
Re: Unicode won't work no matter what I do.
« Reply #21 on: January 18, 2019, 12:59:36 pm »
åäö is 6 bytes in UTF8.

In his zip file, if you open it with Lazarus and switch the encoding to UTF8, you see:
Int_filename_Ã¥Ã¤Ã¶.txt  instead of Int_filename_åäö.txt

If you check in this same file, using a hex editor, how many bytes between Int_filename_ and .txt you'll find *12* bytes, not the expected 6 bytes.

In his zip file every byte of the original UTF8 åäö was considered ANSI and converted to UTF8, again.

His system is not right
Yep, that is precisely my interpretation too. But why? That's the big problem with UTF8 and all encoding schemes: there is no way to tell whether a string is already encoded or just has high bits set in an older code page. Lazarus and 7z may simply be the only programs on my system (since they are the only ones causing trouble) that interpret the UTF8 bytes as high values of plain ANSI/extended ASCII. "å", "ä" and "ö" are in upper extended ASCII (>127, that is), and a string parser would yield different things, 3, 6 or 12 bytes, depending on what assumptions it makes about the string. My system clearly uses UTF8 like everyone else's, but some setting forces Lazarus and 7z to assume that the 6-byte UTF8 string "åäö" is a 6-character extended ASCII string.

Thaddy

  • Hero Member
  • *****
  • Posts: 7337
Re: Unicode won't work no matter what I do.
« Reply #22 on: January 18, 2019, 01:05:11 pm »
Note:
In a current Debian (stretch) the versions are Laz 1.8.4 and FPC 3.0.4 when you add backports. Both are the current releases.
Brexit. My Indonesian and Dutch friends know what " Tempo doeloe" means....There is no empire.

engkin

  • Hero Member
  • *****
  • Posts: 2194
Re: Unicode won't work no matter what I do.
« Reply #23 on: Today at 04:13:49 pm »
The problem is resolved by rebuilding locale on my Debian system
Your new zip file has the same problem: 12 bytes instead of 6. The problem is simplified with this code:
Code: Pascal
var
  s: string;
begin
  s := 'Int_filename_Ã¥Ã¤Ã¶.txt'; { Valid UTF8, but the double-encoded (wrong) name }

  SetCodePage(RawByteString(s), 65001, false); { Mark the variable as UTF8 without converting; the bytes are valid UTF8, just not the correct file name }
  SetCodePage(RawByteString(s), 1252, true);   { Convert to an ANSI code page; the resulting bytes are the correct UTF8 name }  //<----
  SetCodePage(RawByteString(s), 65001, false); { Mark the variable as UTF8 again, without converting, to get the real name }

  ShowMessage(s); //or whatever
end;

I used the following code on my Windows system:
Code: Pascal
program Project1;

{$mode objfpc}{$H+}
{$Codepage UTF8}

uses
  {$IFDEF UNIX}{$IFDEF UseCThreads}
  cthreads,
  {$ENDIF}{$ENDIF}
  Classes, windows
  { you can add units after this };

var
  s: string;
  u: UnicodeString;
begin
  s := 'Int_filename_Ã¥Ã¤Ã¶.txt'; { UTF8 }

  SetCodePage(RawByteString(s), 65001, false); { Mark the variable as UTF8 }
  SetCodePage(RawByteString(s), 1252, true);   { Convert to some ANSI code page }
  SetCodePage(RawByteString(s), 65001, false); { Mark the variable as UTF8, again!! }

  u := s;
  MessageBoxW(0, PWideChar(u), PWideChar(u), 0);
end.
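
The same repair can be sketched on raw bytes, without SetCodePage and without relying on the installed string manager. This is not the code above, just a hand-rolled variant under a narrower assumption: it only undoes a Latin-1 layer (two-byte sequences starting with C2 or C3) and gives up on anything else, so characters that exist only in code page 1252 are not handled. All names are made up.

```pascal
program UndoDoubleEncoding;

{$mode objfpc}{$H+}

type
  TByteArr = array of Byte;

// Collapse every two-byte UTF-8 sequence for U+0080..U+00FF back into
// the single byte it was made from. Returns False when the input
// contains anything else, i.e. it is not damaged in the assumed way.
// On False the content of Dst is not meaningful.
function UndoLatin1Layer(const Src: array of Byte; out Dst: TByteArr): Boolean;
var
  i, n: Integer;
begin
  Result := False;
  SetLength(Dst, Length(Src));
  n := 0;
  i := 0;
  while i <= High(Src) do
  begin
    if Src[i] < $80 then
    begin
      Dst[n] := Src[i];
      Inc(i);
    end
    else if ((Src[i] = $C2) or (Src[i] = $C3)) and (i < High(Src))
            and ((Src[i + 1] and $C0) = $80) then
    begin
      Dst[n] := ((Src[i] and $03) shl 6) or (Src[i + 1] and $3F);
      Inc(i, 2);
    end
    else
      Exit;
    Inc(n);
  end;
  SetLength(Dst, n);
  Result := True;
end;

const
  // The 12 bytes found between Int_filename_ and .txt in the zip file.
  Damaged: array[0..11] of Byte =
    ($C3, $83, $C2, $A5, $C3, $83, $C2, $A4, $C3, $83, $C2, $B6);
var
  Fixed: TByteArr;
  i: Integer;
begin
  if UndoLatin1Layer(Damaged, Fixed) then
  begin
    for i := 0 to High(Fixed) do
      Write(HexStr(Fixed[i], 2), ' '); // C3 A5 C3 A4 C3 B6
    WriteLn;
  end
  else
    WriteLn('Input is not double-encoded in the assumed way.');
end.
```

The six bytes that come out are the plain UTF-8 encoding of åäö, which matches what the three SetCodePage calls produce.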