
Author Topic: Unicode won't work no matter what I do.  (Read 6737 times)

NonSpillable

  • New Member
  • *
  • Posts: 10
Unicode won't work no matter what I do.
« on: January 14, 2019, 06:03:28 am »
Hi. Since upgrading my Debian installation to the latest Lazarus (from Debian's repo), none of my programs that access files work. Nordic characters like Å, Ä and Ö are replaced by "?" in all components when reading a file name.

I created a new test application to try different things, attached below. Note that my test app adds "åäö" to a TMemo to demonstrate that it's not a font issue. FindFirst/FindFirstUTF8 and every conceivable combination of Utf8ToSys, Ansitowhatever and Utftowhatever make no difference at all. How do I fix this?

My system: Debian 9 64-bit. Lazarus: 1.6.2+dfsg-2 date 2019-01-12, FPC version 3.0.0.
Output from '$ locale':
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME=en_AU.utf8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

CCRDude

  • Hero Member
  • *****
  • Posts: 600
Re: Unicode won't work no matter what I do.
« Reply #1 on: January 14, 2019, 07:28:53 am »
In your test application, where do you take the string from? If you use it as a const, have you specified {$codepage utf8}?

egsuh

  • Hero Member
  • *****
  • Posts: 1292
Re: Unicode won't work no matter what I do.
« Reply #2 on: January 14, 2019, 07:45:18 am »
In Windows, there is something called "locale", or "Locale for non unicode-supporting applications" in full. Normally this should be the same as the operating system language, but I could set it to another language, and then characters are not displayed correctly in some applications. Not sure about Linux.

NonSpillable

  • New Member
  • *
  • Posts: 10
Re: Unicode won't work no matter what I do.
« Reply #3 on: January 14, 2019, 09:33:27 am »
Quote
In your test application, where do you take the string from? If you use it as a const, have you specified {$codepage utf8}?
From either a TSearchRec.Name via FindFirst/Next or FindFirstUTF8/NextUTF8 or from a TSearchRecUTF8. And a short constant in the code containing "åäö" is added to the TMemo to demonstrate that it is not a font issue. That is the "åäö" in the first line in the TMemo. As you see, reading file names from disk doesn't work.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4468
  • I like bugs.
Re: Unicode won't work no matter what I do.
« Reply #4 on: January 14, 2019, 11:20:00 am »
Your code works well but the file and directory names in testdir are not UTF-8.
It feels strange because in a Linux system everything is UTF-8 by default. You must have copied it from a Windows PC.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

NonSpillable

  • New Member
  • *
  • Posts: 10
Re: Unicode won't work no matter what I do.
« Reply #5 on: January 14, 2019, 11:56:06 am »
Quote
Your code works well but the file and directory names in testdir are not UTF-8.
It feels strange because in a Linux system everything is UTF-8 by default. You must have copied it from a Windows PC.
Nope, they were created in Caja (MATE's file manager). No Windows here. If the file names are not UTF-8 (how do you know?), they must be something older, and should therefore work as well. The files were created today, with Caja, on an ext4 file system, on an up-to-date Debian installation.

Edit: Let me upload the exact same files but in a 7z-archive... Edit2: And one file raw.
« Last Edit: January 14, 2019, 11:59:21 am by NonSpillable »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4468
  • I like bugs.
Re: Unicode won't work no matter what I do.
« Reply #6 on: January 14, 2019, 02:03:31 pm »
Quote
Nope, they are created in Caja (MATE's file manager). No Windows here. If the file names are not UTF8 (how do you know?),
I know by clicking your package which then opens "Ark" in my Manjaro Linux. Ark comes with KDE.
No need to even extract the files. I can see in the Ark window the encoding is wrong.

Quote
they must be something older, and should therefore work as well.
What do you mean? Things work with UTF-8 encoding. Old or new, doesn't matter.

Quote
Edit: Let me upload the exact same files but in a 7z-archive... Edit2: And one file raw.
The single file has the right encoding. It means your archiving process goes wrong.
Let me guess, you run 7z under Wine. Hah! Why would you do that? :)
BTW, you should see the wrong encoding right in the Caja window. You don't need a test app for that.
« Last Edit: January 14, 2019, 02:05:31 pm by JuhaManninen »

NonSpillable

  • New Member
  • *
  • Posts: 10
Re: Unicode won't work no matter what I do.
« Reply #7 on: January 14, 2019, 04:23:54 pm »
Quote
The single file has the right encoding. It means your archiving process goes wrong.
Let me guess, you run 7z under Wine. Hah! Why would you do that? :)
Wine? What is it with people and Windows? Noooo, I haven't run Windows, nor any Windows program, for nearly two decades! I'm using stock zip and stock 7z, from the GUI (Caja in this case). Nothing here involves Windows, OS X, BSD or any OS other than Debian 9. None of the files I have uploaded here have been anywhere near Windows or Wine, and they were all created today or yesterday.

What I meant by "older" was that there was a time before UTF-8, when we had code pages for international characters. No matter *what* encoding the file system uses, Lazarus is the only thing not handling characters correctly. On my disk I have several decades of files, some with Swedish (åäöÅÄÖ) characters, created in different programs and on different OSes (some as old as C64, Amiga, Atari, etc.) and yes, even files created in DOS. I have never had any problem with any application until now, when I did a fresh install of Debian 9 AMD64 (I usually use 32-bit) on an i7 with a fresh Lazarus/FPC, and suddenly all my programs stopped working (I had to recompile them, since I migrated from 32-bit to 64-bit¹). But no other application seems to have any problem, new or old, with new or old files.

Edit: Could the problem be with ext4, with Linux, or with something other than Lazarus? But Lazarus is the only thing not working (that is, FindFirst/FindNext).

1) Side note: since early 2000 I have always thought that the time was ripe for 64-bit, but no, every single time I try 64-bit Linux it lets me down; something breaks, and breaks badly. This time it was Lazarus. Other times it has been CAD software, visualization software, media players/codecs, etc.
« Last Edit: January 14, 2019, 05:15:08 pm by NonSpillable »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Unicode won't work no matter what I do.
« Reply #8 on: January 14, 2019, 05:30:55 pm »
Just curious, as I do not use Linux, do you guys need to add a unit like cwstring to have a WideStringManager?

@NonSpillable,

what do you get for:
DefaultSystemCodePage
DefaultFileSystemCodePage
DefaultRTLFileSystemCodePage?

Are you really using Lazarus 1.6.2 and not 1.8.2, or is that just a typo?

Your testdir.zip and testdir.7z do not look right on my side (Windows, using 7z).

Edit:
Some of your LC_* values are between quotation marks and some without, I don't know if that makes a difference?

Doing a quick search gave me the impression that the values you see for LC_* in a console could be different than their counterpart for a GUI.
« Last Edit: January 14, 2019, 05:54:47 pm by engkin »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4468
  • I like bugs.
Re: Unicode won't work no matter what I do.
« Reply #9 on: January 14, 2019, 08:16:29 pm »
Quote
Edit: Could the problem be with ext4, with Linux, or with something other than Lazarus? But Lazarus is the only thing not working (that is, FindFirst/FindNext).
FindFirst/FindNext work perfectly well. Your directory and file names are just plain wrong.
I don't understand why you don't see it in your Caja file manager.
See the attached screenshot of my Dolphin file manager.

Quote
1) Side note: since early 2000 I have always thought that the time was ripe for 64-bit, but no, every single time I try 64-bit Linux it lets me down; something breaks, and breaks badly. This time it was Lazarus. Other times it has been CAD software, visualization software, media players/codecs, etc.
Nonsense. I have used 64-bit Lazarus on Linux for about 7-8 years. Works well.
« Last Edit: January 14, 2019, 08:32:47 pm by JuhaManninen »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4468
  • I like bugs.
Re: Unicode won't work no matter what I do.
« Reply #10 on: January 14, 2019, 08:25:18 pm »
Quote
Just curious, as I do not use Linux, do you guys need to add a unit like cwstring to have a WideStringManager?
WideStringManager is only needed on Windows. Why do you mix it here?

Quote
what do you get for:
DefaultSystemCodePage
DefaultFileSystemCodePage
DefaultRTLFileSystemCodePage?
A user does not need to set such things, especially not on Linux. Why do you mix it here?

Quote
Are you really using Lazarus 1.6.2 and not 1.8.2, or is that just a typo?
Makes no difference. Both use the new UTF-8 system. On Linux it matters even less.

Quote
Your testdir.zip and testdir.7z do not look right on my side (Windows, using 7z).
Exactly, now you are on the right track.

Quote
Some of your LC_* values are between quotation marks and some without, I don't know if that makes a difference?
What LC_* values?

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4468
  • I like bugs.
Re: Unicode won't work no matter what I do.
« Reply #11 on: January 14, 2019, 08:43:10 pm »
Quote
In Windows, there is something called "locale", or "Locale for non unicode-supporting applications" in full. Normally this should be the same as the operating system language, but I could set it to another language, and then characters are not displayed correctly in some applications. Not sure about Linux.
It is specific to Windows. Linux typically uses UTF-8 everywhere although there is no standard for that.
I don't know why you bring the Locales up in a Linux question.

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: Unicode won't work no matter what I do.
« Reply #12 on: January 14, 2019, 10:17:32 pm »
Quote
WideStringManager is only needed on Windows. Why do you mix it here?

IIRC then a WideStringManager is needed on Linux as well to have full unicode support.
Otherwise things like WideCompare* won't work.
See the comments in LazUtf8.
(This also means that any LCL program on *nix already has CWString unit enabled.)

Of course this makes no difference at all if the filename is encoded wrongly in the first place.

Bart

BeniBela

  • Hero Member
  • *****
  • Posts: 906
    • homepage
Re: Unicode won't work no matter what I do.
« Reply #13 on: January 14, 2019, 10:31:45 pm »
Linux filenames have no encoding.  Any byte sequence not containing #0 and '/' is allowed, like an arbitrary C-string.
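
One way to check this outside Lazarus is to list a directory as raw bytes and see which names fail to decode as UTF-8. The following is only a minimal sketch in Python, not something posted in this thread. The function names `is_valid_utf8` and `non_utf8_names` are made up for illustration, and it assumes a Unix-like system, where `os.listdir` called with a bytes path returns the names as undecoded bytes.

```python
import os


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def non_utf8_names(directory):
    """Return the raw (bytes) names in `directory` that are not valid UTF-8."""
    # Passing a bytes path makes os.listdir return undecoded bytes names,
    # so nothing is converted or replaced before we look at it.
    raw_names = os.listdir(os.fsencode(directory))
    return sorted(name for name in raw_names if not is_valid_utf8(name))
```

Calling `non_utf8_names("testdir")` on the directory from the first post would return an empty list if every name is valid UTF-8. Printing any returned names with `repr()` shows the offending bytes as `\x` escapes.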

Btw, I have collected some unusual file names here. Proper file handling needs to support all of them.

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: Unicode won't work no matter what I do.
« Reply #14 on: January 14, 2019, 10:44:31 pm »
Well, then I mean malformed byte sequences that do not represent any valid UTF8 codepoint.
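
To illustrate what such a malformed sequence looks like, here is a small Python sketch. It assumes, purely for illustration, that a name was written in Latin-1; the thread never establishes which encoding was actually involved, and `decode_for_display` is a made-up helper.

```python
# "åäö" written with escapes so the example does not depend on how this
# source text itself is encoded.
name = "\u00e5\u00e4\u00f6"

# The same three characters as bytes in two encodings.
utf8_bytes = name.encode("utf-8")      # two bytes per character
latin1_bytes = name.encode("latin-1")  # one byte per character (assumed legacy encoding)


def decode_for_display(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="replace")


for label, raw in (("UTF-8", utf8_bytes), ("Latin-1", latin1_bytes)):
    # bytes.hex() shows exactly what would be stored in the file name.
    print(label, raw.hex(), ascii(decode_for_display(raw)))
```

Under that assumption, the single-byte form cannot be decoded as UTF-8, so a program that expects UTF-8 has to substitute a placeholder for it. That would be consistent with the "?" characters described in the first post.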

Do the Lazarus file I/O routines correctly handle your collection of "nasty files"?

Bart

 
