
Author Topic: paramstr and UTF8 on Windows  (Read 15159 times)

giorgiotani

  • Guest
paramstr and UTF8 on Windows
« on: February 04, 2010, 11:54:50 am »
I've noticed that on Windows the values returned by paramstr come as ansistring (using the OBJPAS unit, which is the default for Lazarus), and any extended character outside the machine's current codepage (e.g. a Cyrillic character on a western system) is encoded as "?" (the ord of the character is 63).
This is quite a severe problem for internationalizing an application, since it may receive, for example, a paramstr containing a Windows filename (widestring) with characters outside the machine's codepage.
The information about those characters is lost: because they are replaced by the fixed character 63, it is not even possible to attempt to convert them.
Is it possible to get paramstr in another form, either as widestring or, better, as UTF8string?
Are there any libraries that can add this feature to Lazarus/FPC, or is it planned to be introduced in the future?

As a related issue, FindFirst and related functions on Windows seem, AFAIK, to point only to the ANSI (legacy) version of the Windows API, although the Windows API also provides widestring alternatives.
Would it be possible (or is it planned, or is there a known workaround I am not aware of) to provide file search functions that point to the widestring APIs, in order to correctly handle filenames outside the machine's codepage?
That way all the information about filenames would be preserved, and it could be successfully encoded in UTF8, which is becoming a widely accepted standard (on the web, on *nix, as the main string format in the Lazarus GUI, etc.).

Thanks in advance for any response.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1927

giorgiotani

  • Guest
Re: paramstr and UTF8 on Windows
« Reply #2 on: February 04, 2010, 03:18:02 pm »
Thanks for the reference, but fileutil, as currently implemented, does not solve the problem with widechars outside the machine's codepage, i.e.
- SysToUTF8 simply encodes the string of chars with AnsiToUTF8, which is limited to encoding characters present in the codepage
- the find functions (e.g. FindNextUTF8) simply translate the UTF8 argument to ANSI and then invoke the ANSI procedure FindNext, which relies on the FindNextFile (ANSI) API and writes the result to the ANSI TSearchRec structure, and finally encode the result as UTF8; but of course each of those steps fails to encode characters outside the codepage, replacing them with "?", char 63.
All the functions would need to rely on the "W" (widestring) versions of the Windows API to allow handling filenames containing characters outside the machine's codepage (I know this is due to Windows, which messed things up by supporting two types of encoding rather than relying on something more flexible like UTF8, and by not providing a native UTF8 version of the APIs as they did for widestring encoding, when they replicated all the "A" APIs with a "W" counterpart).

The problem with paramstr is related to this, since Lazarus/FPC applications get it only as an ANSIstring, which cannot keep the information about characters outside the codepage (the conversion fails and gives char 63 as the result), even though the system can correctly handle those filenames through widestring encoding.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1927
Re: paramstr and UTF8 on Windows
« Reply #3 on: February 04, 2010, 04:29:57 pm »
Ah, sorry.
Yes this is a known problem.
See: http://bugs.freepascal.org/view.php?id=15642

giorgiotani

  • Guest
Re: paramstr and UTF8 on Windows
« Reply #4 on: February 04, 2010, 04:39:35 pm »
Thank you very much for the information. I hope it will be supported in the future, since it is a major limitation for true internationalization on the Windows platform.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1927
Re: paramstr and UTF8 on Windows
« Reply #5 on: February 04, 2010, 09:25:19 pm »
Quote from: giorgiotani
Thank you very much for the information. I hope it will be supported in the future, since it is a major limitation for true internationalization on the Windows platform.

You could write this code for Lazarus ;-)
It's quite simple to get the command line in UTF8 on Windows:

Code:
uses Windows;
...
// Full command line (program name plus arguments), converted to UTF8
Edit1.Text := UTF8Encode(WideString(GetCommandLineW));

You could now parse this string, similar to the code in System.setup_arguments.
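
For illustration, here is a minimal sketch of such a parser. It is not the code from System.setup_arguments: SplitCommandLine is a hypothetical helper, and it assumes simplified rules (blanks and tabs separate arguments, double quotes group text, backslash escapes are not handled). Because all delimiters are ASCII, scanning the UTF8 string byte by byte does not break multi-byte characters.

```pascal
program SplitCmdLineDemo;
{$mode objfpc}{$H+}

uses
  Classes;

// Hypothetical helper (not part of the RTL or LCL): splits a UTF8 command
// line into arguments. Simplified rules: blanks and tabs separate arguments,
// double quotes group text and are dropped. Backslash escapes are NOT handled.
procedure SplitCommandLine(const CmdLine: string; Args: TStrings);
var
  i: Integer;
  InQuotes, HaveArg: Boolean;
  Current: string;
begin
  Args.Clear;
  Current := '';
  InQuotes := False;
  HaveArg := False;
  for i := 1 to Length(CmdLine) do
    case CmdLine[i] of
      '"':
        begin
          InQuotes := not InQuotes;
          HaveArg := True; // "" counts as an (empty) argument
        end;
      ' ', #9:
        begin
          if InQuotes then
            Current := Current + CmdLine[i]
          else if HaveArg then
          begin
            // Unquoted whitespace ends the current argument
            Args.Add(Current);
            Current := '';
            HaveArg := False;
          end;
        end;
    else
      begin
        Current := Current + CmdLine[i];
        HaveArg := True;
      end;
    end;
  if HaveArg then
    Args.Add(Current);
end;

var
  Args: TStringList;
  k: Integer;
begin
  Args := TStringList.Create;
  try
    // On Windows the input would be UTF8Encode(WideString(GetCommandLineW));
    // a literal is used here so the sketch runs on any platform.
    SplitCommandLine('app.exe "C:\my docs\file.txt" -v', Args);
    for k := 0 to Args.Count - 1 do
      WriteLn(k, ': ', Args[k]);
  finally
    Args.Free;
  end;
end.
```

The first entry is the program name, so the actual parameters start at index 1, like ParamStr(1).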

There is also a patch for findfirst etc. in fileutil attached here:
http://bugs.freepascal.org/view.php?id=15642
You might test this code; I don't know if it works.
See:
http://wiki.lazarus.freepascal.org/Creating_A_Patch#Applying_a_patch
« Last Edit: February 04, 2010, 09:32:31 pm by theo »

giorgiotani

  • Guest
Re: paramstr and UTF8 on Windows
« Reply #6 on: February 04, 2010, 11:07:03 pm »
Good suggestions, I'll try both applying the patch and parsing GetCommandLineW.
IMHO it would be great to see this in the out-of-the-box IDE.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1927
Re: paramstr and UTF8 on Windows
« Reply #7 on: February 05, 2010, 09:54:47 am »
Quote from: giorgiotani
IMHO it would be great to see this in the out-of-the-box IDE.

Absolutely. If somebody puts it into the box... ;-)
This is a community project.

You can read how to contribute code to the Lazarus codebase on this page:
http://wiki.lazarus.freepascal.org/Creating_A_Patch
« Last Edit: February 05, 2010, 09:56:32 am by theo »

giorgiotani

  • Guest
Re: paramstr and UTF8 on Windows
« Reply #8 on: February 06, 2010, 03:08:43 pm »
I've successfully tested a parser for GetCommandLineW, but when I need to work on files/folders, those filenames run everywhere into the ANSI limitation of the underlying FPC RTL functions.
When the name is used in any of the common file and folder functions, for example fileexists, assignfile, extractfilepath/name/ext, setcurrentdir, mkdir, etc., it needs to be converted to ANSI, which loses the information about extended characters outside the codepage, replacing them with char 63.
All of those functions would need to be patched to call the Windows "W" API counterparts, which would allow the information in the UTF8string to be converted successfully in those cases as well.
Thank you very much for the information provided in this thread, but as far as I currently understand, giving Lazarus/FPC the ability to fully handle a filesystem with extended chars on Windows would be a big task and would require some programmers really fond of RTL development.
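
To show what calling a "W" counterpart could look like for one of these functions, here is a rough, Windows-only sketch. FileExistsWide is a hypothetical helper, not an existing RTL or LCL function, and the path in the main block is only an assumed example. The idea is to decode the UTF8 name to a widestring and pass it to GetFileAttributesW, so the name never goes through the ANSI codepage.

```pascal
program FileExistsWideDemo;
{$mode objfpc}{$H+}

uses
  Windows;

const
  // Value documented by Microsoft for INVALID_FILE_ATTRIBUTES; declared
  // locally so the sketch does not rely on the Windows unit exporting it.
  InvalidAttrs = DWORD($FFFFFFFF);

// Hypothetical helper: checks for a file whose name is given as UTF8,
// using the wide ("W") API so no conversion to the ANSI codepage happens.
function FileExistsWide(const UTF8Name: string): Boolean;
var
  WideName: WideString;
  Attrs: DWORD;
begin
  WideName := UTF8Decode(UTF8Name);
  Attrs := GetFileAttributesW(PWideChar(WideName));
  // The call fails for missing paths; directories are not counted as files
  Result := (Attrs <> InvalidAttrs) and
            ((Attrs and FILE_ATTRIBUTE_DIRECTORY) = 0);
end;

begin
  // Assumed example path, used only to exercise the helper
  WriteLn(FileExistsWide('C:\Windows\notepad.exe'));
end.
```

Every other file function (assignfile, mkdir, setcurrentdir, ...) would need a similar wrapper, which is why this is a big task rather than a quick fix.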

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1927
Re: paramstr and UTF8 on Windows
« Reply #9 on: February 06, 2010, 03:25:18 pm »
Yes, the situation is currently bad. AFAIK the FPC team is waiting for a UnicodeString solution and is not going to change the RTL while it uses the current string types.

So a temporary (partial) solution in the LCL seems to be the only way for now, because AFAIK nobody really knows when UnicodeString will be ready.

http://lists.freepascal.org/lists/fpc-devel/2010-February/019252.html

 
