Author Topic: Using utf8toansi in Linux  (Read 2559 times)

fedkad

  • Full Member
  • ***
  • Posts: 178
Using utf8toansi in Linux
« on: June 30, 2019, 12:35:50 pm »
Using the function s2:=utf8toansi(s1); in Linux seems to have no effect. The same (UTF-8) string is returned.

The page that describes this function (https://www.freepascal.org/docs-html/rtl/system/utf8toansi.html) says:

Quote
For this function to work, a widestring manager must be installed.

On page https://www.freepascal.org/docs-html/rtl/system/setwidestringmanager.html it is said that:

Quote
On Unix and Linux, an implementation based on the C library is available (in unit cwstring).

However, I have no idea on how to use SetWideStringManager.

A working example for Linux will be highly appreciated.
Lazarus 4.0 / FPC 3.2.2 on x86_64-linux-gtk2 (Ubuntu/GNOME) and x86_64-win64-win32/win64 (Windows 11)

Thaddy

  • Hero Member
  • *****
  • Posts: 18975
  • Glad to be alive.
Re: Using utf8toansi in Linux
« Reply #1 on: June 30, 2019, 01:14:28 pm »
It is as easy as including unit cwstrings in the uses clause.

Well, that was easy to answer..... 8-) You probably did not fully understand the documentation.
And cwstrings is Unix-only; on Windows this is not necessary.
SetWideStringManager is automatically called when cwstrings is included.
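
For reference, a minimal sketch of what that looks like in a complete program. The unit is spelled cwstring in the documentation quoted above. The program and variable names are made up for illustration, and wrapping the uses clause in a conditional is just one assumed way to keep the same source compiling on Windows.

```pascal
program Utf8ToAnsiDemo;

{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring; // installs the C-library-based widestring manager at startup
{$endif}

var
  S1, S2: RawByteString;
begin
  S1 := 'abc';
  // Label the bytes as UTF-8 without converting them.
  SetCodePage(S1, CP_UTF8, False);
  // Convert from UTF-8 to the system ("Ansi") code page.
  S2 := Utf8ToAnsi(S1);
  WriteLn('Input length:  ', Length(S1));
  WriteLn('Output length: ', Length(S2));
end.
```

No explicit call to SetWideStringManager appears anywhere; listing the unit is the whole setup.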
« Last Edit: June 30, 2019, 01:17:49 pm by Thaddy »
Recovered from removal of tumor in tongue following tongue reconstruction with a part from my leg.

lucamar

  • Hero Member
  • *****
  • Posts: 4217
Re: Using utf8toansi in Linux
« Reply #2 on: June 30, 2019, 01:36:41 pm »
Note that using cwstring links your program to the C library. If you don't want that, you can instead use fpwidestring, which is native Pascal, although it's not as "automatic" as cwstring (the default collation has to be set manually).

Because of their simplicity, they are both well documented. Not much to it, really :)
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

fedkad

  • Full Member
  • ***
  • Posts: 178
Re: Using utf8toansi in Linux
« Reply #3 on: June 30, 2019, 01:47:44 pm »
Thaddy,

The unit is called cwstring, and I did include it at the top of my uses clause.

However, the code: length(utf8toansi('ç')) returns 2. Or, the code: length(utf8toansi('x')) returns 5.

Similarly, the code: utf8toansi('ç')='ç' returns True. Or, the code: utf8toansi('x')='x' returns True.

Note: In the last examples 'x' is a smiling face: Unicode hex code $1F929. The forum software had problems including this character.

Zoran

  • Hero Member
  • *****
  • Posts: 1988
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: Using utf8toansi in Linux
« Reply #4 on: June 30, 2019, 01:58:09 pm »
Fedkad, you misunderstood what "Ansi" means in this function.
Utf8ToAnsi converts a UTF-8 encoded string to the "Ansi" encoding, which here actually means the "system code page".

In Windows, there is one ANSI code page (really ANSI, one byte per character) that the system uses as its default one-byte encoding.
In most Linux distros, the "system code page" is UTF-8.

So, normally, this function has no effect in Linux: it converts UTF-8 to UTF-8.

If you have a utf8 string that you should convert to some specific code page, you should use this:
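Code: Pascal
  // Re-encode a UTF-8 string as Windows-1250.
  function ConvertToCP1250(const S: RawByteString): RawByteString;
  begin
    Result := S;
    // Label the bytes as UTF-8 without converting them.
    SetCodePage(Result, CP_UTF8, False);
    // Convert the contents to code page 1250.
    SetCodePage(Result, 1250, True);
  end;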

Of course, replace 1250 with the encoding you actually want.

And you should know which encoding you want to convert to. Even in Windows, it is a bad idea to rely on the "system installed" code page.
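
To see which code page the RTL treats as the system one, and to try the conversion with an explicit target, something like the following sketch might help. ConvertUtf8ToCodePage and the program name are hypothetical names for illustration; the helper simply repeats the two SetCodePage steps above with the target passed in as a parameter.

```pascal
program CodePageDemo;

{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring; // widestring manager, needed for the actual conversion on Unix
{$endif}

// Hypothetical helper: the same two steps, with the target code page
// supplied by the caller instead of hard-coded.
function ConvertUtf8ToCodePage(const S: RawByteString;
  TargetCP: TSystemCodePage): RawByteString;
begin
  Result := S;
  SetCodePage(Result, CP_UTF8, False);  // label as UTF-8, no conversion
  SetCodePage(Result, TargetCP, True);  // convert to the target code page
end;

var
  Src, Dst: RawByteString;
begin
  // The code page that "Ansi" refers to on this system.
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);

  Src := #$C3#$A7; // UTF-8 bytes of U+00E7 (c with cedilla)
  Dst := ConvertUtf8ToCodePage(Src, 1250);
  WriteLn('Result code page: ', StringCodePage(Dst));
  WriteLn('Length before: ', Length(Src), ', after: ', Length(Dst));
end.
```

If DefaultSystemCodePage is reported as the UTF-8 code page, that would explain why Utf8ToAnsi hands back the same bytes.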
« Last Edit: June 30, 2019, 02:03:13 pm by Zoran »
Swan, ZX Spectrum emulator https://github.com/zoran-vucenovic/swan

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12776
  • FPC developer.
Re: Using utf8toansi in Linux
« Reply #5 on: June 30, 2019, 02:16:37 pm »
Zoran says what I was thinking too.

Better explain what you are trying to do. If you have a database that delivers some windows ansi encoding, then the best way is to tweak the connection or database itself to convert it on the fly to utf8 on *nix.

Doing manual conversions should be a last resort.

fedkad

  • Full Member
  • ***
  • Posts: 178
Re: Using utf8toansi in Linux
« Reply #6 on: June 30, 2019, 02:34:57 pm »
Thanks Zoran for the clarification and the example!

Marcov: I am actually trying to find a way to convert a UTF-8 string to its "closest" ASCII version. The code I was using in Windows was something like:

Code: Pascal
  procedure TForm1.Button1Click(Sender: TObject);
  type
    // AnsiString tagged with code page 20127 (US-ASCII)
    USASCIIString = type AnsiString(20127);
  begin
    // The cast converts the text to US-ASCII; String() converts it back.
    memo2.text := String(USASCIIString(memo1.text));
  end;

However, this does not work in Linux: it returns question marks for every non-ASCII character. In a way, it acts like Zoran's example with the code page set to 20127. In Windows it was replacing some non-ASCII Latin-based letters with their "closest" ASCII letters, which is what I want to do in Linux as well.

This is discussed in topic: https://forum.lazarus.freepascal.org/index.php/topic,45802.0.html
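
One portable fallback, if the code page conversion cannot be made to do this on Linux, would be a hand-written mapping table. The following is only a rough sketch of that idea: ClosestAscii is a made-up helper name, the table covers just a handful of Latin-1 letters, and it is not meant to reproduce what Windows does internally.

```pascal
program ClosestAsciiDemo;

{$mode objfpc}{$H+}

// Hypothetical helper: maps a few accented Latin-1 letters to unaccented
// ASCII and everything else outside ASCII to '?'.
// The table is deliberately incomplete.
function ClosestAscii(const Utf8Text: RawByteString): AnsiString;
var
  U: UnicodeString;
  I: Integer;
begin
  U := UTF8Decode(Utf8Text); // UTF-8 bytes -> UTF-16 code units
  Result := '';
  for I := 1 to Length(U) do
    case Ord(U[I]) of
      $00..$7F: Result := Result + Chr(Ord(U[I])); // already ASCII
      $C0..$C5: Result := Result + 'A';
      $C7:      Result := Result + 'C';
      $C8..$CB: Result := Result + 'E';
      $CC..$CF: Result := Result + 'I';
      $D1:      Result := Result + 'N';
      $D2..$D6: Result := Result + 'O';
      $D9..$DC: Result := Result + 'U';
      $E0..$E5: Result := Result + 'a';
      $E7:      Result := Result + 'c';
      $E8..$EB: Result := Result + 'e';
      $EC..$EF: Result := Result + 'i';
      $F1:      Result := Result + 'n';
      $F2..$F6: Result := Result + 'o';
      $F9..$FC: Result := Result + 'u';
    else
      // No mapping; a character stored as a surrogate pair gives two '?'.
      Result := Result + '?';
    end;
end;

begin
  // #$C3#$A7 is the UTF-8 encoding of U+00E7 (c with cedilla).
  WriteLn(ClosestAscii('fa' + #$C3#$A7 + 'ade'));
end.
```

The table would have to be extended by hand for every letter that matters, which is the obvious drawback compared with letting the operating system do it.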
« Last Edit: June 30, 2019, 02:39:01 pm by fedkad »

Zoran

  • Hero Member
  • *****
  • Posts: 1988
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: Using utf8toansi in Linux
« Reply #7 on: June 30, 2019, 03:59:25 pm »
Try with CP_ASCII, so like this:

Code: Pascal
  // Re-encode a UTF-8 string as US-ASCII.
  function ConvertToASCII(const S: RawByteString): RawByteString;
  begin
    Result := S;
    // Label the bytes as UTF-8 without converting them.
    SetCodePage(Result, CP_UTF8, False);
    // Convert the contents to CP_ASCII (20127).
    SetCodePage(Result, CP_ASCII, True);
  end;

What do you get with this?

For Latin letters, it should remove accents and give you the "closest" ASCII letter.
For other characters (e.g. Cyrillic or Greek letters), it just returns question marks, though.

EDIT:
Now I see that the constant CP_ASCII is actually declared as 20127, which you tried already...  :'(
« Last Edit: June 30, 2019, 04:07:56 pm by Zoran »

fedkad

  • Full Member
  • ***
  • Posts: 178
Re: Using utf8toansi in Linux
« Reply #8 on: June 30, 2019, 04:05:53 pm »
Zoran: In Windows it works exactly as you described, just like the example I mentioned above.

However, in Linux it returns question marks for everything except ASCII characters, again giving the same kind of output as my example.

Zoran

  • Hero Member
  • *****
  • Posts: 1988
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: Using utf8toansi in Linux
« Reply #9 on: June 30, 2019, 04:11:28 pm »
Quote
Zoran: In Windows it works exactly as you described, just like the example I mentioned above.
However, in Linux it returns question marks for everything except ASCII characters, again giving the same kind of output as my example.

Yes, I am currently on Windows, sorry.
And I see now that constant CP_ASCII is declared as 20127, which you had already tried.

 
