Author Topic: Converting non-ASCII characters to closest ASCII [in Linux]  (Read 6390 times)

fedkad

  • Full Member
  • ***
  • Posts: 178
I have a UTF-8 encoded string (the default in Lazarus) that contains ASCII and non-ASCII characters. The string contains mainly text from Latin-based languages. I want to convert any non-ASCII characters to their closest ASCII character. For example: "ÁÑáñ¢" to "ANanc".

The code:

Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
type
  // AnsiString tagged with code page 20127 (US-ASCII)
  USASCIIString = type AnsiString(20127);
begin
  memo2.text := String(USASCIIString(memo1.text));
end;

works for me. However, this only works in Windows. In Linux it returns question marks only.

Do you have any suggestion for a quick (=fast) similar method to convert UTF-8 Latin characters to their nearest ASCII characters (in Linux)?

Notes:
  • I do not want to reinvent the wheel!
  • I am only interested in converting characters for Latin-based languages.
  • Some idiosyncrasies (like https://en.wikipedia.org/wiki/Dotted_and_dotless_I) will be handled separately and specially by my code before this conversion.
« Last Edit: June 22, 2019, 03:54:35 pm by fedkad »
Lazarus 4.0 / FPC 3.2.2 on x86_64-linux-gtk2 (Ubuntu/GNOME) and x86_64-win64-win32/win64 (Windows 11)

Bart

  • Hero Member
  • *****
  • Posts: 5713
    • Bart en Mariska's Webstek
Re: Converting non-ASCII characters to closest ASCII [in Linux]
« Reply #1 on: June 21, 2019, 10:57:18 pm »
I started on one some weeks ago (similar question on the forum).
However, there are so many characters that can be "translated" to ASCII that I gave up.
And what to do with U+00C6, should that be "translated" to 'AE'?
And the Greek omega: 'oo'? or 'o', chi to 'ch'? etc. etc.

After writing down 8.5 A4 pages of Unicode code points by hand, I gave up on the idea as pretty much not feasible.

B.t.w. did you include the cwstring unit (it installs the widestring manager)?
Maybe it helps.

Bart

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Converting non-ASCII characters to closest ASCII [in Linux]
« Reply #2 on: June 22, 2019, 12:09:32 am »
Did you have a look at
fpcsrc/3.0.4/packages/iconvenc

This is a wrapper for the iconv library (the same conversion routines behind the Linux iconv utility)

Small documentation at
https://wiki.freepascal.org/iconvenc
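
For what it is worth, a minimal sketch of how the iconvenc unit might be called for this kind of conversion. It assumes the `Iconvert(input, output, from, to)` function from that package returns 0 on success, that the source file is saved as UTF-8, and that the C library's iconv accepts the `//TRANSLIT` suffix (a glibc/libiconv extension). Pulling in `clocale` is also an assumption: transliteration results on glibc can depend on the active locale, so treat the output as something to check on your own system rather than a guarantee.

```pascal
program TranslitDemo;

{$mode objfpc}{$H+}

uses
  clocale,   // assumption: sets the process locale from the environment at startup
  iconvenc;  // FPC binding to iconv (packages/iconvenc)

var
  Input, Output: string;
begin
  Input := 'áéíóúüþëé€';  // literal bytes are UTF-8 if the source file is saved as UTF-8

  // Iconvert is assumed to return 0 on success and a non-zero value on failure.
  if Iconvert(Input, Output, 'UTF-8', 'ASCII//TRANSLIT') = 0 then
    WriteLn(Output)
  else
    WriteLn('iconv conversion failed');
end.
```

Calling the library in-process like this avoids starting an external iconv process for every string, which matters if the conversion runs inside a search loop over large files.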

Winni

engkin

  • Hero Member
  • *****
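  • Posts: 3112
Re: Converting non-ASCII characters to closest ASCII [in Linux]
« Reply #3 »
You might want to try this.
https://forum.lazarus.freepascal.org/index.php/topic,44331.msg311606.html#msg311606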

fedkad

Re: Converting non-ASCII characters to closest ASCII [in Linux]
« Reply #4 on: June 22, 2019, 12:30:10 pm »
Quote
I started on one some weeks ago (similar question on the forum).
However, there are so many characters that can be "translated" to ASCII that I gave up.
I don't want to reinvent the wheel.

Quote
And what to do with U+00C6, should that be "translated" to 'AE'?
The code I presented in my original question converts it to 'A' (in Windows).

Quote
And the Greek omega: 'oo'? or 'o', chi to 'ch'? etc. etc.
I am only interested in Latin-based characters. Greek, Cyrillic, Arabic, Asian, etc. characters should be converted to '?'

Quote
B.t.w. did you include the cwstring unit (it installs the widestring manager)?
Maybe it helps.
Bart
Including the unit cwstring had no effect! :(

fedkad

Re: Converting non-ASCII characters to closest ASCII [in Linux]
« Reply #5 on: June 22, 2019, 12:37:36 pm »
Quote
You might want to try this.
https://forum.lazarus.freepascal.org/index.php/topic,44331.msg311606.html#msg311606

It is not exactly what I am looking for. I want an output string that will contain only ASCII characters. (Any unrecognized characters should be converted to an ASCII character, like '?'.)

fedkad

Re: Converting non-ASCII characters to closest ASCII [in Linux]
« Reply #6 on: June 22, 2019, 12:44:52 pm »
Quote
Did you have a look at
fpcsrc/3.0.4/packages/iconvenc

This is a wrapper for the iconv library (the same conversion routines behind the Linux iconv utility)

Small documentation at
https://wiki.freepascal.org/iconvenc
Winni

Yes. I want something like the Linux command:

Code:
iconv -f UTF-8 -t ASCII//TRANSLIT
How can I use it in Lazarus? And would it be CPU-efficient? I will use it in a loop that will do an "ASCII-based" search in large files that contain non-ASCII text.
« Last Edit: June 22, 2019, 12:47:30 pm by fedkad »

engkin

Re: Converting non-ASCII characters to closest ASCII [in Linux]
« Reply #7 on: June 22, 2019, 12:52:44 pm »
Quote
You might want to try this.
https://forum.lazarus.freepascal.org/index.php/topic,44331.msg311606.html#msg311606

It is not exactly what I am looking for. I want an output string that will contain only ASCII characters. (Any unrecognized characters should be converted to an ASCII character, like '?'.)

Since the string is UTF8 encoded (before and after using this function), any char>#7F is not ASCII.

dbannon

  • Hero Member
  • *****
  • Posts: 3777
    • tomboy-ng, a rewrite of the classic Tomboy
Re: Converting non-ASCII characters to closest ASCII [in Linux]
« Reply #8 on: June 22, 2019, 01:26:45 pm »
Quote
It is not exactly what I am looking for. I want an output string that will contain only ASCII characters. (Any unrecognized characters should be converted to an ASCII character, like '?'.)

See https://wiki.freepascal.org/UTF8_strings_and_characters

So all you want to do is copy the UTF-8 string to another string that contains only ASCII characters, replacing anything that is not already ASCII with '?'?

Iterate over the first string, one UTF-8 character at a time ( https://wiki.freepascal.org/UTF8_strings_and_characters#Iterating_over_string_analysing_individual_codepoints ). Each time, if it is a single byte, add it to the new string; if it is not, add '?' to the new string. Er, if it is already '?', that's messy ....
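
A rough sketch of that loop, working directly on the bytes rather than through the LazUTF8 helpers. The function name `ToAsciiOrQuestionMark` is made up for illustration, and the code assumes the input is valid UTF-8: bytes below $80 are copied, each lead byte of a multi-byte sequence becomes one '?', and continuation bytes are dropped.

```pascal
program AsciiOnly;

{$mode objfpc}{$H+}

// Replace every non-ASCII UTF-8 code point with a single '?'.
function ToAsciiOrQuestionMark(const S: string): string;
var
  i, n: Integer;
  b: Byte;
begin
  SetLength(Result, Length(S));   // the output is never longer than the input
  n := 0;
  for i := 1 to Length(S) do
  begin
    b := Ord(S[i]);
    if b < $80 then
    begin
      // plain ASCII byte: copy unchanged
      Inc(n);
      Result[n] := S[i];
    end
    else if (b and $C0) <> $80 then
    begin
      // lead byte of a multi-byte sequence: one '?' for the whole code point
      Inc(n);
      Result[n] := '?';
    end;
    // continuation bytes ($80..$BF) are skipped
  end;
  SetLength(Result, n);
end;

begin
  // 'abc', U+00E1 as UTF-8, 'd', U+20AC as UTF-8
  WriteLn(ToAsciiOrQuestionMark('abc' + #$C3#$A1 + 'd' + #$E2#$82#$AC));  // prints "abc?d?"
end.
```

A '?' that was already in the input comes out unchanged, so afterwards it cannot be told apart from a replaced character; that is the messy part mentioned above.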

Now, actually translating the multibyte char to an 'equivalent' ascii character is another matter altogether. You need to define what you mean by equivalent I'm afraid.

Davo
Lazarus 3, Linux (and reluctantly Win10/11, OSX Monterey)
My Project - https://github.com/tomboy-notes/tomboy-ng and my github - https://github.com/davidbannon

fedkad

Re: Converting non-ASCII characters to closest ASCII [in Linux]
« Reply #9 on: June 22, 2019, 01:30:54 pm »
Quote
Now, actually translating the multibyte char to an 'equivalent' ascii character is another matter altogether. You need to define what you mean by equivalent I'm afraid.

I already defined what I need (something similar to iconv -f UTF-8 -t ASCII//TRANSLIT). It should be very efficient CPU-wise. See my posts above.

dbannon

Re: Converting non-ASCII characters to closest ASCII [in Linux]
« Reply #10 on: June 22, 2019, 01:48:58 pm »
OK, you could call iconv as an external process? That would be slower than a conversion within your own code, but it would only be a problem if you have a lot of characters to convert.

I think it would be a bit ugly too!

I guess iconv uses a lookup table; maybe you need to have a look at that table .....
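
To give an idea of what a lookup-table approach could look like in your own code, here is a small sketch that covers only U+00C0..U+00FF (the accented letters of the Latin-1 Supplement block). The table `Latin1Map` and the function `FoldToAscii` are hypothetical and hand-made, not taken from iconv; choices such as mapping Æ to 'A' or ß to 's' are assumptions you would adjust. Everything outside the table becomes '?'.

```pascal
program Latin1Fold;

{$mode objfpc}{$H+}

const
  // Hand-made table for U+00C0..U+00FF, 16 entries per line (illustrative only).
  Latin1Map: string =
    'AAAAAAACEEEEIIII' +   // U+00C0..U+00CF
    'DNOOOOOxOUUUUY?s' +   // U+00D0..U+00DF
    'aaaaaaaceeeeiiii' +   // U+00E0..U+00EF
    'dnooooo?ouuuuy?y';    // U+00F0..U+00FF

function FoldToAscii(const S: string): string;
var
  i, n, cp: Integer;
  b: Byte;
begin
  SetLength(Result, Length(S));  // one output char per code point, so never longer
  n := 0;
  i := 1;
  while i <= Length(S) do
  begin
    b := Ord(S[i]);
    Inc(n);
    if b < $80 then
    begin
      Result[n] := S[i];         // ASCII: copy as is
      Inc(i);
    end
    else if ((b = $C2) or (b = $C3)) and (i < Length(S))
            and ((Ord(S[i + 1]) and $C0) = $80) then
    begin
      // two-byte sequence encoding U+0080..U+00FF
      cp := ((b and $1F) shl 6) or (Ord(S[i + 1]) and $3F);
      if cp >= $C0 then
        Result[n] := Latin1Map[cp - $C0 + 1]
      else
        Result[n] := '?';
      Inc(i, 2);
    end
    else
    begin
      // anything else: one '?' per code point, then skip its continuation bytes
      Result[n] := '?';
      Inc(i);
      while (i <= Length(S)) and ((Ord(S[i]) and $C0) = $80) do
        Inc(i);
    end;
  end;
  SetLength(Result, n);
end;

begin
  // UTF-8 bytes for U+00C1, U+00D1, U+00E1, U+00F1, written as escapes
  WriteLn(FoldToAscii(#$C3#$81#$C3#$91#$C3#$A1#$C3#$B1));  // prints "ANan"
end.
```

Because it is a single pass with one table index per character, this should be cheap enough for a search loop; extending it to Latin Extended-A would mean a second table for U+0100..U+017F.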

Davo


engkin

Re: Converting non-ASCII characters to closest ASCII [in Linux]
« Reply #11 on: June 22, 2019, 02:17:18 pm »
iconv -f UTF-8 -t ASCII//TRANSLIT will change a Euro sign into EUR. So this test "áéíóúüþëé€" will become "aeiouu?eeEUR"

The iconv function is available in FPC in the units xmliconv and xmliconv_windows. Of course, you need to have the iconv library (iconv.dll/.so) on your system.

engkin

Re: Converting non-ASCII characters to closest ASCII [in Linux]
« Reply #12 on: June 22, 2019, 03:35:27 pm »
I don't know if this still works on newer Windows. The CharMap utility gets the names of characters using a DLL named GetUName:
Code: Pascal
program Project1;

{$mode objfpc}{$H+}
{$Codepage UTF8}

uses
  Classes, LazUTF16
  { you can add units after this };

// Pass a Unicode value and get the name of the corresponding code point.
function GetUName(AUnicode: Cardinal; ABuf: array of WideChar): Integer; stdcall; external 'GetUName.dll';

// Convenience wrapper that returns the name as a UnicodeString.
function GetSysName(AUnicode: Cardinal): UnicodeString;
var
  Buf: array[0..511] of WideChar;
begin
  GetUName(AUnicode, Buf);
  Result := UnicodeString(Buf);
end;

var
  s: String;
  u: UnicodeString;
  c: Cardinal = 0;
  Len: Integer;
  p, pEnd: PWideChar;
begin
  SetMultiByteConversionCodePage(CP_UTF8);
  s := 'áéíóúüþëé€';
  u := UnicodeString(s);
  p := @u[1];
  pEnd := p + Length(u);
  // Walk the UTF-16 string one code point at a time and print each name.
  while pEnd > p do
  begin
    c := UTF16CharacterToUnicode(p, Len);
    WriteLn(GetSysName(c));
    Inc(p, Len);
  end;
  ReadLn;
end.

For áéíóúüþëé€ it gives:
Quote
Latin Small Letter A With Acute
Latin Small Letter E With Acute
Latin Small Letter I With Acute
Latin Small Letter O With Acute
Latin Small Letter U With Acute
Latin Small Letter U With Diaeresis
Latin Small Letter Thorn
Latin Small Letter E With Diaeresis
Latin Small Letter E With Acute
Euro Sign

You decide if it helps.

fedkad

Re: Converting non-ASCII characters to closest ASCII [in Linux]
« Reply #13 on: June 22, 2019, 03:45:12 pm »
Quote
I don't know if this still works on newer Windows. The CharMap utility gets the names of characters using a DLL named GetUName:

Thanks, but my question is about Linux. I will use the code I provided in my first post on Windows. I am looking for an equivalent solution for Linux.

engkin

Re: Converting non-ASCII characters to closest ASCII [in Linux]
« Reply #14 on: June 22, 2019, 03:48:09 pm »
What about libiconv?

 
