Lazarus

Free Pascal => Beginners => Topic started by: JLWest on September 19, 2019, 08:13:55 pm

Title: Character Conversions
Post by: JLWest on September 19, 2019, 08:13:55 pm
I'm trying to convert non-ASCII words to ASCII words, without success. Most of the words will be UTF-8, but some are Greek, Arabic and who knows what. There are about 11 to 15 million words. Some are already ASCII.

I don't expect anyone to write the code; just give me an idea of where to start.

I have read quite a bit on this but I don't understand how to implement it in code.


Code: Pascal
  1. function TForm1.ToASCII(AString : String) : String;
  2.  Var AWord : AnsiString;
  3.  Begin
  4.   AWord := AString;
  5.  
  6.   // What has to go here to achieve this?
  7.  
  8.   Result := AWord;
  9.  end;
Title: Re: Character Conversions
Post by: Handoko on September 19, 2019, 08:30:20 pm
I think I understand what you mean, but can you please provide examples of the inputs and the expected outputs, so I can work from them?
Title: Re: Character Conversions
Post by: winni on September 19, 2019, 08:44:52 pm
Have a look at this discussion:

https://forum.lazarus.freepascal.org/index.php/topic,45802.msg324361.html?PHPSESSID=4daqvc2e146snr2c9f5dd2trf4#msg324361

Winni
Title: Re: Character Conversions
Post by: JLWest on September 19, 2019, 09:00:18 pm
Quote from: Handoko
I think I understand what you mean, but can you please provide examples of the inputs and the expected outputs, so I can work from them?

Input:                                            Output
Les Bruyères                                Les Bruyeres
Centre Médical Héliporté               Centre Medical Heliporte
Vésale Heliport
Saïss Airport
Fès-Boulemane
Léopold
Kédougou
Cesária
Évora
São
Ploče
Otočac
Čakovec
Almería
León
León
Logroño-Agoncillo
Suárez
Compiègne
Tréport
Périgueux
Targé
Châtellerault
Épernay
Pápa
Pécs-Pogány
Győr-Pér
Pér


Hope this answers your question. The list on the left is the who-knows-what input and the list on the right is the ASCII output I want.
Code: Text
  1. Les Bruyères
  2. Centre Médical Héliporté
  3. Vésale Heliport
  4. Saïss Airport
  5. Fès-Boulemane
  6. Léopold
  7. Kédougou
  8. Cesária
  9. Évora
  10. São
  11. Ploče
  12. Otočac
  13. Čakovec
  14. Almería
  15. León
  16. León
  17. Logroño-Agoncillo
  18. Suárez
  19. Compiègne
  20. Tréport
  21. Périgueux
  22. Targé
  23. Châtellerault
  24. Épernay
  25. Pápa
  26. Pécs-Pogány
  27. Győr-Pér
  28. Pér
Title: Re: Character Conversions
Post by: howardpc on September 19, 2019, 11:05:25 pm
Try the following:
Code: Pascal
  1. program Project1;
  2.  
  3. {$mode objfpc}{$H+}{$IfDef windows}
  4. {$AppType console}
  5. {$EndIf}
  6.  
  7. uses
  8.    iconvenc, Types, LazUTF8;
  9.  
  10. function ConvertToAscii(aUTF8Text: String): String;
  11. var
  12.   s, tmp: String;
  13.   p: PChar;
  14.   pEnd: PChar;
  15.   i, j: Integer;
  16. begin
  17.   Result := '';
  18.   p := PChar(aUTF8Text);
  19.   pEnd := p + Length(aUTF8Text);
  20.   if p = pEnd then Exit;  // empty input: Result is already ''
  21.   repeat
  22.     i := UTF8CodepointSize(p);
  23.     case i of
  24.       1: Result += p^;
  25.       else
  26.         begin
  27.           SetLength(s, i);
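  28.           for j := 1 to i do
  29.             // copy byte j of the codepoint without moving p
  30.             s[j] := p[j-1];
  31.           // step to the codepoint's last byte; the Inc(p) below moves past it
  32.           Inc(p, i-1);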
  33.           Iconvert(s, tmp, 'UTF-8', 'ASCII//TRANSLIT');
  34.           Result += tmp[1];
  35.         end;
  36.     end;
  37.     Inc(p);
  38.   until p >= pEnd;
  39. end;
  40.  
  41. var
  42.   strs: TStringDynArray;
  43.   s: String;
  44.  
  45. begin
  46.   strs := TStringDynArray.Create('Les Bruyères', 'Centre Médical Héliporté',
  47.                                  'Vésale Heliport', 'Saïss Airport',
  48.                                  'Fès-Boulemane', 'Léopold', 'Kédougou',
  49.                                  'Cesária', 'Évora', 'São', 'Ploče', 'Otočac',
  50.                                  'Čakovec', 'Almería', 'León', 'León',
  51.                                  'Logroño-Agoncillo', 'Suárez', 'Compiègne',
  52.                                  'Tréport', 'Périgueux', 'Targé', 'Châtellerault',
  53.                                  'Épernay', 'Pápa', 'Pécs-Pogány', 'Győr-Pér', 'Pér');
  54.   for s in strs do
  55.     WriteLn(ConvertToAscii(s));
  56.   Readln;
  57. end.

It gives the following output on Linux (not tested on Windows):
Code: Text
  1. Les Bruyeres
  2. Centre Medical Heliporte
  3. Vesale Heliport
  4. Saiss Airport
  5. Fes-Boulemane
  6. Leopold
  7. Kedougou
  8. Cesaria
  9. Evora
  10. Sao
  11. Ploce
  12. Otocac
  13. Cakovec
  14. Almeria
  15. Leon
  16. Leon
  17. Logrono-Agoncillo
  18. Suarez
  19. Compiegne
  20. Treport
  21. Perigueux
  22. Targe
  23. Chatellerault
  24. Epernay
  25. Papa
  26. Pecs-Pogany
  27. Gyor-Per
  28. Per
Title: Re: Character Conversions
Post by: Birger52 on September 20, 2019, 12:12:58 am
Why?

ASCII only has 128 characters (codes 0 to 127).
http://www.asciitable.com/
So what you want is not possible.

Some extended "ASCII" character sets include some "foreign" characters, but they are not strict ASCII: they use 8 bits where ASCII uses 7, and the extra characters occupy codes 128-255. You will need to interpret the result according to which set has the character(s) you want.
(Code pages - https://en.wikipedia.org/wiki/Code_page or https://docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers)
Title: Re: Character Conversions
Post by: JLWest on September 20, 2019, 12:59:28 am


OK, let's say up to 255, the extended character set.

Why is a long story.

I'm trying to work with some very large (11 million lines) and old (20 years) text files. They were submitted by users from all over the world with all kinds of character sets. They don't display right, they don't sort right, and if statements don't always work.

If Parm = 'X' then do-something; This statement didn't work in a program because Parm displays as an 'X' but is from a different character set. I can edit the file and make it work, but that's not a solution for 11 million lines.
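
One way to see why such a comparison fails is to print the raw bytes of the value. The sketch below is only an illustration: the thread does not say which look-alike character was in the file, so the Cyrillic capital Kha (U+0425) is an assumed example, and BytesAsHex is a hypothetical helper name. The non-ASCII character is written as byte escapes so the encoding of the source file does not matter.

```pascal
program DumpBytes;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Returns the bytes of S as space-separated hex pairs.
function BytesAsHex(const S: String): String;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[i]), 2);
  end;
end;

begin
  // Latin capital X (U+0058): a single ASCII byte.
  WriteLn(BytesAsHex('X'));        // 58
  // Assumed look-alike: Cyrillic capital Kha (U+0425) as its two UTF-8 bytes.
  WriteLn(BytesAsHex(#$D0#$A5));   // D0 A5
end.
```

Both values look like an X on screen, but only the first one is equal to the literal 'X'.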

Have to convert to something.

The Parm value was read in from file. 

" Var
Title: Re: Character Conversions
Post by: winni on September 20, 2019, 01:09:51 am
@Birger52

Yes, that's all true.

But the question was how to do the "impossible".

There are situations where pure 7-bit ASCII is needed, and nothing else, mostly because of old software. In Germany some banks still use software today that cannot handle äöüÄÖÜ! A customer called "Müller" is always printed as "M ller"!

So for Europe the question is how to get rid of all those little marks above, and sometimes below, the characters, keeping the text readable rather than replacing them with a gap.

For this reason there are geo databases around with two (or more) fields for the name of a city: Field1 is the local name in UTF-8, and Field2 is the name in ASCII.

Hope this clarifies the situation.

Winni

Title: Re: Character Conversions
Post by: dbannon on September 20, 2019, 03:49:14 am
Quote from: winni
A customer called "Müller" is always printed as "M ller"!

Indeed. Now, you could, for example, replace Müller with Muller by having a lookup table that replaces Unicode characters with an acceptable approximation. But if I were Mr Müller I think I'd be even more upset: leaving a space sort of acknowledges it's wrong, whereas using a "u" is renaming that person. Perhaps that's what Iconvert() does?

And there are very, very many Unicode characters that don't have a reasonable approximation at all. So you end up with text that contains, e.g., a "?" or a space, as winni mentions. It would be sort of readable, sortable and consistent. But ugly.

Spotting the UTF-8 characters is easy, see https://wiki.freepascal.org/UTF8_strings_and_characters - it's a policy decision what to do with them, not a coding one.

Davo
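
As a rough illustration of the "spotting" step: in UTF-8 every byte of a multi-byte codepoint is $80 or higher, so one pass over the bytes tells you whether a line needs any treatment at all. This is a minimal sketch, not code from the wiki page; IsPureAscii is a hypothetical name, and the é is written as byte escapes so the source file encoding does not matter.

```pascal
program FindNonAscii;

{$mode objfpc}{$H+}

// True when every byte of S is below $80, i.e. S is already 7-bit ASCII.
function IsPureAscii(const S: String): Boolean;
var
  i: Integer;
begin
  for i := 1 to Length(S) do
    if Ord(S[i]) >= $80 then
      Exit(False);
  Result := True;
end;

begin
  WriteLn(IsPureAscii('Vesale Heliport'));            // TRUE
  // #$C3#$A9 is the UTF-8 encoding of é (U+00E9).
  WriteLn(IsPureAscii('V'#$C3#$A9'sale Heliport'));   // FALSE
end.
```

On a file where some lines are already ASCII, a check like this lets you skip those lines and only convert the rest.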
Title: Re: Character Conversions
Post by: howardpc on September 20, 2019, 05:09:19 am
Note that the code I cobbled together is for valid utf8 text. Visual inspection of the small data sample you provided showed it was OK.
For unseen data from unknown sources (such as processing text served from an online database) a more robust, but slower, routine would need to insert
Code: Pascal
  1. UTF8FixBroken(aUTF8Text);
 

as the first line.
Of course output from invalid utf8 text is at best ?? and at worst simply garbage. However, the routine should not crash if fed unsuitable data.
Title: Re: Character Conversions
Post by: JLWest on September 20, 2019, 07:03:31 am
I can't get the code to compile.

It gives me an error on line 33.
  Iconvert(s, tmp, 'UTF-8', 'ASCII//TRANSLIT');  <-- can't find this

It appears to be a Unix thing.

Any ideas?
Title: Re: Character Conversions
Post by: howardpc on September 20, 2019, 10:09:13 am
For Windows you could try the open source GnuWin32 library, which provides dlls which include iconv as on Linux.
I have not tried this myself, but see here (http://gnuwin32.sourceforge.net/packages/libiconv.htm)
There may be some other built-in Windows solution that I am not aware of. These days I only use Windows if I am forced to.
However, the majority of forum users are Windows users (well, the majority of Lazarus/fpc downloaders are Windows users which would tend to indicate a similar ratio for the forum), and so others may offer simpler solutions.
Title: Re: Character Conversions
Post by: Birger52 on September 20, 2019, 11:26:46 am
I still don't get it.

You cannot convert UTF-8 to ASCII, other than the first 128 characters.
What you want is to represent the non-ASCII characters by some combination of ASCII characters that keeps them readable.
For example, å in Danish can be represented by aa.

It seems the way to go, then, would be a table lookup for the individual characters.
There are 1,112,064 possible UTF-8 characters, and you would probably have to create the table yourself.
Still only one tenth of having to correct all the lines manually...
;)
It could maybe be simplified by reading the string as bytes...



Title: Re: Character Conversions
Post by: howardpc on September 20, 2019, 12:32:53 pm
Quote from: Birger52
I still don't get it.

You cannot convert UTF-8 to ASCII, other than the first 128 characters.
What you want is to represent the non-ASCII characters by some combination of ASCII characters that keeps them readable.
For example, å in Danish can be represented by aa.

You cannot "convert" the first 128 UTF-8 codepoints to ASCII.

They are already ASCII.

What JLWest is after is a "conversion" of other utf8 codepoints to degrade them to look as close as possible to a single existing ASCII character.
Obviously it is not possible to do this with most of the unicode range. That is a given.

So in this limited exercise, the Danish å would become a (not aa). An emoji such as %) would be omitted (or simply produce garbage).

But that does not matter. It was never intended that this exercise should include emojis or countless other codepoints that lack an obvious ASCII "equivalent".

Title: Re: Character Conversions
Post by: bytebites on September 20, 2019, 03:36:00 pm
Would the FoldStringW function help?
https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-foldstringw
Title: Re: Character Conversions
Post by: Birger52 on September 20, 2019, 04:02:28 pm
I think PHP  actually can do what you want.
https://www.php.net/manual/en/function.mb-convert-encoding.php

Title: Re: Character Conversions
Post by: marcov on September 20, 2019, 04:21:56 pm
Quote from: JLWest
I can't get the code to compile.

It gives me an error on line 33.
  Iconvert(s, tmp, 'UTF-8', 'ASCII//TRANSLIT');  <-- can't find this

It appears to be a Unix thing.

Iconv doesn't come with Windows; it is an additional DLL.

There are variants of that DLL though, depending on which Unix emulation for Windows you use (MinGW, Cygwin, etc.), and of course 32-bit and 64-bit.

Moreover, it seems that the exact workings of the library are a bit different on Windows (or only in some versions?), see https://bugs.freepascal.org/view.php?id=20531 .

This is why the iconv interface unit hasn't been enabled for Windows: nobody has tested it, and nobody provides fairly universally accepted DLLs.

Personally I would see if I could make a mapping based on the Unicode tables (shipped with FPC as part of the rtl-unicode package, which plugs into unit character).

Note that the rules for removing accents might depend on language/country.
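
To make the last point concrete, here is a minimal sketch of a replacement function whose result depends on a chosen rule set, and which may return more than one ASCII character. Everything in it is an assumption made for illustration: the TTranslitRule type, the ReplaceByRule name and the two entries (ü as 'ue' under a German rule, å as 'aa' under a Danish rule, as mentioned earlier in the thread). The input sequences are written as UTF-8 byte escapes.

```pascal
program LanguageRules;

{$mode objfpc}{$H+}

type
  // Hypothetical set of policies; a real project would choose its own.
  TTranslitRule = (trGeneric, trGerman, trDanish);

// Maps one UTF-8 sequence to an ASCII replacement that may be longer than
// one character. Sequences not handled here come back as '?'.
function ReplaceByRule(const Seq: String; Rule: TTranslitRule): String;
begin
  Result := '?';
  if Seq = #$C3#$BC then            // ü (U+00FC)
  begin
    if Rule = trGerman then
      Result := 'ue'
    else
      Result := 'u';
  end
  else if Seq = #$C3#$A5 then       // å (U+00E5)
  begin
    if Rule = trDanish then
      Result := 'aa'
    else
      Result := 'a';
  end;
end;

begin
  WriteLn('M', ReplaceByRule(#$C3#$BC, trGerman), 'ller');    // Mueller
  WriteLn('M', ReplaceByRule(#$C3#$BC, trGeneric), 'ller');   // Muller
  WriteLn(ReplaceByRule(#$C3#$A5, trDanish));                 // aa
end.
```

Returning a String instead of a Char is the design choice that matters here: a single-character result cannot express 'ue' or 'aa'.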
Title: Re: Character Conversions
Post by: winni on September 20, 2019, 05:17:41 pm
@marcov

Yes, a table to simplify UTF-8 down to ASCII would be great. And for this issue we don't need codepoints and other stuff. Just this:

Code: Pascal
  1. type
  2.   TUtf8toASCII = record
  3.     CharIn : TUTF8Char;
  4.     CharOut: Char;
  5.   end;
  6.   TUtf8toASCIIArray = array of TUtf8toASCII;
And then fill the array with data. Is this a hard job, or is the data somewhere around in the RTL?

Winni
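
A minimal sketch of how such a table could be filled and searched, under some stated assumptions: the input type is declared locally as a short string (TUtf8Seq, a hypothetical name) so the example needs no LCL unit, the table is a constant array rather than a dynamic one, and the three entries are only examples written as UTF-8 byte escapes.

```pascal
program TableLookup;

{$mode objfpc}{$H+}

type
  TUtf8Seq = String[7];   // raw bytes of one UTF-8 codepoint

  TUtf8ToAscii = record
    CharIn: TUtf8Seq;
    CharOut: Char;
  end;

const
  // Example entries only: é (C3 A9), ñ (C3 B1), č (C4 8D).
  Table: array[0..2] of TUtf8ToAscii = (
    (CharIn: #$C3#$A9; CharOut: 'e'),
    (CharIn: #$C3#$B1; CharOut: 'n'),
    (CharIn: #$C4#$8D; CharOut: 'c')
  );

// Linear search; returns '?' when the sequence is not in the table.
function Lookup(const Seq: String): Char;
var
  i: Integer;
begin
  for i := Low(Table) to High(Table) do
    if Table[i].CharIn = Seq then
      Exit(Table[i].CharOut);
  Result := '?';
end;

begin
  WriteLn(Lookup(#$C3#$B1));   // n
  WriteLn(Lookup(#$C5#$91));   // ? (ő is not in this tiny table)
end.
```

A linear search is fine for a few hundred entries; for a much larger table, sorting it once and using a binary search would be the obvious next step.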
Title: Re: Character Conversions
Post by: marcov on September 20, 2019, 05:28:04 pm
Unicode systems are generally utf16, but that is not really the problem.

The units with tables are in packages/rtl-unicode and the generators in utils/unicode.

The original tables can be downloaded from the Unicode consortium. Maybe you can modify one of the generators to create the table you want. 
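
As a hedged sketch of what deriving such a table from the consortium data might involve: UnicodeData.txt has one semicolon-separated line per codepoint, and the sixth field holds the decomposition mapping. The function below (AsciiBaseOf, a hypothetical name) takes one such line and returns the base letter when the canonical decomposition starts with an ASCII codepoint. It handles only one level of decomposition and is not taken from the FPC generators.

```pascal
program DecompSketch;

{$mode objfpc}{$H+}

uses
  SysUtils, Classes;

// Returns the ASCII base letter from the canonical decomposition field of
// one UnicodeData.txt line, or #0 when there is none.
function AsciiBaseOf(const Line: String): Char;
var
  Fields: TStringList;
  Decomp: String;
  SpacePos, Base: Integer;
begin
  Result := #0;
  Fields := TStringList.Create;
  try
    Fields.StrictDelimiter := True;   // split on ';' only, keep spaces
    Fields.Delimiter := ';';
    Fields.DelimitedText := Line;
    if Fields.Count < 6 then
      Exit;
    Decomp := Fields[5];
    // Skip empty fields and tagged mappings such as '<compat> 0020 0308'.
    if (Decomp = '') or (Decomp[1] = '<') then
      Exit;
    // The first codepoint in the mapping is the base character.
    SpacePos := Pos(' ', Decomp);
    if SpacePos > 0 then
      Decomp := Copy(Decomp, 1, SpacePos - 1);
    Base := StrToIntDef('$' + Decomp, -1);
    if (Base >= 0) and (Base < $80) then
      Result := Chr(Base);
  finally
    Fields.Free;
  end;
end;

begin
  // A line in the UnicodeData.txt format for é (U+00E9).
  WriteLn(AsciiBaseOf('00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9'));   // e
end.
```

Characters whose decomposition starts with another accented letter would need the lookup repeated on that letter, and characters with no decomposition at all (ø or đ, for instance) still need hand-written entries.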

Title: Re: Character Conversions
Post by: JLWest on September 20, 2019, 07:48:24 pm
So if I understand this right, you are proposing an array of records, each holding a UTF-8 character and its ASCII replacement. I would then search the array for each UTF-8 character and replace it with the ASCII one. That might work in most cases for most words.

 
Title: Re: Character Conversions
Post by: JLWest on September 20, 2019, 07:50:50 pm
Quote from: Birger52
I think PHP  actually can do what you want.
https://www.php.net/manual/en/function.mb-convert-encoding.php

Maybe it could. I couldn't tell, and I don't know PHP.
Title: Re: Character Conversions
Post by: winni on September 20, 2019, 08:18:20 pm
@JLWest

Yes, you got me!

Winni
Title: Re: Character Conversions
Post by: howardpc on September 21, 2019, 02:02:06 pm
Here's an example using a simple function that should work on Windows (not tested) which does not use iconvenc.

The NameChrToASCII function works fine on Linux, and has no Linux-specific dependencies, so it should be OK on Windows. It is far from comprehensive, but certainly covers all accented Unicode codepoints in JLWest's example.

Code: Pascal
  1. program TestUTF8ToASCII;
  2.  
  3. {$mode objfpc}{$H+}
  4. {$IfDef windows}
  5. {$AppType console}
  6. {$EndIf}
  7.  
  8. uses
  9.   Classes, LazUTF8, Types;
  10.  
  11. function NameChrToASCII(aUTF8Codepoint: String): Char;
  12. begin
  13.   if Length(aUTF8Codepoint) > 2 then
  14.     Exit('?');
  15.   case aUTF8Codepoint of
  16.     'À': Exit('A');
  17.     'Á': Exit('A');
  18.     'Â': Exit('A');
  19.     'Ã': Exit('A');
  20.     'Ä': Exit('A');
  21.     'Å': Exit('A');
  22.     'Æ': Exit('A');
  23.     'Ç': Exit('C');
  24.     'È': Exit('E');
  25.     'É': Exit('E');
  26.     'Ê': Exit('E');
  27.     'Ë': Exit('E');
  28.     'Ì': Exit('I');
  29.     'Í': Exit('I');
  30.     'Î': Exit('I');
  31.     'Ï': Exit('I');
  32.     'Ð': Exit('D');
  33.     'Ñ': Exit('N');
  34.     'Ò': Exit('O');
  35.     'Ó': Exit('O');
  36.     'Ô': Exit('O');
  37.     'Õ': Exit('O');
  38.     'Ö': Exit('O');
  39.     '×': Exit('x');
  40.     'Ø': Exit('O');
  41.     'Ù': Exit('U');
  42.     'Ú': Exit('U');
  43.     'Û': Exit('U');
  44.     'Ü': Exit('U');
  45.     'Ý': Exit('Y');
  46.     'Þ': Exit('T');
  47.     'ß': Exit('s');
  48.     'à': Exit('a');
  49.     'á': Exit('a');
  50.     'â': Exit('a');
  51.     'ã': Exit('a');
  52.     'ä': Exit('a');
  53.     'å': Exit('a');
  54.     'æ': Exit('a');
  55.     'ç': Exit('c');
  56.     'è': Exit('e');
  57.     'é': Exit('e');
  58.     'ê': Exit('e');
  59.     'ë': Exit('e');
  60.     'ì': Exit('i');
  61.     'í': Exit('i');
  62.     'î': Exit('i');
  63.     'ï': Exit('i');
  64.     'ð': Exit('d');
  65.     'ñ': Exit('n');
  66.     'ò': Exit('o');
  67.     'ó': Exit('o');
  68.     'ô': Exit('o');
  69.     'õ': Exit('o');
  70.     'ö': Exit('o');
  71.     'ø': Exit('o');
  72.     'ù': Exit('u');
  73.     'ú': Exit('u');
  74.     'û': Exit('u');
  75.     'ü': Exit('u');
  76.     'ý': Exit('y');
  77.     'þ': Exit('t');
  78.     'ÿ': Exit('y');
  79.     'Ā': Exit('A');
  80.     'ā': Exit('a');
  81.     'Ă': Exit('A');
  82.     'ă': Exit('a');
  83.     'Ą': Exit('A');
  84.     'ą': Exit('a');
  85.     'Ć': Exit('C');
  86.     'ć': Exit('c');
  87.     'Ĉ': Exit('C');
  88.     'ĉ': Exit('c');
  89.     'Ċ': Exit('C');
  90.     'ċ': Exit('c');
  91.     'Č': Exit('C');
  92.     'č': Exit('c');
  93.     'Ď': Exit('D');
  94.     'ď': Exit('d');
  95.     'Đ': Exit('D');
  96.     'đ': Exit('d');
  97.     'Ē': Exit('E');
  98.     'ē': Exit('e');
  99.     'Ĕ': Exit('E');
  100.     'ĕ': Exit('e');
  101.     'Ė': Exit('E');
  102.     'ė': Exit('e');
  103.     'Ę': Exit('E');
  104.     'ę': Exit('e');
  105.     'Ě': Exit('E');
  106.     'ě': Exit('e');
  107.     'Ĝ': Exit('G');
  108.     'ĝ': Exit('g');
  109.     'Ğ': Exit('G');
  110.     'ğ': Exit('g');
  111.     'Ġ': Exit('G');
  112.     'ġ': Exit('g');
  113.     'Ģ': Exit('G');
  114.     'ģ': Exit('g');
  115.     'Ĥ': Exit('H');
  116.     'ĥ': Exit('h');
  117.     'Ħ': Exit('H');
  118.     'ħ': Exit('h');
  119.     'Ĩ': Exit('I');
  120.     'ĩ': Exit('i');
  121.     'Ī': Exit('I');
  122.     'ī': Exit('i');
  123.     'Ĭ': Exit('I');
  124.     'ĭ': Exit('i');
  125.     'Į': Exit('I');
  126.     'į': Exit('i');
  127.     'İ': Exit('I');
  128.     'ı': Exit('i');
  129.     'Ĳ': Exit('I');
  130.     'ĳ': Exit('i');
  131.     'Ĵ': Exit('J');
  132.     'ĵ': Exit('j');
  133.     'Ķ': Exit('K');
  134.     'ķ': Exit('k');
  135.     'ĸ': Exit('q');
  136.     'Ĺ': Exit('L');
  137.     'ĺ': Exit('l');
  138.     'Ļ': Exit('L');
  139.     'ļ': Exit('l');
  140.     'Ľ': Exit('L');
  141.     'ľ': Exit('l');
  142.     'Ŀ': Exit('L');
  143.     'ŀ': Exit('l');
  144.     'Ł': Exit('L');
  145.     'ł': Exit('l');
  146.     'Ń': Exit('N');
  147.     'ń': Exit('n');
  148.     'Ņ': Exit('N');
  149.     'ņ': Exit('n');
  150.     'Ň': Exit('N');
  151.     'ň': Exit('n');
  152.     'Ŋ': Exit('N');
  153.     'ŋ': Exit('n');
  154.     'Ō': Exit('O');
  155.     'ō': Exit('o');
  156.     'Ŏ': Exit('O');
  157.     'ŏ': Exit('o');
  158.     'Ő': Exit('O');
  159.     'ő': Exit('o');
  160.     'Œ': Exit('O');
  161.     'œ': Exit('o');
  162.     'Ŕ': Exit('R');
  163.     'ŕ': Exit('r');
  164.     'Ŗ': Exit('R');
  165.     'ŗ': Exit('r');
  166.     'Ř': Exit('R');
  167.     'ř': Exit('r');
  168.     'Ś': Exit('S');
  169.     'ś': Exit('s');
  170.     'Ŝ': Exit('S');
  171.     'ŝ': Exit('s');
  172.     'Ş': Exit('S');
  173.     'ş': Exit('s');
  174.     'Š': Exit('S');
  175.     'š': Exit('s');
  176.     'Ţ': Exit('T');
  177.     'ţ': Exit('t');
  178.     'Ť': Exit('T');
  179.     'ť': Exit('t');
  180.     'Ŧ': Exit('T');
  181.     'ŧ': Exit('t');
  182.     'Ũ': Exit('U');
  183.     'ũ': Exit('u');
  184.     'Ū': Exit('U');
  185.     'ū': Exit('u');
  186.     'Ŭ': Exit('U');
  187.     'ŭ': Exit('u');
  188.     'Ů': Exit('U');
  189.     'ů': Exit('u');
  190.     'Ű': Exit('U');
  191.     'ű': Exit('u');
  192.     'Ų': Exit('U');
  193.     'ų': Exit('u');
  194.     'Ŵ': Exit('W');
  195.     'ŵ': Exit('w');
  196.     'Ŷ': Exit('Y');
  197.     'ŷ': Exit('y');
  198.     'Ÿ': Exit('Y');
  199.     'Ź': Exit('Z');
  200.     'ź': Exit('z');
  201.     'Ż': Exit('Z');
  202.     'ż': Exit('z');
  203.     'Ž': Exit('Z');
  204.     'ž': Exit('z');
  205.     'ſ': Exit('s');
  206.     'ƀ': Exit('b');
  207.     'Ɓ': Exit('B');
  208.     'Ƃ': Exit('B');
  209.     'ƃ': Exit('b');
  210.     'Ƈ': Exit('C');
  211.     'ƈ': Exit('c');
  212.     'Ɖ': Exit('D');
  213.     'Ɗ': Exit('D');
  214.     'Ƌ': Exit('D');
  215.     'ƌ': Exit('d');
  216.     'Ɛ': Exit('E');
  217.     'Ƒ': Exit('F');
  218.     'ƒ': Exit('f');
  219.     'Ɠ': Exit('G');
  220.     'ƕ': Exit('h');
  221.     'Ɩ': Exit('I');
  222.     'Ɨ': Exit('I');
  223.     'Ƙ': Exit('K');
  224.     'ƙ': Exit('k');
  225.     'ƚ': Exit('l');
  226.     'Ɲ': Exit('N');
  227.     'ƞ': Exit('n');
  228.     'Ơ': Exit('O');
  229.     'ơ': Exit('o');
  230.     'Ƣ': Exit('O');
  231.     'ƣ': Exit('o');
  232.     'Ƥ': Exit('P');
  233.     'ƥ': Exit('p');
  234.     'ƫ': Exit('t');
  235.     'Ƭ': Exit('T');
  236.     'ƭ': Exit('t');
  237.     'Ʈ': Exit('T');
  238.     'Ư': Exit('U');
  239.     'ư': Exit('u');
  240.     'Ʋ': Exit('V');
  241.     'Ƴ': Exit('Y');
  242.     'ƴ': Exit('y');
  243.     'Ƶ': Exit('Z');
  244.     'ƶ': Exit('z');
  245.     'Ǉ': Exit('L');
  246.     'ǈ': Exit('L');
  247.     'ǉ': Exit('l');
  248.     'Ǌ': Exit('N');
  249.     'ǋ': Exit('N');
  250.     'ǌ': Exit('n');
  251.     'Ǎ': Exit('A');
  252.     'ǎ': Exit('a');
  253.     'Ǐ': Exit('I');
  254.     'ǐ': Exit('i');
  255.     'Ǒ': Exit('O');
  256.     'ǒ': Exit('o');
  257.     'Ǔ': Exit('U');
  258.     'ǔ': Exit('u');
  259.     'Ǖ': Exit('U');
  260.     'ǖ': Exit('u');
  261.     'Ǘ': Exit('U');
  262.     'ǘ': Exit('u');
  263.     'Ǚ': Exit('U');
  264.     'ǚ': Exit('u');
  265.     'Ǜ': Exit('U');
  266.     'ǜ': Exit('u');
  267.     'Ǟ': Exit('A');
  268.     'ǟ': Exit('a');
  269.     'Ǡ': Exit('A');
  270.     'ǡ': Exit('a');
  271.     'Ǣ': Exit('A');
  272.     'ǣ': Exit('a');
  273.     'Ǥ': Exit('G');
  274.     'ǥ': Exit('g');
  275.     'Ǧ': Exit('G');
  276.     'ǧ': Exit('g');
  277.     'Ǩ': Exit('K');
  278.     'ǩ': Exit('k');
  279.     'Ǫ': Exit('O');
  280.     'ǫ': Exit('o');
  281.     'Ǭ': Exit('O');
  282.     'ǭ': Exit('o');
  283.     'ǰ': Exit('j');
  284.     'Ǳ': Exit('D');
  285.     'ǲ': Exit('D');
  286.     'ǳ': Exit('d');
  287.     'Ǵ': Exit('G');
  288.     'ǵ': Exit('g');
  289.     'Ǹ': Exit('N');
  290.     'ǹ': Exit('n');
  291.     'Ǻ': Exit('A');
  292.     'ǻ': Exit('a');
  293.     'Ǽ': Exit('A');
  294.     'ǽ': Exit('a');
  295.     'Ǿ': Exit('O');
  296.     'ǿ': Exit('o');
  297.     'Ȁ': Exit('A');
  298.     'ȁ': Exit('a');
  299.     'Ȃ': Exit('A');
  300.     'ȃ': Exit('a');
  301.     'Ȅ': Exit('E');
  302.     'ȅ': Exit('e');
  303.     'Ȇ': Exit('E');
  304.     'ȇ': Exit('e');
  305.     'Ȉ': Exit('I');
  306.     'ȉ': Exit('i');
  307.     'Ȋ': Exit('I');
  308.     'ȋ': Exit('i');
  309.     'Ȍ': Exit('O');
  310.     'ȍ': Exit('o');
  311.     'Ȏ': Exit('O');
  312.     'ȏ': Exit('o');
  313.     'Ȑ': Exit('R');
  314.     'ȑ': Exit('r');
  315.     'Ȓ': Exit('R');
  316.     'ȓ': Exit('r');
  317.     'Ȕ': Exit('U');
  318.     'ȕ': Exit('u');
  319.     'Ȗ': Exit('U');
  320.     'ȗ': Exit('u');
  321.     'Ș': Exit('S');
  322.     'ș': Exit('s');
  323.     'Ț': Exit('T');
  324.     'ț': Exit('t');
  325.     'Ȟ': Exit('H');
  326.     'ȟ': Exit('h');
  327.     'ȡ': Exit('d');
  328.     'Ȥ': Exit('Z');
  329.     'ȥ': Exit('z');
  330.     'Ȧ': Exit('A');
  331.     'ȧ': Exit('a');
  332.     'Ȩ': Exit('E');
  333.     'ȩ': Exit('e');
  334.     'Ȫ': Exit('O');
  335.     'ȫ': Exit('o');
  336.     'Ȭ': Exit('O');
  337.     'ȭ': Exit('o');
  338.     'Ȯ': Exit('O');
  339.     'ȯ': Exit('o');
  340.     'Ȱ': Exit('O');
  341.     'ȱ': Exit('o');
  342.     'Ȳ': Exit('Y');
  343.     'ȳ': Exit('y');
  344.     'ȴ': Exit('l');
  345.     'ȵ': Exit('n');
  346.     'ȶ': Exit('t');
  347.     'ȷ': Exit('j');
  348.     'ȸ': Exit('d');
  349.     'ȹ': Exit('q');
  350.     'Ⱥ': Exit('A');
  351.     'Ȼ': Exit('C');
  352.     'ȼ': Exit('c');
  353.     'Ƚ': Exit('L');
  354.     'Ⱦ': Exit('T');
  355.     'ȿ': Exit('s');
  356.     'ɀ': Exit('z');
  357.     'Ƀ': Exit('B');
  358.     'Ʉ': Exit('U');
  359.     'Ɇ': Exit('E');
  360.     'ɇ': Exit('e');
  361.     'Ɉ': Exit('J');
  362.     'ɉ': Exit('j');
  363.     'Ɍ': Exit('R');
  364.     'ɍ': Exit('r');
  365.     'Ɏ': Exit('Y');
  366.     'ɏ': Exit('y');
  367.     'ɓ': Exit('b');
  368.     'ɕ': Exit('c');
  369.     'ɖ': Exit('d');
  370.     'ɗ': Exit('d');
  371.     'ɛ': Exit('e');
  372.     'ɟ': Exit('j');
  373.     'ɠ': Exit('g');
  374.     'ɡ': Exit('g');
  375.     'ɢ': Exit('G');
  376.     'ɦ': Exit('h');
  377.     'ɧ': Exit('h');
  378.     'ɨ': Exit('i');
  379.     'ɪ': Exit('I');
  380.     'ɫ': Exit('l');
  381.     'ɬ': Exit('l');
  382.     'ɭ': Exit('l');
  383.     'ɱ': Exit('m');
  384.     'ɲ': Exit('n');
  385.     'ɳ': Exit('n');
  386.     'ɴ': Exit('N');
  387.     'ɶ': Exit('O');
  388.     'ɼ': Exit('r');
  389.     'ɽ': Exit('r');
  390.     'ɾ': Exit('r');
  391.     'ʀ': Exit('R');
  392.     'ʂ': Exit('s');
  393.     'ʈ': Exit('t');
  394.     'ʉ': Exit('u');
  395.     'ʋ': Exit('v');
  396.     'ʏ': Exit('Y');
  397.     'ʐ': Exit('z');
  398.     'ʑ': Exit('z');
  399.     'ʙ': Exit('B');
  400.     'ʛ': Exit('G');
  401.     'ʜ': Exit('H');
  402.     'ʝ': Exit('j');
  403.     'ʟ': Exit('L');
  404.     'ʠ': Exit('q');
  405.     'ʣ': Exit('d');
  406.     'ʥ': Exit('d');
  407.     'ʦ': Exit('t');
  408.     'ʪ': Exit('l');
  409.     'ʫ': Exit('l');
  410.     'ʰ': Exit('h');
  411.     'ʲ': Exit('j');
  412.     'ʳ': Exit('r');
  413.     'ʷ': Exit('w');
  414.     'ʸ': Exit('y');
  415.     'ˡ': Exit('l');
  416.     'ˢ': Exit('s');
  417.     'ˣ': Exit('x');
  418.     else Exit('?');
  419.   end;
  420. end;
  421.  
  422. function AccentedNameToAscii(aUTF8Name: String): String;
  423. var
  424.   p, pEnd: Pchar;
  425.   i, j: Integer;
  426.   s: String;
  427. begin
  428.   Result := '';
  429.   p := PChar(aUTF8Name);
  430.   pEnd := p + Length(aUTF8Name);
  431.   if p = pEnd then Exit;  // empty input: Result is already ''
  432.   repeat
  433.     i := UTF8CodepointSize(p);
  434.     case i of
  435.       1: Result += p^;
  436.       else
  437.         begin
  438.           SetLength(s{%H-}, i);
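  439.           for j := 1 to i do
  440.             // copy byte j of the codepoint without moving p
  441.             s[j] := p[j-1];
  442.           // step to the codepoint's last byte; the Inc(p) below moves past it
  443.           Inc(p, i-1);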
  444.           Result += NameChrToASCII(s);
  445.         end;
  446.     end;
  447.     Inc(p);
  448.   until p >= pEnd;
  449. end;
  450.  
  451. var
  452.   strs: TStringDynArray;
  453.   s: String;
  454.  
  455. begin
  456.   strs := TStringDynArray.Create('Les Bruyères', 'Centre Médical Héliporté',
  457.                                   'Vésale Heliport', 'Saïss Airport',
  458.                                   'Fès-Boulemane', 'Léopold', 'Kédougou',
  459.                                   'Cesária', 'Évora', 'São', 'Ploče', 'Otočac',
  460.                                   'Čakovec', 'Almería', 'León', 'León',
  461.                                   'Logroño-Agoncillo', 'Suárez', 'Compiègne',
  462.                                   'Tréport', 'Périgueux', 'Targé', 'Châtellerault',
  463.                                   'Épernay', 'Pápa', 'Pécs-Pogány', 'Győr-Pér', 'Pér');
  464.   for s in strs do
  465.     Writeln(s,'  ->  "',AccentedNameToAscii(s),'"');
  466.   WriteLn(#10'Press [Enter] to finish');
  467.   ReadLn;
  468. end.
Title: Re: Character Conversions
Post by: valdir.marcos on September 21, 2019, 02:58:29 pm
Quote from: howardpc
Here's an example using a simple function that should work on Windows (not tested) which does not use iconvenc.

The NameChrToASCII function works fine on Linux, and has no Linux-specific dependencies, so it should be OK on Windows. It is far from comprehensive, but certainly covers all accented Unicode codepoints in JLWest's example.

Code: Pascal
  1. program TestUTF8ToASCII;
  2.  
  3. {$mode objfpc}{$H+}
  4. {$IfDef windows}
  5. {$AppType console}
  6. {$EndIf}
  7.  
  8. uses
  9.   Classes, LazUTF8, Types;
  10.  
  11. function NameChrToASCII(aUTF8Codepoint: String): Char;
  12. begin
  13.   if Length(aUTF8Codepoint) > 2 then
  14.     Exit('?');
  15.   case aUTF8Codepoint of
  16.     'À': Exit('A');
  17.     'Á': Exit('A');
  18.     'Â': Exit('A');
  19.     'Ã': Exit('A');
  20.     'Ä': Exit('A');
  21.     'Å': Exit('A');
  22.     'Æ': Exit('A');
  23.     'Ç': Exit('C');
  24.     'È': Exit('E');
  25.     'É': Exit('E');
  26.     'Ê': Exit('E');
  27.     'Ë': Exit('E');
  28.     'Ì': Exit('I');
  29.     'Í': Exit('I');
  30.     'Î': Exit('I');
  31.     'Ï': Exit('I');
  32.     'Ð': Exit('D');
  33.     'Ñ': Exit('N');
  34.     'Ò': Exit('O');
  35.     'Ó': Exit('O');
  36.     'Ô': Exit('O');
    'Õ': Exit('O');
    'Ö': Exit('O');
    '×': Exit('x');
    'Ø': Exit('O');
    'Ù': Exit('U');
    'Ú': Exit('U');
    'Û': Exit('U');
    'Ü': Exit('U');
    'Ý': Exit('Y');
    'Þ': Exit('T');
    'ß': Exit('s');
    'à': Exit('a');
    'á': Exit('a');
    'â': Exit('a');
    'ã': Exit('a');
    'ä': Exit('a');
    'å': Exit('a');
    'æ': Exit('a');
    'ç': Exit('c');
    'è': Exit('e');
    'é': Exit('e');
    'ê': Exit('e');
    'ë': Exit('e');
    'ì': Exit('i');
    'í': Exit('i');
    'î': Exit('i');
    'ï': Exit('i');
    'ð': Exit('d');
    'ñ': Exit('n');
    'ò': Exit('o');
    'ó': Exit('o');
    'ô': Exit('o');
    'õ': Exit('o');
    'ö': Exit('o');
    'ø': Exit('o');
    'ù': Exit('u');
    'ú': Exit('u');
    'û': Exit('u');
    'ü': Exit('u');
    'ý': Exit('y');
    'þ': Exit('t');
    'ÿ': Exit('y');
    'Ā': Exit('A');
    'ā': Exit('a');
    'Ă': Exit('A');
    'ă': Exit('a');
    'Ą': Exit('A');
    'ą': Exit('a');
    'Ć': Exit('C');
    'ć': Exit('c');
    'Ĉ': Exit('C');
    'ĉ': Exit('c');
    'Ċ': Exit('C');
    'ċ': Exit('c');
    'Č': Exit('C');
    'č': Exit('c');
    'Ď': Exit('D');
    'ď': Exit('d');
    'Đ': Exit('D');
    'đ': Exit('d');
    'Ē': Exit('E');
    'ē': Exit('e');
    'Ĕ': Exit('E');
    'ĕ': Exit('e');
    'Ė': Exit('E');
    'ė': Exit('e');
    'Ę': Exit('E');
    'ę': Exit('e');
    'Ě': Exit('E');
    'ě': Exit('e');
    'Ĝ': Exit('G');
    'ĝ': Exit('g');
    'Ğ': Exit('G');
    'ğ': Exit('g');
    'Ġ': Exit('G');
    'ġ': Exit('g');
    'Ģ': Exit('G');
    'ģ': Exit('g');
    'Ĥ': Exit('H');
    'ĥ': Exit('h');
    'Ħ': Exit('H');
    'ħ': Exit('h');
    'Ĩ': Exit('I');
    'ĩ': Exit('i');
    'Ī': Exit('I');
    'ī': Exit('i');
    'Ĭ': Exit('I');
    'ĭ': Exit('i');
    'Į': Exit('I');
    'į': Exit('i');
    'İ': Exit('I');
    'ı': Exit('i');
    'Ĳ': Exit('I');
    'ĳ': Exit('i');
    'Ĵ': Exit('J');
    'ĵ': Exit('j');
    'Ķ': Exit('K');
    'ķ': Exit('k');
    'ĸ': Exit('q');
    'Ĺ': Exit('L');
    'ĺ': Exit('l');
    'Ļ': Exit('L');
    'ļ': Exit('l');
    'Ľ': Exit('L');
    'ľ': Exit('l');
    'Ŀ': Exit('L');
    'ŀ': Exit('l');
    'Ł': Exit('L');
    'ł': Exit('l');
    'Ń': Exit('N');
    'ń': Exit('n');
    'Ņ': Exit('N');
    'ņ': Exit('n');
    'Ň': Exit('N');
    'ň': Exit('n');
    'Ŋ': Exit('N');
    'ŋ': Exit('n');
    'Ō': Exit('O');
    'ō': Exit('o');
    'Ŏ': Exit('O');
    'ŏ': Exit('o');
    'Ő': Exit('O');
    'ő': Exit('o');
    'Œ': Exit('O');
    'œ': Exit('o');
    'Ŕ': Exit('R');
    'ŕ': Exit('r');
    'Ŗ': Exit('R');
    'ŗ': Exit('r');
    'Ř': Exit('R');
    'ř': Exit('r');
    'Ś': Exit('S');
    'ś': Exit('s');
    'Ŝ': Exit('S');
    'ŝ': Exit('s');
    'Ş': Exit('S');
    'ş': Exit('s');
    'Š': Exit('S');
    'š': Exit('s');
    'Ţ': Exit('T');
    'ţ': Exit('t');
    'Ť': Exit('T');
    'ť': Exit('t');
    'Ŧ': Exit('T');
    'ŧ': Exit('t');
    'Ũ': Exit('U');
    'ũ': Exit('u');
    'Ū': Exit('U');
    'ū': Exit('u');
    'Ŭ': Exit('U');
    'ŭ': Exit('u');
    'Ů': Exit('U');
    'ů': Exit('u');
    'Ű': Exit('U');
    'ű': Exit('u');
    'Ų': Exit('U');
    'ų': Exit('u');
    'Ŵ': Exit('W');
    'ŵ': Exit('w');
    'Ŷ': Exit('Y');
    'ŷ': Exit('y');
    'Ÿ': Exit('Y');
    'Ź': Exit('Z');
    'ź': Exit('z');
    'Ż': Exit('Z');
    'ż': Exit('z');
    'Ž': Exit('Z');
    'ž': Exit('z');
    'ſ': Exit('s');
    'ƀ': Exit('b');
    'Ɓ': Exit('B');
    'Ƃ': Exit('B');
    'ƃ': Exit('b');
    'Ƈ': Exit('C');
    'ƈ': Exit('c');
    'Ɖ': Exit('D');
    'Ɗ': Exit('D');
    'Ƌ': Exit('D');
    'ƌ': Exit('d');
    'Ɛ': Exit('E');
    'Ƒ': Exit('F');
    'ƒ': Exit('f');
    'Ɠ': Exit('G');
    'ƕ': Exit('h');
    'Ɩ': Exit('I');
    'Ɨ': Exit('I');
    'Ƙ': Exit('K');
    'ƙ': Exit('k');
    'ƚ': Exit('l');
    'Ɲ': Exit('N');
    'ƞ': Exit('n');
    'Ơ': Exit('O');
    'ơ': Exit('o');
    'Ƣ': Exit('O');
    'ƣ': Exit('o');
    'Ƥ': Exit('P');
    'ƥ': Exit('p');
    'ƫ': Exit('t');
    'Ƭ': Exit('T');
    'ƭ': Exit('t');
    'Ʈ': Exit('T');
    'Ư': Exit('U');
    'ư': Exit('u');
    'Ʋ': Exit('V');
    'Ƴ': Exit('Y');
    'ƴ': Exit('y');
    'Ƶ': Exit('Z');
    'ƶ': Exit('z');
    'Ǉ': Exit('L');
    'ǈ': Exit('L');
    'ǉ': Exit('l');
    'Ǌ': Exit('N');
    'ǋ': Exit('N');
    'ǌ': Exit('n');
    'Ǎ': Exit('A');
    'ǎ': Exit('a');
    'Ǐ': Exit('I');
    'ǐ': Exit('i');
    'Ǒ': Exit('O');
    'ǒ': Exit('o');
    'Ǔ': Exit('U');
    'ǔ': Exit('u');
    'Ǖ': Exit('U');
    'ǖ': Exit('u');
    'Ǘ': Exit('U');
    'ǘ': Exit('u');
    'Ǚ': Exit('U');
    'ǚ': Exit('u');
    'Ǜ': Exit('U');
    'ǜ': Exit('u');
    'Ǟ': Exit('A');
    'ǟ': Exit('a');
    'Ǡ': Exit('A');
    'ǡ': Exit('a');
    'Ǣ': Exit('A');
    'ǣ': Exit('a');
    'Ǥ': Exit('G');
    'ǥ': Exit('g');
    'Ǧ': Exit('G');
    'ǧ': Exit('g');
    'Ǩ': Exit('K');
    'ǩ': Exit('k');
    'Ǫ': Exit('O');
    'ǫ': Exit('o');
    'Ǭ': Exit('O');
    'ǭ': Exit('o');
    'ǰ': Exit('j');
    'Ǳ': Exit('D');
    'ǲ': Exit('D');
    'ǳ': Exit('d');
    'Ǵ': Exit('G');
    'ǵ': Exit('g');
    'Ǹ': Exit('N');
    'ǹ': Exit('n');
    'Ǻ': Exit('A');
    'ǻ': Exit('a');
    'Ǽ': Exit('A');
    'ǽ': Exit('a');
    'Ǿ': Exit('O');
    'ǿ': Exit('o');
    'Ȁ': Exit('A');
    'ȁ': Exit('a');
    'Ȃ': Exit('A');
    'ȃ': Exit('a');
    'Ȅ': Exit('E');
    'ȅ': Exit('e');
    'Ȇ': Exit('E');
    'ȇ': Exit('e');
    'Ȉ': Exit('I');
    'ȉ': Exit('i');
    'Ȋ': Exit('I');
    'ȋ': Exit('i');
    'Ȍ': Exit('O');
    'ȍ': Exit('o');
    'Ȏ': Exit('O');
    'ȏ': Exit('o');
    'Ȑ': Exit('R');
    'ȑ': Exit('r');
    'Ȓ': Exit('R');
    'ȓ': Exit('r');
    'Ȕ': Exit('U');
    'ȕ': Exit('u');
    'Ȗ': Exit('U');
    'ȗ': Exit('u');
    'Ș': Exit('S');
    'ș': Exit('s');
    'Ț': Exit('T');
    'ț': Exit('t');
    'Ȟ': Exit('H');
    'ȟ': Exit('h');
    'ȡ': Exit('d');
    'Ȥ': Exit('Z');
    'ȥ': Exit('z');
    'Ȧ': Exit('A');
    'ȧ': Exit('a');
    'Ȩ': Exit('E');
    'ȩ': Exit('e');
    'Ȫ': Exit('O');
    'ȫ': Exit('o');
    'Ȭ': Exit('O');
    'ȭ': Exit('o');
    'Ȯ': Exit('O');
    'ȯ': Exit('o');
    'Ȱ': Exit('O');
    'ȱ': Exit('o');
    'Ȳ': Exit('Y');
    'ȳ': Exit('y');
    'ȴ': Exit('l');
    'ȵ': Exit('n');
    'ȶ': Exit('t');
    'ȷ': Exit('j');
    'ȸ': Exit('d');
    'ȹ': Exit('q');
    'Ⱥ': Exit('A');
    'Ȼ': Exit('C');
    'ȼ': Exit('c');
    'Ƚ': Exit('L');
    'Ⱦ': Exit('T');
    'ȿ': Exit('s');
    'ɀ': Exit('z');
    'Ƀ': Exit('B');
    'Ʉ': Exit('U');
    'Ɇ': Exit('E');
    'ɇ': Exit('e');
    'Ɉ': Exit('J');
    'ɉ': Exit('j');
    'Ɍ': Exit('R');
    'ɍ': Exit('r');
    'Ɏ': Exit('Y');
    'ɏ': Exit('y');
    'ɓ': Exit('b');
    'ɕ': Exit('c');
    'ɖ': Exit('d');
    'ɗ': Exit('d');
    'ɛ': Exit('e');
    'ɟ': Exit('j');
    'ɠ': Exit('g');
    'ɡ': Exit('g');
    'ɢ': Exit('G');
    'ɦ': Exit('h');
    'ɧ': Exit('h');
    'ɨ': Exit('i');
    'ɪ': Exit('I');
    'ɫ': Exit('l');
    'ɬ': Exit('l');
    'ɭ': Exit('l');
    'ɱ': Exit('m');
    'ɲ': Exit('n');
    'ɳ': Exit('n');
    'ɴ': Exit('N');
    'ɶ': Exit('O');
    'ɼ': Exit('r');
    'ɽ': Exit('r');
    'ɾ': Exit('r');
    'ʀ': Exit('R');
    'ʂ': Exit('s');
    'ʈ': Exit('t');
    'ʉ': Exit('u');
    'ʋ': Exit('v');
    'ʏ': Exit('Y');
    'ʐ': Exit('z');
    'ʑ': Exit('z');
    'ʙ': Exit('B');
    'ʛ': Exit('G');
    'ʜ': Exit('H');
    'ʝ': Exit('j');
    'ʟ': Exit('L');
    'ʠ': Exit('q');
    'ʣ': Exit('d');
    'ʥ': Exit('d');
    'ʦ': Exit('t');
    'ʪ': Exit('l');
    'ʫ': Exit('l');
    'ʰ': Exit('h');
    'ʲ': Exit('j');
    'ʳ': Exit('r');
    'ʷ': Exit('w');
    'ʸ': Exit('y');
    'ˡ': Exit('l');
    'ˢ': Exit('s');
    'ˣ': Exit('x');
    else Exit('?');
  end;
end;

function AccentedNameToAscii(aUTF8Name: String): String;
var
  p, pEnd: PChar;
  i: Integer;
  s: String;
begin
  Result := '';
  p := PChar(aUTF8Name);
  pEnd := p + Length(aUTF8Name);
  // while (not repeat..until), so an empty input gives an empty result
  while p < pEnd do
  begin
    i := UTF8CodepointSize(p);
    if i <= 1 then
    begin
      // single byte (plain ASCII): copy it through unchanged
      Result += p^;
      Inc(p);
    end
    else
    begin
      // copy all i bytes of the codepoint in one go; advancing p by j-1
      // inside a byte loop would skip a byte for 3- and 4-byte sequences
      SetString(s, p, i);
      Result += NameChrToASCII(s);
      Inc(p, i);
    end;
  end;
end;

var
  strs: TStringDynArray;
  s: String;

begin
  strs := TStringDynArray.Create('Les Bruyères', 'Centre Médical Héliporté',
                                  'Vésale Heliport', 'Saïss Airport',
                                  'Fès-Boulemane', 'Léopold', 'Kédougou',
                                  'Cesária', 'Évora', 'São', 'Ploče', 'Otočac',
                                  'Čakovec', 'Almería', 'León', 'León',
                                  'Logroño-Agoncillo', 'Suárez', 'Compiègne',
                                  'Tréport', 'Périgueux', 'Targé', 'Châtellerault',
                                  'Épernay', 'Pápa', 'Pécs-Pogány', 'Győr-Pér', 'Pér');
  for s in strs do
    Writeln(s,'  ->  "',AccentedNameToAscii(s),'"');
  WriteLn(#10'Press [Enter] to finish');
  ReadLn;
end.
Interesting.
Title: Re: Character Conversions
Post by: marcov on September 21, 2019, 04:22:36 pm
A good solution is more than a simple array. Some non-Western scripts can have multiple accents per character and other special composition rules.
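
To illustrate the point about multiple accents: in decomposed text, each accent is a separate combining codepoint following the base letter, so one visible character can be three or more codepoints. The following is only a minimal sketch, not anyone's posted solution. The function name StripCombiningMarks is made up for this example, and it handles only the Combining Diacritical Marks block (U+0300..U+036F). Precomposed letters such as é (U+00E9) are not touched and would still need a lookup table.

```pascal
program StripCombiningDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: drops combining diacritical marks (U+0300..U+036F)
// from a UTF-8 string. In UTF-8 that block is encoded as the byte pairs
// $CC $80..$BF and $CD $80..$AF.
function StripCombiningMarks(const S: AnsiString): AnsiString;
var
  i, n: Integer;
begin
  SetLength(Result, Length(S));
  n := 0;
  i := 1;
  while i <= Length(S) do
  begin
    if (i < Length(S)) and
       (((S[i] = #$CC) and (S[i + 1] in [#$80..#$BF])) or
        ((S[i] = #$CD) and (S[i + 1] in [#$80..#$AF]))) then
      Inc(i, 2)                 // skip the two-byte combining mark
    else
    begin
      Inc(n);
      Result[n] := S[i];        // keep every other byte as it is
      Inc(i);
    end;
  end;
  SetLength(Result, n);
end;

begin
  // 'E' followed by U+0301 (combining acute) and U+0323 (combining dot
  // below), written as raw UTF-8 bytes so the source file encoding
  // does not matter.
  WriteLn(StripCombiningMarks('E'#$CC#$81#$CC#$A3'vora'));  // prints Evora
end.
```

Scripts with more complex composition rules need far more than this, which is the reason a plain array is not a complete answer.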
Title: Re: Character Conversions
Post by: winni on September 21, 2019, 11:04:07 pm
Hi!

As marcov suggested, I got that bunch of files from the Unicode Consortium last night. This is really close to a whole expert system. Yes, there are a lot of rules: at the beginning of a word it is this, otherwise that. And before a vowel it is that, otherwise this. And so on...

So I decided on a single-char solution like howardpc's, but I did a little more work:
* Complete Latin-1 Supplement
* Latin Extended-A
* Greek alphabet
* Russian alphabet

So I think most of Europe is covered. Latin Extended-B, C and D contain such rare letters that they are not worth the effort.

I made a little app that converts UTF-8 text files to ASCII. If anybody thinks something is incomplete or wants to add another language: just extend the constant array in the unit utf8toAsciiConvert. Then everything is done.

As a tough test I just converted a CSV geo database of 1.4 GB and 11 million lines. It takes some time but works fine.

The whole conversion is done in the unit utf8toAsciiConvert, so you can use it in other applications.

Winni
Title: Re: Character Conversions
Post by: marcov on September 21, 2019, 11:33:26 pm
Does ß to ss work?
Title: Re: Character Conversions
Post by: winni on September 21, 2019, 11:37:48 pm
line 86
Title: Re: Character Conversions
Post by: JLWest on September 22, 2019, 06:53:00 am
I have been out for a few days (hospital). Just got back.

WOW - WOW

Can't wait to give this a test. At this moment I have 170 files, all of which have 8 fields. Each file is a country file with all of the airports for that country:

|KAAA|Nil|Logan County Airport|Lincoln|Illinois|United States|
|KAAF|AAF|Apalachicola Regional Airport|Apalachicola|Florida|United States|
|KAAO|Nil|Colonel James Jabara Airport|Wichita|Kansas|United States|
|KAAS|Nil|Taylor County Airport|Campbellsville|Kentucky|United States|

The data comes from https://en.wikipedia.org/wiki/List_of_airports_by_ICAO_code:

I have to copy and paste and do a little prep, then I run a program which formats the data as shown above. Some of the files are quite small (Guam: 1 airport) but some are really big: US, Germany, Spain, China, Russia, Brazil.

So I'm about ready to copy out the T's from the site, implement a test demo and see what happens. But WOW, this is great. Thanks.

TA - Antigua and Barbuda
    TAPA (ANU) – VC Bird International Airport – Saint John's, Antigua
    TAPH (BBQ) – Codrington Airport – Codrington, Barbuda
    TAPT – Coco Point Lodge Airport – Coco Point, Barbuda
TB - Barbados
    TBPB (BGI) – Grantley Adams International Airport – Bridgetown
    TBPO – Bridgetown Heliport – Bridgetown (closed)
TD - Dominica
    TDCF (DCF) – Canefield Airport – Roseau
    TDPD (DOM) – Melville Hall Airport – Marigot
TF - Guadeloupe
    TFFA (DSD) – La Désirade Airport – Beauséjour, La Désirade
    TFFB (BBR) – Baillif Airport – Baillif, Basse-Terre
    TFFC (SFC) – Saint-François Airport – Saint-François, Grande-Terre
    TFFM (GBJ) – Marie-Galante Airport – Grand-Bourg, Marie-Galante
    TFFR (PTP) – Pointe-à-Pitre - Le Raizet Airport – Pointe-à-Pitre, Grande-Terre
    TFFS (LSS) – Les Saintes Airport – Terre-de-Haut, Les Saintes
Martinique
    TFFF (FDF) – Fort-de-France - Le Lamentin Airport – Le Lamentin, Fort-de-France
    TFFJ (SBH) – Gustaf III Airport – St. Jean
Saint Martin (France)
    TFFG (SFG) – L'Espérance Airport – Grand Case
TG - Grenada
    TGPG – Pearls Airport – Grenville
    TGPY (GND) – Maurice Bishop International Airport – St. George's
    TGPZ (CRU) – Lauriston Airport (Carriacou Island Airport) – Hillsborough, Carriacou Island
 
Title: Re: FIRST TEST Character Conversions
Post by: JLWest on September 22, 2019, 08:35:07 am

Input File:

TFFA (DSD) - La Désirade Airport - Beauséjour - La Désirade
    TFFB (BBR) - Baillif Airport - Baillif - Basse-Terre
    TFFC (SFC) - Saint-François Airport - Saint-François - Grande-Terre
    TFFM (GBJ) - Marie-Galante Airport - Grand-Bourg - Marie-Galante
    TFFR (PTP) - Pointe-à-Pitre - Le Raizet Airport - Pointe-à-Pitre - Grande-Terre
    TFFS (LSS) - Les Saintes Airport - Terre-de-Haut - Les Saintes

Output:

TFFA (DSD) - La D?sirade Airport - Beaus?jour - La D?sirade
    TFFB (BBR) - Baillif Airport - Baillif - Basse-Terre
    TFFC (SFC) - Saint-Fran?ois Airport - Saint-Fran?ois - Grande-Terre
    TFFM (GBJ) - Marie-Galante Airport - Grand-Bourg - Marie-Galante
    TFFR (PTP) - Pointe-?-Pitre - Le Raizet Airport - Pointe-?-Pitre - Grande-Terre
    TFFS (LSS) - Les Saintes Airport - Terre-de-Haut - Les Saintes

It didn't seem to convert anything: every accented character came back as '?'.

Am I doing something wrong?

It's after 11 here. I'll have to run it through the debugger tomorrow.

Title: Re: Character Conversions
Post by: bytebites on September 22, 2019, 10:12:57 am
Code: Pascal
function toascii(s: string): string;
type
  USASCIIString = type ansistring(20127);  // 20127 = US-ASCII code page
begin
  Result := USASCIIString(s);
end;

From Stack Overflow.
Title: Re: Character Conversions
Post by: munair on September 22, 2019, 11:26:43 am
Quote
Unicode systems are generally utf16, but that is not really the problem.

That would be Windows only then. Unix-based OSes and networks (the internet) primarily use UTF-8. Quote from Wikipedia:
Quote
UTF-16 is used internally by systems such as Windows, Java and JavaScript. It is also often used for plain text and for word-processing data files on Windows. It is rarely used for files on Unix/Linux or macOS. It never gained popularity on the web, where UTF-8 is dominant (and considered "the mandatory encoding for all [text]" by WHATWG). UTF-16 is used by under 0.01% of web pages themselves.

That said, I wonder how much money and effort have been put into software development to correctly support Unicode. Problems already start with lower and upper case conversion, especially for non-Latin languages. The definition of graphemes is sometimes vague and can even lead to controversies:
Quote
Unicode, in intent, encodes the underlying characters—graphemes and grapheme-like units—rather than the variant glyphs (renderings) for such characters. In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs

The Unicode system has become so complex that the consortium's basic support library takes up more than 60 MB (sounds like a tiny OS in its own right). The only languages that are still guaranteed to render and convert correctly are those whose writing systems are covered by basic ASCII, such as Dutch and English. These are also the most efficient, as they use 1 byte per character in UTF-8, which is not insignificant for network communications.
Title: Re: Character Conversions
Post by: winni on September 22, 2019, 07:15:54 pm
@JLWest

No, you are doing nothing wrong. I made a mistake in the replace function. Was too late last night ...

As attachment you get the correct version of utf8toAscii.

Winni
Title: Re: Character Conversions
Post by: marcov on September 22, 2019, 07:27:22 pm
Quote
Unicode systems are generally utf16, but that is not really the problem.

That would be Windows only then. Unix-based OSes and networks (the internet) primarily use UTF-8. Quote from Wikipedia:
Quote
UTF-16 is used internally by systems such as Windows, Java and JavaScript. It is also often used for plain text and for word-processing data files on Windows. It is rarely used for files on Unix/Linux or macOS. It never gained popularity on the web, where UTF-8 is dominant (and considered "the mandatory encoding for all [text]" by WHATWG). UTF-16 is used by under 0.01% of web pages themselves.

Java, Mono and Qt also exist on *nix, and are afaik primarily UTF-16. Document encoding is something totally different from API encodings, so it is less relevant.

Quote
That said, I wonder how much money and effort have been put into software development to correctly support Unicode. Problems already start with lower and upper case conversion, especially for non-Latin languages. The definition of graphemes is sometimes vague and can even lead to controversies:
Quote
Unicode, in intent, encodes the underlying characters—graphemes and grapheme-like units—rather than the variant glyphs (renderings) for such characters. In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs

The Unicode system has become so complex that the consortium's basic support library takes up more than 60 MB (sounds like a tiny OS in its own right). The only languages that are still guaranteed to render and convert correctly are those whose writing systems are covered by basic ASCII, such as Dutch and English. These are also the most efficient, as they use 1 byte per character in UTF-8, which is not insignificant for network communications.

Windows has all this in well-defined APIs. *nix has iconv, but exactly how it works and which encodings it knows is less evident, and that is about it.
Title: Re: Character Conversions
Post by: winni on September 22, 2019, 07:28:27 pm
@bytebites

Strange idea but just tested:

all utf8 chars are printed as ?

No, it does not work.

Winni
Title: Re: Character Conversions
Post by: JLWest on September 22, 2019, 09:18:46 pm
I just tested it on a large file and it worked perfectly.

This is great.
Title: Re: Character Conversions
Post by: Thaddy on September 23, 2019, 07:03:55 am
Quote
@bytebites

Strange idea but just tested:

all utf8 chars are printed as ?

No, it does not work.

Winni

Depends. Here's a demo: you are partially right and JLWest is wrong.
Code: Pascal
{$mode delphi}{$H+}
// This may fool you into thinking it always works!
// Make sure you prepare a file in an Ansi encoding
// that supports French. Then it doesn't work every time.
// It doesn't work at all for non-western code pages.
const strings:string =
'TFFA (DSD) - La Désirade Airport - Beauséjour - La Désirade'+LineEnding+
'TFFB (BBR) - Baillif Airport - Baillif - Basse-Terre'+LineEnding+
'TFFC (SFC) - Saint-François Airport - Saint-François - Grande-Terre'+LineEnding+
'TFFM (GBJ) - Marie-Galante Airport - Grand-Bourg - Marie-Galante'+LineEnding+
'TFFR (PTP) - Pointe-à-Pitre - Le Raizet Airport - Pointe-à-Pitre - Grande-Terre'+LineEnding+
'TFFS (LSS) - Les Saintes Airport - Terre-de-Haut - Les Saintes';

function toascii(s: string): string;
type
  USASCIIString = type ansistring(20127);  // 20127 = US-ASCII code page
begin
  Result := USASCIIString(s);
end;

begin
  writeln(ToAscii(Strings));
end.

If you run this, it works, but the test is flawed because I used the capabilities of the editor.
If you run the same code using a text file, the ???? start to appear. (Try a Lithuanian encoding, windows-1257, which is still western but with some twists in decoration, see Marco's remark, or KOI-8, with which many forum users are familiar.)

It is still a very useful function, but not perfect. But it is actually short and pretty concise, which I like, except for Lithuanian.... so I have to read my wife's letters by guessing the question marks... ::)
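
A note on why the same cast gives different results for different people in this thread. As far as I know, the typed AnsiString(20127) only asks the RTL to convert to US-ASCII; what happens to a character with no ASCII equivalent is decided by whatever conversion routine the platform provides. My assumption, which fits the reports above but which I have not verified on every setup, is that the Windows conversion may apply a best-fit mapping (é becomes e) while on Linux unmappable characters come out as '?'. A minimal sketch for trying it on your own system; the program name and the sample text are made up, and no expected output is given because it depends on the platform:

```pascal
program CodePageCastDemo;
{$mode objfpc}{$H+}
{$codepage utf8}   // string literals in this source file are UTF-8

uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager, needed for code page conversion on Unix
  SysUtils;

type
  USASCIIString = type AnsiString(20127);  // 20127 = US-ASCII

var
  src: UTF8String;
  dst: USASCIIString;
begin
  src := 'La Désirade';
  // Assigning between AnsiStrings with different declared code pages
  // makes the RTL convert the contents.
  dst := src;
  WriteLn(dst);
end.
```

Declaring the source as UTF8String makes the input encoding explicit, so the result does not depend on what the system default code page happens to be.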
Title: Re: Character Conversions
Post by: JLWest on September 23, 2019, 04:00:08 pm
@Thaddy

I'm just now testing bytebites' function. The one that I said works was winni's function. I should get to howardpc's code sometime later today or tomorrow.

I don't expect any of these functions to convert everything. With 11 million lines of text files, 90% would be fantastic.

It would be nice to have the ability to add characters that come back as '?'.
Title: Re: Character Conversions
Post by: winni on September 23, 2019, 04:57:15 pm
@thaddy

Your code does not work for me! Not writing to the console, not writing to a file, not using ShowMessage. The result is always the same: the French special chars are returned as '?'.

Linux, gtk2, KDE Plasma, Lazarus 2.04, fpc 3.01

@JLWest

Yes, you can enlarge the constant array ar in the unit utf8toAsciiConvert. Every entry looks like this:
Code: Pascal
(u: '€'; a: 'EUR'),

u is the UTF-8 input string, a is the ASCII output string.

Don't forget to then change the constant length in the array header!
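
For readers without the attachment, a table of that shape could be used roughly as follows. This is only a sketch and not the code of utf8toAsciiConvert: the names TMapEntry, Map and Utf8ToAsciiSketch and the four sample entries are made up, and the simple StringReplace loop is chosen for brevity, not speed. It assumes the source file is saved as UTF-8 without BOM, so the literals are stored as raw UTF-8 bytes.

```pascal
program TableReplaceSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  // Hypothetical record with the same field names as the entry shown above
  TMapEntry = record
    u: String;  // UTF-8 input sequence
    a: String;  // ASCII replacement, may be longer than one character
  end;

const
  // The upper bound has to match the number of entries.
  Map: array[0..3] of TMapEntry = (
    (u: '€'; a: 'EUR'),
    (u: 'ß'; a: 'ss'),
    (u: 'é'; a: 'e'),
    (u: 'ç'; a: 'c')
  );

function Utf8ToAsciiSketch(const S: String): String;
var
  i: Integer;
begin
  Result := S;
  // One pass over the text per table entry: simple, but slow on big files.
  for i := Low(Map) to High(Map) do
    Result := StringReplace(Result, Map[i].u, Map[i].a, [rfReplaceAll]);
end;

begin
  WriteLn(Utf8ToAsciiSketch('Saint-François, Désirade, 5 €'));
end.
```

Because the replacement is a string and not a single char, entries like ß to ss or € to EUR fit in the same table.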

Winni
Title: Re: Character Conversions
Post by: Thaddy on September 23, 2019, 06:43:59 pm
Quote
@thaddy
Your code does not work for me! Not writing to the console, not writing to a file, not using ShowMessage. The result is always the same: the French special chars are returned as '?'.

That was the intention of my example...
Title: Re: Character Conversions
Post by: JLWest on September 23, 2019, 06:53:07 pm
@Thaddy

"so I have to read my wife's letters by guessing the question marks."

No you don't, you don't even have to open the letters, Just send her another thousand.

Title: Re: Character Conversions
Post by: winni on September 23, 2019, 06:58:58 pm
@ Thaddy: Why?

There was no FreeAndNil in my code!

Title: Re: Character Conversions
Post by: JLWest on September 25, 2019, 11:02:19 pm
    function toascii(s: string): string;
    type
      USASCIIString = type ansistring(20127);
    begin
      Result := USASCIIString(s);
    end;
     
I ran 24 files through the translation function above.

The program produced 227 files (One per country).
Here is a list of the countries that did not translate:

Argentina.txt, Brazil.txt, Canada.txt, FaroeIslands.txt, Germany.txt, Iceland.txt, Italy.txt
Maldives.txt, Moldova.txt, Romania.txt, Thailand.txt, Tonga.txt, Uruguay.txt, Vietnam.txt

The 227 files have thousands of lines of data consisting of airport names, cities, states and countries.
That is about 90 - 95 percent of the world's airports.

I don't know what character sets it failed on. That is, what character set does Argentina have?

Actually I think that little one-line function did fantastically.
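
One way to find out which characters a file actually contains is to list the lines that have any byte above 127, before and after translation. A minimal sketch, assuming the file name is passed as the first command-line parameter; the program name and the helper HasNonAscii are made up for this example:

```pascal
program FindNonAscii;
{$mode objfpc}{$H+}

// True if the line contains at least one byte outside the 7-bit ASCII range
function HasNonAscii(const S: String): Boolean;
var
  i: Integer;
begin
  Result := False;
  for i := 1 to Length(S) do
    if Ord(S[i]) > 127 then
      Exit(True);
end;

var
  F: TextFile;
  Line: String;
  LineNo: Integer;
begin
  if ParamCount < 1 then
  begin
    WriteLn('usage: findnonascii <file>');
    Halt(1);
  end;
  AssignFile(F, ParamStr(1));
  Reset(F);
  try
    LineNo := 0;
    // Read line by line so that large files do not have to fit in memory
    while not Eof(F) do
    begin
      ReadLn(F, Line);
      Inc(LineNo);
      if HasNonAscii(Line) then
        WriteLn(LineNo, ': ', Line);
    end;
  finally
    CloseFile(F);
  end;
end.
```

Run on a source file such as Argentina.txt, it shows the lines that still need attention; run on a translated file, it should print nothing.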

@Thaddy

The text you refer to is the list of airports in Guadeloupe. This is what I have in TFile.txt after it was put through the translation. As you can see, there are no non-ASCII characters.

|*|Guadeloupe|3|City|Guadeloupe.txt|X|
    TFFA (DSD)|La Desirade Airport|Beausejour|La Desirade
    TFFB (BBR)|Baillif Airport|Baillif|Basse-Terre
    TFFC (SFC)|Saint-Francois Airport|Saint-Francois|Grande-Terre
    TFFM (GBJ)|Marie-Galante Airport|Grand-Bourg|Marie-Galante
    TFFR (PTP)|Pointe-a-Pitre|Le Raizet Airport|Pointe-a-Pitre

This is what was copied from https://en.wikipedia.org/wiki/List_of_airports_by_ICAO_code:_T
Guadeloupe definitely has non-ASCII characters.

Also see airport category and list.

    TFFA (DSD) – La Désirade Airport – Beauséjour, La Désirade
    TFFB (BBR) – Baillif Airport – Baillif, Basse-Terre
    TFFC (SFC) – Saint-François Airport – Saint-François, Grande-Terre
    TFFM (GBJ) – Marie-Galante Airport – Grand-Bourg, Marie-Galante
    TFFR (PTP) – Pointe-à-Pitre - Le Raizet Airport – Pointe-à-Pitre, Grande-Terre
    TFFS (LSS) – Les Saintes Airport – Terre-de-Haut, Les Saintes

This is what is in my Guadeloupe.txt file. I can't explain it.

|TFFA|DSD|La Desirade Airport|Beausejour|La Desirade|Guadeloupe|
|TFFB|BBR|Baillif Airport|Baillif|Basse-Terre|Guadeloupe|
|TFFC|SFC|Saint-Francois Airport|Saint-Francois|Grande-Terre|Guadeloupe|
|TFFM|GBJ|Marie-Galante Airport|Grand-Bourg|Marie-Galante|Guadeloupe|
|TFFR|PTP|Pointe-a-Pitre|Le Raizet Airport|Pointe-a-Pitre|Guadeloupe|

So now I will test winni's function, but just on the following files: Argentina.txt, Brazil.txt, Canada.txt, FaroeIslands.txt, Germany.txt, Iceland.txt, Italy.txt, Maldives.txt, Moldova.txt, Romania.txt, Thailand.txt, Tonga.txt, Uruguay.txt, Vietnam.txt.
Title: Re: Character Conversions
Post by: JLWest on September 28, 2019, 04:13:03 am
Just in the interest of completeness, I just finished testing howardpc's translation program. He had a program and I changed it to a unit so I could use it with any program as needed.

None of the three translation units could translate the Vietnam file. howardpc's code and winni's were very close; maybe howardpc's was a little better. Right now the program does not save the translated file. It's in a listbox. I haven't decided on naming conventions.

If anyone is interested I will post the program with some test data.

Send me a message or add a reply.
Title: Re: Character Conversions
Post by: winni on September 28, 2019, 01:44:22 pm
Hi!

As I wrote, my code is for the UTF-8 characters of the European languages.

One reason is that I know nothing about Chinese or Sanskrit.
The second reason is the nearly "endless" UTF-8 table. What is necessary?
Do we need Cherokee?

Winni
Title: Re: Character Conversions
Post by: JLWest on September 29, 2019, 06:29:45 am
Quote
Hi!

As I wrote, my code is for the UTF-8 characters of the European languages.

One reason is that I know nothing about Chinese or Sanskrit.
The second reason is the nearly "endless" UTF-8 table. What is necessary?
Do we need Cherokee?

Winni

That's fine, winni. I just translated the file by hand. I think it works great. Now I have all my data files translated.

Thanks one and All
Title: Re: Character Conversions
Post by: neuro on March 19, 2022, 10:44:18 am
Quote
If you run the same code using a text file, the ???? start to appear. (Try a Lithuanian encoding, windows-1257, which is still western but with some twists in decoration, see Marco's remark, or KOI-8, with which many forum users are familiar.)
It is still a very useful function, but not perfect. But it is actually short and pretty concise, which I like, except for Lithuanian.... so I have to read my wife's letters by guessing the question marks... ::)

Lithuanian CharSet Converter v.2.0
(free open-source cross-platform software)

Before UTF-8 character encoding was adopted, Lithuanians used several different character encodings which were incompatible with each other.

“Lithuanian charset converter” converts between legacy character encodings and modern UTF-8.

“Lithuanian charset converter” converts between:
• ASCII;
• 772 / Lithuanian Standard LST 1284:1993 (Lithuanian and Russian characters) ; 774 / Lithuanian Standard LST 1283:1993 (Lithuanian and English characters) ; 775 (Microsoft);
• 770 / IBM Baltic / Lithuanian Standard RST 1095-89;
• 771 / KBL / Baltic Amadeus (Lithuanian and Russian characters) ; 773 Lithuanian (mix of 771 and 775);
• Windows-1257 / IBM Baltic RIM ; Latin-7 / ISO-8859-13;
• Latin-4 / ISO-8859-4 ; Latin-6 / ISO-8859-10;
• UTF-8 BOM (byte order mark);
• UTF-8.

LAMW source code for Android:
http://cognaxon.com/downloads/LithuanianCharSetConverter/SourceCode/LithuanianCharSetConverter_Lazarus_Android.zip

Lazarus source code for Linux, Windows, macOS:
http://cognaxon.com/downloads/LithuanianCharSetConverter/SourceCode/LithuanianCharSetConverter_Lazarus.tar.gz
Title: Re: Character Conversions
Post by: Fred vS on March 19, 2022, 01:05:07 pm
Hello.

Note that the rendering of a character is font dependent.

For example, to list all the fonts compatible with Chinese ideograms:

In Linux (via terminal):
Code: Shell
/usr/bin/fc-list :lang=zh --format="%{family[0]}\n" | sort | uniq

In Windows (in short, via EnumFontFamiliesEx from windows.pp):
Code: Pascal
  lf.lfCharSet := 136; // CHINESEBIG5_CHARSET (Traditional Chinese)
...
   EnumFontFamiliesEX(DC, @lf, @EnumFontsNoDups, ptrint(L), 0);