Author Topic: Character Conversions  (Read 10512 times)

Birger52

  • Sr. Member
  • ****
  • Posts: 309
Re: Character Conversions
« Reply #15 on: September 20, 2019, 04:02:28 pm »
I think PHP actually can do what you want.
https://www.php.net/manual/en/function.mb-convert-encoding.php

Lazarus 2.0.8 FPC 3.0.4
Win7 64bit
Playing and learning - strictly for my own pleasure.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11351
  • FPC developer.
Re: Character Conversions
« Reply #16 on: September 20, 2019, 04:21:56 pm »
Quote
I can't get the code to compile.

It gives me an error on line 33.
  Iconvert(s, tmp, 'UTF-8', 'ASCII//TRANSLIT');  <-- can't find this

It appears to be a Unix thing.

Iconv doesn't come with Windows; it is an additional DLL.

There are variants of that DLL though, depending on which Unix-emulation-for-Windows you use (MinGW, Cygwin, etc.), and of course in 32-bit and 64-bit.

Moreover, it seems that the exact workings of the lib are a bit different on Windows (or only in some versions?); see https://bugs.freepascal.org/view.php?id=20531 .

This is why the iconv interface unit hasn't been enabled for Windows: nobody has tested it, and nobody provides fairly universally accepted DLLs.

Personally, I would see if I could make a mapping based on the Unicode tables (shipped with FPC as part of the rtl-unicode package, which plugs into unit character).

Note that the rules for removing accents might depend on language/country.

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Character Conversions
« Reply #17 on: September 20, 2019, 05:17:41 pm »
@marcov

Yes, a table to simplify UTF-8 down to ASCII would be great. And for this issue we don't need codepoints and other stuff. Just this:

Code: Pascal
type
  TUtf8toASCII = record
    CharIn : TUTF8Char;
    CharOut: Char;
  end;
  TUtf8toASCIIArray = array of TUtf8toASCII;

And then fill the array with data. Is this a hard job, or is the data already somewhere in the RTL?

Winni

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11351
  • FPC developer.
Re: Character Conversions
« Reply #18 on: September 20, 2019, 05:28:04 pm »
Unicode systems are generally UTF-16, but that is not really the problem.

The units with tables are in packages/rtl-unicode and the generators in utils/unicode.

The original tables can be downloaded from the Unicode consortium. Maybe you can modify one of the generators to create the table you want. 
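
To make the "generate the table from the Unicode data" idea concrete, here is a minimal sketch. It is not one of the FPC generators from utils/unicode; it only assumes the semicolon-separated layout of the Unicode Consortium's UnicodeData.txt, where field 0 is the code point and field 5 the decomposition mapping. The names GetField and BaseLetter are made up for this illustration.

```pascal
program DecompositionSketch;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Returns the aIndex-th (0-based) semicolon-separated field of a line.
function GetField(const aLine: String; aIndex: Integer): String;
var
  i, current, start: Integer;
begin
  Result := '';
  current := 0;
  start := 1;
  for i := 1 to Length(aLine) + 1 do
    if (i > Length(aLine)) or (aLine[i] = ';') then
    begin
      if current = aIndex then
        Exit(Copy(aLine, start, i - start));
      Inc(current);
      start := i + 1;
    end;
end;

// Returns the ASCII letter that starts the canonical decomposition,
// or #0 if there is none. Mappings tagged like <compat> are skipped.
function BaseLetter(const aLine: String): Char;
var
  decomp, first: String;
  code, p: Integer;
begin
  Result := #0;
  decomp := GetField(aLine, 5);
  if (decomp = '') or (decomp[1] = '<') then
    Exit;
  p := Pos(' ', decomp);
  if p > 0 then
    first := Copy(decomp, 1, p - 1)
  else
    first := decomp;
  code := StrToIntDef('$' + first, -1);   // the fields hold hex code points
  if ((code >= Ord('A')) and (code <= Ord('Z'))) or
     ((code >= Ord('a')) and (code <= Ord('z'))) then
    Result := Chr(code);
end;

const
  // Sample line in the UnicodeData.txt layout (entry for U+00C0).
  SampleLine = '00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;' +
               'LATIN CAPITAL LETTER A GRAVE;;;00E0;';

begin
  WriteLn('U+', GetField(SampleLine, 0), ' -> ', BaseLetter(SampleLine));
end.
```

Characters whose decomposition starts with another accented letter (for example a letter carrying two accents) would need the lookup repeated on that first code point, and letters without any decomposition, such as ø or ł, would still need hand-written entries.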


JLWest

  • Hero Member
  • *****
  • Posts: 1293
Re: Character Conversions
« Reply #19 on: September 20, 2019, 07:48:24 pm »
So if I understand this right, you are proposing an array of records, each pairing a UTF-8 char with its ASCII replacement. Then you search the array for each UTF-8 char and replace it with the ASCII one. That might work in most cases for most words.
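
As a rough sketch of that search-and-replace idea (an illustration only, not winni's code): the record below mirrors the proposed layout, with a hypothetical TUtf8Bytes short string standing in for TUTF8Char so the example compiles without the LCL. The table holds just four entries, and the helper name LookupAscii is invented for this example.

```pascal
program TableLookupSketch;

{$mode objfpc}{$H+}

type
  // Hypothetical stand-in for TUTF8Char: the raw bytes of one codepoint.
  TUtf8Bytes = String[4];

  TUtf8ToAscii = record
    CharIn : TUtf8Bytes;  // UTF-8 bytes of the accented character
    CharOut: Char;        // ASCII replacement
  end;

const
  // Tiny illustrative table. Byte escapes are used so the example does not
  // depend on the encoding of the source file.
  Table: array[0..3] of TUtf8ToAscii = (
    (CharIn: #$C3#$A9; CharOut: 'e'),  // U+00E9, e with acute
    (CharIn: #$C3#$A8; CharOut: 'e'),  // U+00E8, e with grave
    (CharIn: #$C3#$B1; CharOut: 'n'),  // U+00F1, n with tilde
    (CharIn: #$C4#$8D; CharOut: 'c')   // U+010D, c with caron
  );

// Linear search; returns '?' when the character is not in the table.
function LookupAscii(const aCodepoint: String): Char;
var
  i: Integer;
begin
  for i := Low(Table) to High(Table) do
    if Table[i].CharIn = aCodepoint then
      Exit(Table[i].CharOut);
  Result := '?';
end;

begin
  WriteLn(LookupAscii(#$C3#$A9));
  WriteLn(LookupAscii(#$C4#$8D));
  WriteLn(LookupAscii(#$C5#$91));  // U+0151 is not in this tiny table
end.
```

A linear search is fine for a few hundred entries; for large files a sorted table with a binary search, or a direct array indexed by code point, would likely be faster.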

 
FPC 3.2.0, Lazarus IDE v2.0.4
 Windows 10 Pro 32-GB
 Intel i7 770K CPU 4.2GHz 32702MB Ram
GeForce GTX 1080 Graphics - 8 Gig
4.1 TB

JLWest

  • Hero Member
  • *****
  • Posts: 1293
Re: Character Conversions
« Reply #20 on: September 20, 2019, 07:50:50 pm »
Quote from: Birger52
I think PHP actually can do what you want.
https://www.php.net/manual/en/function.mb-convert-encoding.php

Maybe it could. I can't tell, since I don't know PHP.

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Character Conversions
« Reply #21 on: September 20, 2019, 08:18:20 pm »
@JLWest

Yes, you got me!

Winni

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: Character Conversions
« Reply #22 on: September 21, 2019, 02:02:06 pm »
Here's an example using a simple function that should work on Windows (not tested) and does not use iconvenc.

The NameChrToASCII function works fine on Linux and has no Linux-specific dependencies, so it should be OK on Windows. It is far from comprehensive, but it certainly covers all the accented Unicode codepoints in JLWest's example.

Code: Pascal
program TestUTF8ToASCII;

{$mode objfpc}{$H+}
{$IfDef windows}
{$AppType console}
{$EndIf}

uses
  Classes, LazUTF8, Types;

// Maps one UTF-8 encoded codepoint (at most two bytes) to an ASCII character.
// Anything longer, or not listed, becomes '?'.
function NameChrToASCII(aUTF8Codepoint: String): Char;
begin
  if Length(aUTF8Codepoint) > 2 then
    Exit('?');
  case aUTF8Codepoint of
    'À': Exit('A');
    'Á': Exit('A');
    'Â': Exit('A');
    'Ã': Exit('A');
    'Ä': Exit('A');
    'Å': Exit('A');
    'Æ': Exit('A');
    'Ç': Exit('C');
    'È': Exit('E');
    'É': Exit('E');
    'Ê': Exit('E');
    'Ë': Exit('E');
    'Ì': Exit('I');
    'Í': Exit('I');
    'Î': Exit('I');
    'Ï': Exit('I');
    'Ð': Exit('D');
    'Ñ': Exit('N');
    'Ò': Exit('O');
    'Ó': Exit('O');
    'Ô': Exit('O');
    'Õ': Exit('O');
    'Ö': Exit('O');
    '×': Exit('x');
    'Ø': Exit('O');
    'Ù': Exit('U');
    'Ú': Exit('U');
    'Û': Exit('U');
    'Ü': Exit('U');
    'Ý': Exit('Y');
    'Þ': Exit('T');
    'ß': Exit('s');
    'à': Exit('a');
    'á': Exit('a');
    'â': Exit('a');
    'ã': Exit('a');
    'ä': Exit('a');
    'å': Exit('a');
    'æ': Exit('a');
    'ç': Exit('c');
    'è': Exit('e');
    'é': Exit('e');
    'ê': Exit('e');
    'ë': Exit('e');
    'ì': Exit('i');
    'í': Exit('i');
    'î': Exit('i');
    'ï': Exit('i');
    'ð': Exit('d');
    'ñ': Exit('n');
    'ò': Exit('o');
    'ó': Exit('o');
    'ô': Exit('o');
    'õ': Exit('o');
    'ö': Exit('o');
    'ø': Exit('o');
    'ù': Exit('u');
    'ú': Exit('u');
    'û': Exit('u');
    'ü': Exit('u');
    'ý': Exit('y');
    'þ': Exit('t');
    'ÿ': Exit('y');
    'Ā': Exit('A');
    'ā': Exit('a');
    'Ă': Exit('A');
    'ă': Exit('a');
    'Ą': Exit('A');
    'ą': Exit('a');
    'Ć': Exit('C');
    'ć': Exit('c');
    'Ĉ': Exit('C');
    'ĉ': Exit('c');
    'Ċ': Exit('C');
    'ċ': Exit('c');
    'Č': Exit('C');
    'č': Exit('c');
    'Ď': Exit('D');
    'ď': Exit('d');
    'Đ': Exit('D');
    'đ': Exit('d');
    'Ē': Exit('E');
    'ē': Exit('e');
    'Ĕ': Exit('E');
    'ĕ': Exit('e');
    'Ė': Exit('E');
    'ė': Exit('e');
    'Ę': Exit('E');
    'ę': Exit('e');
    'Ě': Exit('E');
    'ě': Exit('e');
    'Ĝ': Exit('G');
    'ĝ': Exit('g');
    'Ğ': Exit('G');
    'ğ': Exit('g');
    'Ġ': Exit('G');
    'ġ': Exit('g');
    'Ģ': Exit('G');
    'ģ': Exit('g');
    'Ĥ': Exit('H');
    'ĥ': Exit('h');
    'Ħ': Exit('H');
    'ħ': Exit('h');
    'Ĩ': Exit('I');
    'ĩ': Exit('i');
    'Ī': Exit('I');
    'ī': Exit('i');
    'Ĭ': Exit('I');
    'ĭ': Exit('i');
    'Į': Exit('I');
    'į': Exit('i');
    'İ': Exit('I');
    'ı': Exit('i');
    'Ĳ': Exit('I');
    'ĳ': Exit('i');
    'Ĵ': Exit('J');
    'ĵ': Exit('j');
    'Ķ': Exit('K');
    'ķ': Exit('k');
    'ĸ': Exit('q');
    'Ĺ': Exit('L');
    'ĺ': Exit('l');
    'Ļ': Exit('L');
    'ļ': Exit('l');
    'Ľ': Exit('L');
    'ľ': Exit('l');
    'Ŀ': Exit('L');
    'ŀ': Exit('l');
    'Ł': Exit('L');
    'ł': Exit('l');
    'Ń': Exit('N');
    'ń': Exit('n');
    'Ņ': Exit('N');
    'ņ': Exit('n');
    'Ň': Exit('N');
    'ň': Exit('n');
    'Ŋ': Exit('N');
    'ŋ': Exit('n');
    'Ō': Exit('O');
    'ō': Exit('o');
    'Ŏ': Exit('O');
    'ŏ': Exit('o');
    'Ő': Exit('O');
    'ő': Exit('o');
    'Œ': Exit('O');
    'œ': Exit('o');
    'Ŕ': Exit('R');
    'ŕ': Exit('r');
    'Ŗ': Exit('R');
    'ŗ': Exit('r');
    'Ř': Exit('R');
    'ř': Exit('r');
    'Ś': Exit('S');
    'ś': Exit('s');
    'Ŝ': Exit('S');
    'ŝ': Exit('s');
    'Ş': Exit('S');
    'ş': Exit('s');
    'Š': Exit('S');
    'š': Exit('s');
    'Ţ': Exit('T');
    'ţ': Exit('t');
    'Ť': Exit('T');
    'ť': Exit('t');
    'Ŧ': Exit('T');
    'ŧ': Exit('t');
    'Ũ': Exit('U');
    'ũ': Exit('u');
    'Ū': Exit('U');
    'ū': Exit('u');
    'Ŭ': Exit('U');
    'ŭ': Exit('u');
    'Ů': Exit('U');
    'ů': Exit('u');
    'Ű': Exit('U');
    'ű': Exit('u');
    'Ų': Exit('U');
    'ų': Exit('u');
    'Ŵ': Exit('W');
    'ŵ': Exit('w');
    'Ŷ': Exit('Y');
    'ŷ': Exit('y');
    'Ÿ': Exit('Y');
    'Ź': Exit('Z');
    'ź': Exit('z');
    'Ż': Exit('Z');
    'ż': Exit('z');
    'Ž': Exit('Z');
    'ž': Exit('z');
    'ſ': Exit('s');
    'ƀ': Exit('b');
    'Ɓ': Exit('B');
    'Ƃ': Exit('B');
    'ƃ': Exit('b');
    'Ƈ': Exit('C');
    'ƈ': Exit('c');
    'Ɖ': Exit('D');
    'Ɗ': Exit('D');
    'Ƌ': Exit('D');
    'ƌ': Exit('d');
    'Ɛ': Exit('E');
    'Ƒ': Exit('F');
    'ƒ': Exit('f');
    'Ɠ': Exit('G');
    'ƕ': Exit('h');
    'Ɩ': Exit('I');
    'Ɨ': Exit('I');
    'Ƙ': Exit('K');
    'ƙ': Exit('k');
    'ƚ': Exit('l');
    'Ɲ': Exit('N');
    'ƞ': Exit('n');
    'Ơ': Exit('O');
    'ơ': Exit('o');
    'Ƣ': Exit('O');
    'ƣ': Exit('o');
    'Ƥ': Exit('P');
    'ƥ': Exit('p');
    'ƫ': Exit('t');
    'Ƭ': Exit('T');
    'ƭ': Exit('t');
    'Ʈ': Exit('T');
    'Ư': Exit('U');
    'ư': Exit('u');
    'Ʋ': Exit('V');
    'Ƴ': Exit('Y');
    'ƴ': Exit('y');
    'Ƶ': Exit('Z');
    'ƶ': Exit('z');
    'Ǉ': Exit('L');
    'ǈ': Exit('L');
    'ǉ': Exit('l');
    'Ǌ': Exit('N');
    'ǋ': Exit('N');
    'ǌ': Exit('n');
    'Ǎ': Exit('A');
    'ǎ': Exit('a');
    'Ǐ': Exit('I');
    'ǐ': Exit('i');
    'Ǒ': Exit('O');
    'ǒ': Exit('o');
    'Ǔ': Exit('U');
    'ǔ': Exit('u');
    'Ǖ': Exit('U');
    'ǖ': Exit('u');
    'Ǘ': Exit('U');
    'ǘ': Exit('u');
    'Ǚ': Exit('U');
    'ǚ': Exit('u');
    'Ǜ': Exit('U');
    'ǜ': Exit('u');
    'Ǟ': Exit('A');
    'ǟ': Exit('a');
    'Ǡ': Exit('A');
    'ǡ': Exit('a');
    'Ǣ': Exit('A');
    'ǣ': Exit('a');
    'Ǥ': Exit('G');
    'ǥ': Exit('g');
    'Ǧ': Exit('G');
    'ǧ': Exit('g');
    'Ǩ': Exit('K');
    'ǩ': Exit('k');
    'Ǫ': Exit('O');
    'ǫ': Exit('o');
    'Ǭ': Exit('O');
    'ǭ': Exit('o');
    'ǰ': Exit('j');
    'Ǳ': Exit('D');
    'ǲ': Exit('D');
    'ǳ': Exit('d');
    'Ǵ': Exit('G');
    'ǵ': Exit('g');
    'Ǹ': Exit('N');
    'ǹ': Exit('n');
    'Ǻ': Exit('A');
    'ǻ': Exit('a');
    'Ǽ': Exit('A');
    'ǽ': Exit('a');
    'Ǿ': Exit('O');
    'ǿ': Exit('o');
    'Ȁ': Exit('A');
    'ȁ': Exit('a');
    'Ȃ': Exit('A');
    'ȃ': Exit('a');
    'Ȅ': Exit('E');
    'ȅ': Exit('e');
    'Ȇ': Exit('E');
    'ȇ': Exit('e');
    'Ȉ': Exit('I');
    'ȉ': Exit('i');
    'Ȋ': Exit('I');
    'ȋ': Exit('i');
    'Ȍ': Exit('O');
    'ȍ': Exit('o');
    'Ȏ': Exit('O');
    'ȏ': Exit('o');
    'Ȑ': Exit('R');
    'ȑ': Exit('r');
    'Ȓ': Exit('R');
    'ȓ': Exit('r');
    'Ȕ': Exit('U');
    'ȕ': Exit('u');
    'Ȗ': Exit('U');
    'ȗ': Exit('u');
    'Ș': Exit('S');
    'ș': Exit('s');
    'Ț': Exit('T');
    'ț': Exit('t');
    'Ȟ': Exit('H');
    'ȟ': Exit('h');
    'ȡ': Exit('d');
    'Ȥ': Exit('Z');
    'ȥ': Exit('z');
    'Ȧ': Exit('A');
    'ȧ': Exit('a');
    'Ȩ': Exit('E');
    'ȩ': Exit('e');
    'Ȫ': Exit('O');
    'ȫ': Exit('o');
    'Ȭ': Exit('O');
    'ȭ': Exit('o');
    'Ȯ': Exit('O');
    'ȯ': Exit('o');
    'Ȱ': Exit('O');
    'ȱ': Exit('o');
    'Ȳ': Exit('Y');
    'ȳ': Exit('y');
    'ȴ': Exit('l');
    'ȵ': Exit('n');
    'ȶ': Exit('t');
    'ȷ': Exit('j');
    'ȸ': Exit('d');
    'ȹ': Exit('q');
    'Ⱥ': Exit('A');
    'Ȼ': Exit('C');
    'ȼ': Exit('c');
    'Ƚ': Exit('L');
    'Ⱦ': Exit('T');
    'ȿ': Exit('s');
    'ɀ': Exit('z');
    'Ƀ': Exit('B');
    'Ʉ': Exit('U');
    'Ɇ': Exit('E');
    'ɇ': Exit('e');
    'Ɉ': Exit('J');
    'ɉ': Exit('j');
    'Ɍ': Exit('R');
    'ɍ': Exit('r');
    'Ɏ': Exit('Y');
    'ɏ': Exit('y');
    'ɓ': Exit('b');
    'ɕ': Exit('c');
    'ɖ': Exit('d');
    'ɗ': Exit('d');
    'ɛ': Exit('e');
    'ɟ': Exit('j');
    'ɠ': Exit('g');
    'ɡ': Exit('g');
    'ɢ': Exit('G');
    'ɦ': Exit('h');
    'ɧ': Exit('h');
    'ɨ': Exit('i');
    'ɪ': Exit('I');
    'ɫ': Exit('l');
    'ɬ': Exit('l');
    'ɭ': Exit('l');
    'ɱ': Exit('m');
    'ɲ': Exit('n');
    'ɳ': Exit('n');
    'ɴ': Exit('N');
    'ɶ': Exit('O');
    'ɼ': Exit('r');
    'ɽ': Exit('r');
    'ɾ': Exit('r');
    'ʀ': Exit('R');
    'ʂ': Exit('s');
    'ʈ': Exit('t');
    'ʉ': Exit('u');
    'ʋ': Exit('v');
    'ʏ': Exit('Y');
    'ʐ': Exit('z');
    'ʑ': Exit('z');
    'ʙ': Exit('B');
    'ʛ': Exit('G');
    'ʜ': Exit('H');
    'ʝ': Exit('j');
    'ʟ': Exit('L');
    'ʠ': Exit('q');
    'ʣ': Exit('d');
    'ʥ': Exit('d');
    'ʦ': Exit('t');
    'ʪ': Exit('l');
    'ʫ': Exit('l');
    'ʰ': Exit('h');
    'ʲ': Exit('j');
    'ʳ': Exit('r');
    'ʷ': Exit('w');
    'ʸ': Exit('y');
    'ˡ': Exit('l');
    'ˢ': Exit('s');
    'ˣ': Exit('x');
    else Exit('?');
  end;
end;

// Walks the UTF-8 string one codepoint at a time. Single-byte characters are
// copied as-is; multi-byte codepoints are replaced via NameChrToASCII.
function AccentedNameToAscii(aUTF8Name: String): String;
var
  p, pEnd: PChar;
  i: Integer;
  s: String;
begin
  Result := '';
  p := PChar(aUTF8Name);
  pEnd := p + Length(aUTF8Name);
  while p < pEnd do
  begin
    i := UTF8CodepointSize(p);
    if i <= 1 then
    begin
      Result := Result + p^;
      Inc(p);
    end
    else
    begin
      SetString(s, p, i);  // copy the i bytes that make up this codepoint
      Result := Result + NameChrToASCII(s);
      Inc(p, i);           // advance past the whole codepoint
    end;
  end;
end;

var
  strs: TStringDynArray;
  s: String;

begin
  strs := TStringDynArray.Create('Les Bruyères', 'Centre Médical Héliporté',
                                  'Vésale Heliport', 'Saïss Airport',
                                  'Fès-Boulemane', 'Léopold', 'Kédougou',
                                  'Cesária', 'Évora', 'São', 'Ploče', 'Otočac',
                                  'Čakovec', 'Almería', 'León', 'León',
                                  'Logroño-Agoncillo', 'Suárez', 'Compiègne',
                                  'Tréport', 'Périgueux', 'Targé', 'Châtellerault',
                                  'Épernay', 'Pápa', 'Pécs-Pogány', 'Győr-Pér', 'Pér');
  for s in strs do
    Writeln(s,'  ->  "',AccentedNameToAscii(s),'"');
  WriteLn(#10'Press [Enter] to finish');
  ReadLn;
end.

valdir.marcos

  • Hero Member
  • *****
  • Posts: 1106
Re: Character Conversions
« Reply #23 on: September 21, 2019, 02:58:29 pm »
Interesting.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11351
  • FPC developer.
Re: Character Conversions
« Reply #24 on: September 21, 2019, 04:22:36 pm »
A good solution needs more than a simple array. Some non-Western scripts can have multiple accents per character, as well as other special composition schemes.
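
One case a single-character table cannot see is an accent stored as a separate combining mark after its base letter. The sketch below is only an illustration of that point and assumes the input may contain marks from the Combining Diacritical Marks block (U+0300..U+036F); the function name StripCombiningMarks is made up here.

```pascal
program CombiningMarksSketch;

{$mode objfpc}{$H+}

// Drops codepoints in the Combining Diacritical Marks block (U+0300..U+036F),
// which encode in UTF-8 as $CC $80..$BF and $CD $80..$AF.
function StripCombiningMarks(const aUTF8: String): String;
var
  i: Integer;
begin
  Result := '';
  i := 1;
  while i <= Length(aUTF8) do
  begin
    if (i < Length(aUTF8)) and
       (((aUTF8[i] = #$CC) and (aUTF8[i + 1] in [#$80..#$BF])) or
        ((aUTF8[i] = #$CD) and (aUTF8[i + 1] in [#$80..#$AF]))) then
      Inc(i, 2)                       // skip the two-byte combining mark
    else
    begin
      Result := Result + aUTF8[i];    // keep every other byte unchanged
      Inc(i);
    end;
  end;
end;

begin
  // 'e' followed by U+0301 (combining acute) and U+0323 (combining dot below):
  // two accents on one base letter, none of them part of a precomposed char.
  WriteLn(StripCombiningMarks('e'#$CC#$81#$CC#$A3'vora'));
end.
```

This only removes separate marks; precomposed characters such as é pass through untouched and still need a table or a decomposition step, and other scripts have combining marks outside this block.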

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Character Conversions
« Reply #25 on: September 21, 2019, 11:04:07 pm »
Hi!

As Markov told me I got that bunch of file from the utf8-consortium last night. This is really near to a whole expert-system. Yes, there are a lot of rules: at the beginning of a word it this otherwise that. And before a wovel it is that otherwise this. And, and.....

So I decided to come to a single-char-solution like Howardpc, but I did a little more of work:
* Complete Latin 1 supplement
* Latin Extended A
* Greek Alphabet
* Russian Alphabet

So I think most of Europe is covered. Latin Extended B, C and D contain such rare letters that it's not worth the effort.

I made a little app that converts UTF-8 text files to ASCII. If anybody finds something incomplete or wants to add another language: just extend the constant array in unit utf8toAsciiConvert. That is all that's needed.

As a stress test I just converted a CSV geo database of 1.4 GB and 11 million lines. It takes some time but works fine.

All of the conversion is done in the unit utf8toAsciiConvert, so you can use it in other applications.
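
The attached unit is not reproduced in the thread, so the following is only a sketch of what such a table-driven, single-character conversion could look like. The names TMapEntry, Map, NextCodePoint and Utf8ToAscii are hypothetical, and the table holds just a few entries. It is keyed by code point rather than by UTF-8 character so that the example does not depend on the encoding of the source file.

```pascal
program Utf8ToAsciiSketch;
{$mode objfpc}{$H+}

type
  TMapEntry = record
    CodePoint: Cardinal;  // Unicode code point of the input character
    Ascii: string;        // ASCII replacement, may be more than one char
  end;

const
  // Tiny illustrative table; a real one would cover whole Unicode blocks.
  Map: array[0..6] of TMapEntry = (
    (CodePoint: $00C9; Ascii: 'E'),   // É
    (CodePoint: $00DF; Ascii: 'ss'),  // ß
    (CodePoint: $00E0; Ascii: 'a'),   // à
    (CodePoint: $00E7; Ascii: 'c'),   // ç
    (CodePoint: $00E9; Ascii: 'e'),   // é
    (CodePoint: $00F1; Ascii: 'n'),   // ñ
    (CodePoint: $010C; Ascii: 'C'));  // Č

// Decodes the UTF-8 sequence starting at S[Idx] and advances Idx past it.
// Malformed or truncated sequences yield $FFFD and consume a single byte.
// Overlong encodings are not rejected.
function NextCodePoint(const S: RawByteString; var Idx: Integer): Cardinal;
var
  B: Byte;
  Len, I: Integer;
begin
  B := Ord(S[Idx]);
  if B < $80 then begin Result := B; Len := 1; end
  else if (B and $E0) = $C0 then begin Result := B and $1F; Len := 2; end
  else if (B and $F0) = $E0 then begin Result := B and $0F; Len := 3; end
  else if (B and $F8) = $F0 then begin Result := B and $07; Len := 4; end
  else begin Result := $FFFD; Len := 1; end;
  if Idx + Len - 1 > Length(S) then
  begin
    Result := $FFFD;
    Len := 1;
  end;
  for I := 1 to Len - 1 do
  begin
    if (Ord(S[Idx + I]) and $C0) <> $80 then
    begin
      Result := $FFFD;  // expected a continuation byte (10xxxxxx)
      Len := 1;
      Break;
    end;
    Result := (Result shl 6) or (Ord(S[Idx + I]) and $3F);
  end;
  Inc(Idx, Len);
end;

// Replaces every non-ASCII code point by its table entry, or by Fallback
// when the table has no entry for it.
function Utf8ToAscii(const S: RawByteString; Fallback: Char = '?'): string;
var
  Idx, I: Integer;
  CP: Cardinal;
  Found: Boolean;
begin
  Result := '';
  Idx := 1;
  while Idx <= Length(S) do
  begin
    CP := NextCodePoint(S, Idx);
    if CP < $80 then
      Result := Result + Chr(Byte(CP))
    else
    begin
      Found := False;
      for I := Low(Map) to High(Map) do
        if Map[I].CodePoint = CP then
        begin
          Result := Result + Map[I].Ascii;
          Found := True;
          Break;
        end;
      if not Found then
        Result := Result + Fallback;
    end;
  end;
end;

begin
  // Inputs are written as explicit UTF-8 bytes: #$C3#$A9 is "é", #$C3#$9F is "ß".
  WriteLn(Utf8ToAscii('La D'#$C3#$A9'sirade'));  // prints La Desirade
  WriteLn(Utf8ToAscii('Stra'#$C3#$9F'e'));       // prints Strasse
end.
```

Because the replacement is a string rather than a single Char, one input character can expand to several ASCII characters. The linear search is fine for a handful of entries; a table covering several Unicode blocks would probably be sorted and searched by bisection instead.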

Winni

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11351
  • FPC developer.
Re: Character Conversions
« Reply #26 on: September 21, 2019, 11:33:26 pm »
Does ß to ss work?

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Character Conversions
« Reply #27 on: September 21, 2019, 11:37:48 pm »
line 86

JLWest

  • Hero Member
  • *****
  • Posts: 1293
Re: Character Conversions
« Reply #28 on: September 22, 2019, 06:53:00 am »
I have been out for a few days (hospital). Just got back.

WOW - WOW

Can't wait to give this a test. At the moment I have 170 files, all of which have 8 fields. Each file is a country file with all of the airports for that country:

|KAAA|Nil|Logan County Airport|Lincoln|Illinois|United States|
|KAAF|AAF|Apalachicola Regional Airport|Apalachicola|Florida|United States|
|KAAO|Nil|Colonel James Jabara Airport|Wichita|Kansas|United States|
|KAAS|Nil|Taylor County Airport|Campbellsville|Kentucky|United States|

The data comes from https://en.wikipedia.org/wiki/List_of_airports_by_ICAO_code:

I have to copy and paste and do a little prep, then I run a program which formats it to the above. Some of the files are quite small (Guam, 1 airport) but some are really big: US, Germany, Spain, China, Russia, Brazil.

So I'm about ready to copy out the T's from the site, implement a test demo and see what happens. But WOW, this is great. Thanks.

TA - Antigua and Barbuda
    TAPA (ANU) – VC Bird International Airport – Saint John's, Antigua
    TAPH (BBQ) – Codrington Airport – Codrington, Barbuda
    TAPT – Coco Point Lodge Airport – Coco Point, Barbuda
TB - Barbados
    TBPB (BGI) – Grantley Adams International Airport – Bridgetown
    TBPO – Bridgetown Heliport – Bridgetown (closed)
TD - Dominica
    TDCF (DCF) – Canefield Airport – Roseau
    TDPD (DOM) – Melville Hall Airport – Marigot
TF - Guadeloupe
    TFFA (DSD) – La Désirade Airport – Beauséjour, La Désirade
    TFFB (BBR) – Baillif Airport – Baillif, Basse-Terre
    TFFC (SFC) – Saint-François Airport – Saint-François, Grande-Terre
    TFFM (GBJ) – Marie-Galante Airport – Grand-Bourg, Marie-Galante
    TFFR (PTP) – Pointe-à-Pitre - Le Raizet Airport – Pointe-à-Pitre, Grande-Terre
    TFFS (LSS) – Les Saintes Airport – Terre-de-Haut, Les Saintes
Martinique
    TFFF (FDF) – Fort-de-France - Le Lamentin Airport – Le Lamentin, Fort-de-France
    TFFJ (SBH) – Gustaf III Airport – St. Jean
Saint Martin (France)
    TFFG (SFG) – L'Espérance Airport – Grand Case
TG - Grenada
    TGPG – Pearls Airport – Grenville
    TGPY (GND) – Maurice Bishop International Airport – St. George's
    TGPZ (CRU) – Lauriston Airport (Carriacou Island Airport) – Hillsborough, Carriacou Island
 
FPC 3.2.0, Lazarus IDE v2.0.4
 Windows 10 Pro 32-GB
 Intel i7 770K CPU 4.2GHz 32702MB Ram
GeForce GTX 1080 Graphics - 8 Gig
4.1 TB

JLWest

  • Hero Member
  • *****
  • Posts: 1293
Re: FIRST TEST Character Conversions
« Reply #29 on: September 22, 2019, 08:35:07 am »

Input File:

TFFA (DSD) - La Désirade Airport - Beauséjour - La Désirade
    TFFB (BBR) - Baillif Airport - Baillif - Basse-Terre
    TFFC (SFC) - Saint-François Airport - Saint-François - Grande-Terre
    TFFM (GBJ) - Marie-Galante Airport - Grand-Bourg - Marie-Galante
    TFFR (PTP) - Pointe-à-Pitre - Le Raizet Airport - Pointe-à-Pitre - Grande-Terre
    TFFS (LSS) - Les Saintes Airport - Terre-de-Haut - Les Saintes

Output:

TFFA (DSD) - La D?sirade Airport - Beaus?jour - La D?sirade
    TFFB (BBR) - Baillif Airport - Baillif - Basse-Terre
    TFFC (SFC) - Saint-Fran?ois Airport - Saint-Fran?ois - Grande-Terre
    TFFM (GBJ) - Marie-Galante Airport - Grand-Bourg - Marie-Galante
    TFFR (PTP) - Pointe-?-Pitre - Le Raizet Airport - Pointe-?-Pitre - Grande-Terre
    TFFS (LSS) - Les Saintes Airport - Terre-de-Haut - Les Saintes

It didn't seem to convert anything; the accented characters all came out as "?".

Am I doing something wrong?

It's after 11 here. I'll have to run it through the debugger tomorrow.
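
One thing that could be checked here, purely as an assumption about the cause: if the input file was saved in an ANSI code page such as Windows-1252 rather than UTF-8, each accented letter is a single byte above $7F, which a UTF-8 based converter cannot match. A minimal sketch of a structural check follows. FirstInvalidUtf8Byte is a hypothetical helper, and it does not reject overlong encodings or surrogates.

```pascal
program CheckUtf8Sketch;
{$mode objfpc}{$H+}

// Returns the 1-based index of the first byte that breaks the UTF-8 byte
// structure, or 0 when the whole string passes the check.
function FirstInvalidUtf8Byte(const S: RawByteString): Integer;
var
  I, Len, K: Integer;
  B: Byte;
begin
  I := 1;
  while I <= Length(S) do
  begin
    B := Ord(S[I]);
    if B < $80 then Len := 1
    else if (B and $E0) = $C0 then Len := 2
    else if (B and $F0) = $E0 then Len := 3
    else if (B and $F8) = $F0 then Len := 4
    else Exit(I);            // stray continuation byte or invalid lead byte
    if I + Len - 1 > Length(S) then
      Exit(I);               // sequence runs past the end of the string
    for K := 1 to Len - 1 do
      if (Ord(S[I + K]) and $C0) <> $80 then
        Exit(I);             // expected a continuation byte (10xxxxxx)
    Inc(I, Len);
  end;
  Result := 0;
end;

begin
  // "Beauséjour" with "é" as the UTF-8 bytes C3 A9
  WriteLn(FirstInvalidUtf8Byte('Beaus'#$C3#$A9'jour'));  // prints 0
  // The same word with "é" as the single Windows-1252 byte E9
  WriteLn(FirstInvalidUtf8Byte('Beaus'#$E9'jour'));      // prints 6
end.
```

Running such a check over each line of the input file would show whether the bytes reaching the converter are UTF-8 at all. A nonzero result on the accented names would point at the file encoding rather than at the conversion table.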


 
