
Author Topic: Character Conversions  (Read 4227 times)

JLWest

  • Hero Member
  • *****
  • Posts: 595
Character Conversions
« on: September 19, 2019, 08:13:55 pm »
I am trying to convert non-ASCII words to ASCII words, without success. Most of the words are UTF-8, but some are Greek, Arabic and who knows what. There are about 11 to 15 million words. Some are already ASCII.

I don't expect anyone to write the code; just give me an idea of where to start.

I have read quite a bit on this but I don't understand how to implement it in code.


Code: Pascal
function TForm1.ToASCII(AString : String) : String;
var
  AWord : AnsiString;
begin
  AWord := AString;

  // What has to go here to achieve this?

  Result := AWord;
end;
FPC 3.0.4, Lazarus IDE v2.0.6
 Windows 10 Pro
 Intel i7 770K CPU 4.2GHz 32702MB Ram
GeForce GTX 1080 Graphics - 8 Gig
4.1 TB

Handoko

  • Hero Member
  • *****
  • Posts: 3186
  • My goal: build my own game engine using Lazarus
Re: Character Conversions
« Reply #1 on: September 19, 2019, 08:30:20 pm »
I think I can understand what you said. But can you please provide examples what are the inputs and the outputs. So I can work based on your examples.

winni

  • Sr. Member
  • ****
  • Posts: 432

JLWest

  • Hero Member
  • *****
  • Posts: 595
Re: Character Conversions
« Reply #3 on: September 19, 2019, 09:00:18 pm »

Input:                          Output:
Les Bruyères                    Les Bruyeres
Centre Médical Héliporté        Centre Medical Heliporte
Vésale Heliport
Saïss Airport
Fès-Boulemane
Léopold
Kédougou
Cesária
Évora
São
Ploče
Otočac
Čakovec
Almería
León
León
Logroño-Agoncillo
Suárez
Compiègne
Tréport
Périgueux
Targé
Châtellerault
Épernay
Pápa
Pécs-Pogány
Győr-Pér
Pér


Hope this answers your question. The column on the left is the input in who-knows-what encoding, and the column on the right is the ASCII output I want.
Code: Text
Les Bruyères
Centre Médical Héliporté
Vésale Heliport
Saïss Airport
Fès-Boulemane
Léopold
Kédougou
Cesária
Évora
São
Ploče
Otočac
Čakovec
Almería
León
León
Logroño-Agoncillo
Suárez
Compiègne
Tréport
Périgueux
Targé
Châtellerault
Épernay
Pápa
Pécs-Pogány
Győr-Pér
Pér

howardpc

  • Hero Member
  • *****
  • Posts: 3177
Re: Character Conversions
« Reply #4 on: September 19, 2019, 11:05:25 pm »
Try the following:
Code: Pascal
  1. program Project1;
  2.  
  3. {$mode objfpc}{$H+}{$IfDef windows}
  4. {$AppType console}
  5. {$EndIf}
  6.  
  7. uses
  8.    iconvenc, Types, LazUTF8;
  9.  
  10. function ConvertToAscii(aUTF8Text: String): String;
  11. var
  12.   s, tmp: String;
  13.   p: PChar;
  14.   pEnd: PChar;
  15.   i, j: Integer;
  16. begin
  17.   Result := '';
  18.   p := PChar(aUTF8Text);
  19.   pEnd := p;
  20.   Inc(pEnd, Length(aUTF8Text));
  21.   repeat
  22.     i := UTF8CodepointSize(p);
  23.     case i of
  24.       1: Result += p^;
  25.       else
  26.         begin
  27.           SetLength(s, i);
  28.           for j := 1 to i do
  29.             begin
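  30.               if j > 1 then Inc(p);  // step one byte; Inc(p, j-1) overshoots when i > 2
  31.               s[j] := p^;
  32.             end;
  33.           Iconvert(s, tmp, 'UTF-8', 'ASCII//TRANSLIT');
  34.           if tmp <> '' then Result += tmp[1];  // avoid indexing an empty result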
  35.         end;
  36.     end;
  37.     Inc(p);
  38.   until p >= pEnd;
  39. end;
  40.  
  41. var
  42.   strs: TStringDynArray;
  43.   s: String;
  44.  
  45. begin
  46.   strs := TStringDynArray.Create('Les Bruyères', 'Centre Médical Héliporté',
  47.                                  'Vésale Heliport', 'Saïss Airport',
  48.                                  'Fès-Boulemane', 'Léopold', 'Kédougou',
  49.                                  'Cesária', 'Évora', 'São', 'Ploče', 'Otočac',
  50.                                  'Čakovec', 'Almería', 'León', 'León',
  51.                                  'Logroño-Agoncillo', 'Suárez', 'Compiègne',
  52.                                  'Tréport', 'Périgueux', 'Targé', 'Châtellerault',
  53.                                  'Épernay', 'Pápa', 'Pécs-Pogány', 'Győr-Pér', 'Pér');
  54.   for s in strs do
  55.     WriteLn(ConvertToAscii(s));
  56.   Readln;
  57. end.

It gives the following output on Linux (not tested on Windows):
Code: Text
Les Bruyeres
Centre Medical Heliporte
Vesale Heliport
Saiss Airport
Fes-Boulemane
Leopold
Kedougou
Cesaria
Evora
Sao
Ploce
Otocac
Cakovec
Almeria
Leon
Leon
Logrono-Agoncillo
Suarez
Compiegne
Treport
Perigueux
Targe
Chatellerault
Epernay
Papa
Pecs-Pogany
Gyor-Per
Per
« Last Edit: September 19, 2019, 11:09:43 pm by howardpc »

Birger52

  • Full Member
  • ***
  • Posts: 112
Re: Character Conversions
« Reply #5 on: September 20, 2019, 12:12:58 am »
Why?

ASCII only has 128 characters (codes 0-127).
http://www.asciitable.com/
So what you want is not possible.

Some extended "ASCII" character sets include some "foreign" characters, but they are not strict ASCII (8 bits in contrast to ASCII's 7, with the extra character numbers 128-255), and you will need to interpret the result according to which set has the character(s) you want.
(Code pages - https://en.wikipedia.org/wiki/Code_page or https://docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers)
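
To make the code page point concrete, here is a minimal sketch that converts one UTF-8 word to Windows-1252 and prints the resulting bytes. It assumes the LConvEncoding unit from Lazarus' LazUtils package is available to the project; the program name and the choice of Windows-1252 are only for illustration.

```pascal
program CodePageSketch;

{$mode objfpc}{$H+}

uses
  SysUtils,
  LConvEncoding; // assumption: LazUtils is on the unit search path

var
  utf8: String;
  cp1252: RawByteString;
  i: Integer;
begin
  // "Pér" written as raw UTF-8 bytes: 50 C3 A9 72
  utf8 := 'P'#$C3#$A9'r';

  // Convert to the single-byte Windows-1252 code page.
  cp1252 := UTF8ToCP1252(utf8);

  // Print each byte in hex to show what the target code page stores.
  for i := 1 to Length(cp1252) do
    Write(IntToHex(Ord(cp1252[i]), 2), ' ');
  WriteLn;
end.
```

In Windows-1252 the letter é is the single byte E9, so the two-byte UTF-8 sequence C3 A9 should shrink to one byte. The same byte means something else in other code pages, which is why the result has to be interpreted against a known set.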

JLWest

  • Hero Member
  • *****
  • Posts: 595
Re: Character Conversions
« Reply #6 on: September 20, 2019, 12:59:28 am »


Ok
Let's say up to 255, the extended character set.

Why is a long story.

I'm trying to work with some very large (11 million lines) and old (20 years) text files. They were submitted by users from all over the world with all kinds of character sets. They don't display right or sort right, and if statements don't always work.

If Parm = 'X' then do-something; This statement didn't work in a program because Parm displays as an 'X' but is a character from a different character set. I can edit the file and make it work, but that's not a solution for 11 million lines.

Have to convert to something.

The Parm value was read in from file. 
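
One way to see why such a comparison fails is to print the bytes instead of the glyph. The sketch below is only an illustration: HexDump is a hypothetical helper, and the Cyrillic letter is an assumed example of a look-alike, not necessarily the character in these files.

```pascal
program DumpBytes;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: returns the bytes of s as space-separated hex pairs.
function HexDump(const s: String): String;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(s[i]), 2);
  end;
end;

begin
  // A plain ASCII 'X' is the single byte 58.
  WriteLn(HexDump('X'));
  // U+0425 (Cyrillic capital Ha) looks like X; in UTF-8 it is the two bytes D0 A5.
  WriteLn(HexDump(#$D0#$A5));
end.
```

This prints 58 for the ASCII letter and D0 A5 for the look-alike, so a test against 'X' can never match the second string even though both display the same.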

winni

  • Sr. Member
  • ****
  • Posts: 432
Re: Character Conversions
« Reply #7 on: September 20, 2019, 01:09:51 am »
@Birger52

Yes, that's all true.

But the question was to do the "impossible".

There are situations where pure 7-bit ASCII is needed, and nothing else, most often because of old software. In Germany some banks still use software today that is not able to handle äöüÄÖÜ! A customer called "Müller" is always printed as "M ller"!

So for Europe the question is how to get rid of all those little marks above, and sometimes below, the characters, so that the text stays readable instead of having the character replaced with a gap.

For this reason there are geo databases around with two (or more) fields for the name of a city. Field1 is the local name in UTF-8. Field2 is the name in ASCII.

Hope this clarifies the situation.

Winni

« Last Edit: September 20, 2019, 01:11:28 am by winni »

dbannon

  • Hero Member
  • *****
  • Posts: 754
    • tomboy-ng, a rewrite of the classic Tomboy
Re: Character Conversions
« Reply #8 on: September 20, 2019, 03:49:14 am »
A customer called "Müller" is always printed as "M ller"!

Indeed. Now, you could, for example, replace Müller with Muller by having a lookup table that replaces Unicode characters with an acceptable approximation. But if I were Mr Müller I think I'd be even more upset: leaving a space sort of acknowledges it's wrong, while using a "u" is renaming that person. Perhaps that's what Iconvert() does?

And there are very, very many Unicode characters that don't have a reasonable approximation at all. So you end up with text that contains, e.g., a "?" or a space, as winni mentions. Sort of readable; it would be sortable and consistent. But ugly.

Spotting the UTF-8 characters is easy, see https://wiki.freepascal.org/UTF8_strings_and_characters - it's a policy decision what to do with them, not a coding one.
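
As a rough sketch of the spotting step: in UTF-8 every byte of a multi-byte character has its high bit set, so any byte of $80 or above marks non-ASCII content. FirstNonAsciiPos below is a hypothetical helper name; words for which it returns 0 are already ASCII and can be skipped.

```pascal
program SpotNonAscii;

{$mode objfpc}{$H+}

// Hypothetical helper: 1-based index of the first byte >= $80,
// or 0 when the whole string is plain 7-bit ASCII.
function FirstNonAsciiPos(const s: String): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if Ord(s[i]) >= $80 then
      Exit(i);
end;

begin
  // Pure ASCII word: nothing to convert.
  WriteLn(FirstNonAsciiPos('Heliport'));
  // "Léopold" as UTF-8 bytes; the first non-ASCII byte is at position 2.
  WriteLn(FirstNonAsciiPos('L'#$C3#$A9'opold'));
end.
```

The program prints 0 and then 2. What to do with the bytes it finds is the policy decision.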

Davo
Lazarus 2, Linux (and reluctantly Win10, OSX)
My Project - https://github.com/tomboy-notes/tomboy-ng

howardpc

  • Hero Member
  • *****
  • Posts: 3177
Re: Character Conversions
« Reply #9 on: September 20, 2019, 05:09:19 am »
Note that the code I cobbled together is for valid utf8 text. Visual inspection of the small data sample you provided showed it was OK.
For unseen data from unknown sources (such as processing text served from an online database) a more robust, but slower, routine would need to insert
Code: Pascal
UTF8FixBroken(aUTF8Text);

as the first line.
Of course output from invalid utf8 text is at best ?? and at worst simply garbage. However, the routine should not crash if fed unsuitable data.
« Last Edit: September 20, 2019, 05:11:58 am by howardpc »

JLWest

  • Hero Member
  • *****
  • Posts: 595
Re: Character Conversions
« Reply #10 on: September 20, 2019, 07:03:31 am »

I can't get the code to compile.

It gives me an error on line 33.
  Iconvert(s, tmp, 'UTF-8', 'ASCII//TRANSLIT');  <-- can't find this

It appears to be a Unix thing.

Any ideas?

howardpc

  • Hero Member
  • *****
  • Posts: 3177
Re: Character Conversions
« Reply #11 on: September 20, 2019, 10:09:13 am »
For Windows you could try the open source GnuWin32 library, which provides dlls which include iconv as on Linux.
I have not tried this myself, but see here
There may be some other built-in Windows solution that I am not aware of. These days I only use Windows if I am forced to.
However, the majority of forum users are Windows users (well, the majority of Lazarus/fpc downloaders are Windows users which would tend to indicate a similar ratio for the forum), and so others may offer simpler solutions.
« Last Edit: September 20, 2019, 10:14:43 am by howardpc »

Birger52

  • Full Member
  • ***
  • Posts: 112
Re: Character Conversions
« Reply #12 on: September 20, 2019, 11:26:46 am »
I still don't get it.

You cannot convert UTF-8 to ASCII, other than the first 128 characters.
What you want is to represent the non-ASCII characters with some representation of the character that can be done with ASCII characters, making them readable.
For example, å in Danish can be represented by aa.

It seems the way to go, then, would be a table lookup for the individual characters.
There are 1,112,064 possible UTF-8 characters, and you would probably have to create the table yourself.
Still only one tenth of having to correct all the lines manually...
;)
Could maybe be simplified by reading the string as bytes...
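
A minimal sketch of such a table lookup, reading the string as bytes and using only plain Free Pascal (no iconv, so it should not depend on the operating system). Approximate and ToAsciiApprox are hypothetical names, the table covers only a few Latin letters from the sample list, and the decoder assumes the input is valid UTF-8.

```pascal
program TranslitSketch;

{$mode objfpc}{$H+}

// Hypothetical lookup table: maps a handful of code points to an ASCII
// approximation. Anything not listed becomes '?'; extend it as the data requires.
function Approximate(cp: Cardinal): String;
begin
  case cp of
    $C0..$C5: Result := 'A';
    $C7, $010C: Result := 'C';
    $C8..$CB: Result := 'E';
    $CC..$CF: Result := 'I';
    $D1: Result := 'N';
    $D2..$D6, $0150: Result := 'O';
    $D9..$DC: Result := 'U';
    $E0..$E5: Result := 'a';
    $E7, $010D: Result := 'c';
    $E8..$EB: Result := 'e';
    $EC..$EF: Result := 'i';
    $F1: Result := 'n';
    $F2..$F6, $0151: Result := 'o';
    $F9..$FC: Result := 'u';
  else
    Result := '?';
  end;
end;

// Number of bytes in the UTF-8 sequence that starts with lead byte b.
function SeqLen(b: Byte): Integer;
begin
  if (b and $E0) = $C0 then Result := 2
  else if (b and $F0) = $E0 then Result := 3
  else if (b and $F8) = $F0 then Result := 4
  else Result := 1; // ASCII, or a stray byte handled by the caller
end;

// Walks the string byte by byte, copies ASCII through unchanged and
// replaces every multi-byte code point with its table entry.
function ToAsciiApprox(const s: String): String;
var
  i, n, k: Integer;
  cp: Cardinal;
begin
  Result := '';
  i := 1;
  while i <= Length(s) do
  begin
    n := SeqLen(Ord(s[i]));
    if i + n - 1 > Length(s) then
      n := 1; // truncated sequence at the end of the string
    if n = 1 then
    begin
      if Ord(s[i]) < $80 then
        Result := Result + s[i]
      else
        Result := Result + '?'; // stray or invalid byte
    end
    else
    begin
      // Payload bits of the lead byte, then 6 bits per continuation byte.
      case n of
        2: cp := Ord(s[i]) and $1F;
        3: cp := Ord(s[i]) and $0F;
      else
        cp := Ord(s[i]) and $07;
      end;
      for k := 1 to n - 1 do
        cp := (cp shl 6) or (Ord(s[i + k]) and $3F);
      Result := Result + Approximate(cp);
    end;
    Inc(i, n);
  end;
end;

begin
  WriteLn(ToAsciiApprox('Le'#$C3#$B3'n'));              // input bytes spell León
  WriteLn(ToAsciiApprox('Gy'#$C5#$91'r-P'#$C3#$A9'r')); // input bytes spell Győr-Pér
end.
```

This prints Leon and Gyor-Per. The table does not need an entry for every possible character: it only has to cover the characters that actually occur in the files, and the '?' fallback shows which ones are still missing.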




howardpc

  • Hero Member
  • *****
  • Posts: 3177
Re: Character Conversions
« Reply #13 on: September 20, 2019, 12:32:53 pm »
You cannot "convert" the first 128 UTF-8 codepoints to ASCII.

They are already ASCII.

What JLWest is after is a "conversion" of other utf8 codepoints to degrade them to look as close as possible to a single existing ASCII character.
Obviously it is not possible to do this with most of the unicode range. That is a given.

So in this limited exercise, the Danish å would become a (not aa). The emoji %) would be omitted (or simply produce garbage).

But that does not matter. It was never intended that this exercise should include emojis or countless other codepoints that lack an obvious ASCII "equivalent".


bytebites

  • Full Member
  • ***
  • Posts: 213
Re: Character Conversions
« Reply #14 on: September 20, 2019, 03:36:00 pm »