
Author Topic: ASCII Char set Questions  (Read 727 times)

JLWest

  • Hero Member
  • *****
  • Posts: 573
ASCII Char set Questions
« on: May 17, 2019, 05:46:10 am »
I need the algorithm for converting multi characters to ASCII.

I still have problems with data from the Apt.Dat file from X-Plane 11.

As it turns out, it contains all kinds of mixed character sets.

In the early days of X-Plane there were only computer-generated airports: an asphalt strip of more or less the correct length, with the center line aligned correctly. X-Plane realized they would never go anywhere without airports like the competition's.

So they designed a program called WED. WED designs 3-D airports and produces the Apt.Dat file for the airport, an XML file and another file. All of this can be submitted to the X-Plane 11 gateway. They accept all submissions and choose the best, which becomes part of the next release.

Problem: The airport names are converted to ASCII, but not the City or Country. They are in whatever character set the airport was submitted in: Greek, German, English, whatever.

Therefore, I can't get it to collate right and can't do validations.

So at a minimum I need to convert those two data fields (items) to ASCII.

What is the algorithm to do that?

I'm willing to do the research and try to write the code.

I think it's maybe:

1.  Determine the code page (how, I don't know).
2.  Compare character by character, swapping out characters to ASCII as needed.
 
Just need an idea of the algorithm.

Thanks


 

JLWEST
Lazarus ver 2.0.2
 FPC 3.0.4, Lazarus IDE v1.8.2 Windows 10 Pro
 Intel i7 770K CPU 4.2GHz 32702MB Ram
GeForce GTX 1080 Graphics - 8 Gig
3952 GB (1.5 SSD)

engkin

  • Hero Member
  • *****
  • Posts: 2513
Re: ASCII Char set Questions
« Reply #1 on: May 17, 2019, 06:10:59 am »
I need the algorithm for converting multi characters to ASCII.

Use SetCodePage:
Code: Pascal
program Project1;

{$mode objfpc}{$H+}

var
  s: String;

begin
  s := 'abc';
  SetCodePage(RawByteString(s), CP_ASCII, True);
end.

Edit:
Try reading this.
« Last Edit: May 17, 2019, 06:16:34 am by engkin »

lucamar

  • Hero Member
  • *****
  • Posts: 2020
Re: ASCII Char set Questions
« Reply #2 on: May 17, 2019, 06:26:53 am »
What you want is almost impossible. A string in a Windows codepage is identical to a string in any other codepage, in the sense that they both may contain characters from #128 to #255 that have different meanings.

You would have to invent an algorithm to guess the encoding: for example, traversing each string trying to find "forbidden" combinations (e.g. an "ñ" cannot be between consonants, so the string isn't in CP-1252), and even that wouldn't be reliable: your strings are too short.

One alternative (that I've never used, so I don't know how well it works) is to use GuessEncoding() with ConvertEncodingToUTF8() (both in unit LConvEncoding of the LazUtils package), but that would convert the strings to UTF8, not ASCII.
But if you do that, let us know how it worked! :)
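
For reference, a minimal sketch of that combination. It assumes the LazUtils package is listed in the project's dependencies; the helper name ToUTF8Guessing is made up for illustration, and GuessEncoding() is only a heuristic, so short strings may well be misidentified.

```pascal
program GuessDemo;

{$mode objfpc}{$H+}

uses
  LConvEncoding;  // part of the LazUtils package

// Hypothetical helper: guess the encoding of Raw, then convert it to UTF8.
function ToUTF8Guessing(const Raw: String): String;
var
  Enc: String;
  Encoded: Boolean;
begin
  Enc := GuessEncoding(Raw);  // returns an encoding name, e.g. 'utf8'
  Result := ConvertEncodingToUTF8(Raw, Enc, Encoded);
end;

begin
  // #$E9 is "é" in CP-1252 and ISO-8859-1. What comes out depends on
  // which encoding GuessEncoding settles on for this input.
  WriteLn(ToUTF8Guessing('R'#$E9'union'));
end.
```

Plain ASCII input passes through unchanged whatever encoding is guessed, so the interesting cases are only the strings with bytes above #127.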

BTW -- In case you wanted to know -- transliteration to Latin characters is usually done by approximating the sound of each syllable or word. That's how we got words like "otaku", "manga", "kung-fu", etc. :)
« Last Edit: May 17, 2019, 06:34:35 am by lucamar »
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus 2.0.2/2.0.4  - FPC 3.0.4 on:
(K|L)Ubuntu 12..16, Windows XP SP3, various DOSes.

engkin

Re: ASCII Char set Questions
« Reply #3 on: May 17, 2019, 06:37:00 am »
Problem: The Airport names are converted to ASCII, but not the the City or Country. They are in what ever character set the airport was was submitted in. Greek, German, English, Whatever.

They did not convert the City and Country because ASCII can represent two languages (English and some other language) only. You can not use ASCII when you deal with more than two languages. ASCII can not be used to write a sentence that has English, Greek, and Hebrew, for instance. Not even Greek and Hebrew alone.  :(

engkin

Re: ASCII Char set Questions
« Reply #4 on: May 17, 2019, 06:49:09 am »
Therefore, I can't get it to collate right and can't do validations.

Any small example?

lucamar

Re: ASCII Char set Questions
« Reply #5 on: May 17, 2019, 07:06:51 am »
You can not use ASCII when you deal with more than two languages. ASCII can not be used to write a sentence that has English, Greek, and Hebrew, for instance. Not even Greek and Hebrew alone.  :(

Yes, it can if you transliterate the words/sounds. That's (mostly) how you get Athens, Jerusalem, Moscow, Peking, etc.

Whether that's what JLWest wants is another question ...

Note also that he's not dealing with full sentences but just a few foreign names which probably have already a common English designation.

The real problem is guessing, from a random encoding, which name it corresponds to, isn't it, JL?
« Last Edit: May 17, 2019, 07:11:59 am by lucamar »

engkin

Re: ASCII Char set Questions
« Reply #6 on: May 17, 2019, 07:12:56 am »
You are not helping.  :(

Transliterating Greek or Hebrew is not equal to using Greek or Hebrew. He needs to understand the limitation of ASCII (and its codepages) to accept Unicode (UTF8/16/32).

Edit:
Transliteration by itself creates a problem. Some (many?) sounds do not exist in English and can not be transliterated using A..Z.
« Last Edit: May 17, 2019, 07:15:50 am by engkin »

lucamar

Re: ASCII Char set Questions
« Reply #7 on: May 17, 2019, 07:32:57 am »
You are not helping.  :(

Transliterating Greek or Hebrew is not equal to using Greek or Hebrew. He needs to understand the limitation of ASCII (and its codepages) to accept Unicode (UTF8/16/32).

I'm trying to help.

Look, as I read it the problem is that, for example, some kind soul added an airport in "Москва" (encoded in whatever code page). Once collated, that should be close to the one in, say, "Madrid", but because of the encoding it ends up gods know where.

One solution? Convert it to "Moscow". Or leave it as is. But to do that you have to know which encoding the string has. If it were Unicode there wouldn't be any problem, so they are not Unicode. How do you guess whether it is KOI-8 or the Cyrillic code page, or not even Russian to begin with?

JLWest

Re: ASCII Char set Questions
« Reply #8 on: May 17, 2019, 07:36:31 am »
I need the algorithm for converting multi characters to ASCII.

Use SetCodePage:
Code: Pascal
program Project1;

{$mode objfpc}{$H+}

var
  s: String;

begin
  s := 'abc';
  SetCodePage(RawByteString(s), CP_ASCII, True);
end.

Edit:
Try reading this.

That readme you suggested was informative. Didn't understand it all 'but' I'm closer.
I'll go thru it two or three more times.

I'm still wondering why I couldn't just build a lookup routine.

Take a string:   AString := BString[...];

                   // determine if AString is in 'A'..'Z' or 'a'..'z' (if so, good)
                   // if there is no match:
                  AZString = array of String;

               ZString : AZString;

               ZString[1] := '¢';
               ZString[2] := 'í';

        // look AString up in the ZString array

I must be missing something.
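
The lookup idea above can be sketched roughly as follows, assuming the input is already UTF8. The table is a tiny illustrative sample, not a complete mapping, and the names FoldFrom, FoldTo and FoldToAscii are hypothetical. The accented characters are written as byte escapes so the example does not depend on the encoding of the source file.

```pascal
program FoldDemo;

{$mode objfpc}{$H+}

uses
  SysUtils;

const
  // UTF8 byte sequences for é, í, ñ, ü and the ASCII stand-in for each.
  FoldFrom: array[0..3] of String = (#$C3#$A9, #$C3#$AD, #$C3#$B1, #$C3#$BC);
  FoldTo:   array[0..3] of String = ('e', 'i', 'n', 'u');

// Replaces every table entry found in S with its ASCII stand-in.
// Characters that are not in the table are left untouched.
function FoldToAscii(const S: String): String;
var
  i: Integer;
begin
  Result := S;
  for i := Low(FoldFrom) to High(FoldFrom) do
    Result := StringReplace(Result, FoldFrom[i], FoldTo[i], [rfReplaceAll]);
end;

begin
  WriteLn(FoldToAscii('R'#$C3#$A9'union'));  // prints Reunion
end.
```

A table like this only works once the encoding of the input is known, which is why the code page question has to be settled first.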
 


 

JLWest

Re: ASCII Char set Questions
« Reply #9 on: May 17, 2019, 08:06:44 am »
Therefore, I can't get it to collate right and can't do validations.

Any small example?

Yea I can give an example.

I just corrected it by hand-editing a 7.9 million data file. If I do it by hand editing, there are 35,242 cities and 35,242 countries. Obviously there are a lot of repeats, but each one would have to be checked. And I'm two versions behind in X-Plane, and I want to be on the latest because it has 'True Earth Scenery' or 'Global Earth' available. Both are so real it's unbelievable.



One was Réunion. It shows in the listbox after loading from the text file as R??uion. It's the name of a country.

I need to validate it as Reunion, display it as Reunion or
validate it as Réunion and display it as Réunion but R??uion is unusable.

I suppose I could have a:

if x = 'R??uion' then x := 'Réunion'
else if ... 35,241 more cities to go.

Just how many if-then-else branches will FPC handle?

Maybe write a program to read the 7.9 million file and pull out the cities and countries.
They are all designated as:

1302 country United States
1302 city Los Angeles

1302 country  'Réunion'
1302 city Whatever

Then sort, remove duplicates, and somehow validate the array.
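
The extract, sort and de-duplicate step could look something like the sketch below. It relies on a sorted TStringList with Duplicates set to dupIgnore; the function name DistinctNames and the sample lines are made up for illustration, and a real run would feed it the lines read from apt.dat instead.

```pascal
program UniqueNames;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: returns the distinct values of "1302 city ..." and
// "1302 country ..." lines, sorted, one per line.
function DistinctNames(const Lines: array of String): String;
const
  Prefixes: array[0..1] of String = ('1302 city ', '1302 country ');
var
  Names: TStringList;
  Line: String;
  i, p: Integer;
begin
  Names := TStringList.Create;
  try
    Names.Sorted := True;           // keep the list ordered while adding
    Names.Duplicates := dupIgnore;  // silently skip repeats
    for i := Low(Lines) to High(Lines) do
    begin
      Line := Trim(Lines[i]);
      for p := Low(Prefixes) to High(Prefixes) do
        if Pos(Prefixes[p], Line) = 1 then
          Names.Add(Trim(Copy(Line, Length(Prefixes[p]) + 1, MaxInt)));
    end;
    Result := Names.Text;
  finally
    Names.Free;
  end;
end;

begin
  Write(DistinctNames(['1302 country United States',
                       '1302 city Los Angeles',
                       '1302 city Los Angeles',
                       '1302 datum_lat 33.94']));
end.
```

The resulting list is short enough to review by eye, which is far less work than checking every occurrence in the full file.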

 

JLWest

Re: ASCII Char set Questions
« Reply #10 on: May 17, 2019, 08:12:34 am »
@lucamar
That's it.

If it came from the US, no problem.
But Germany, Israel, Saudi Arabia, Greece, Colombia: 240 different countries, all with their own code page, character set and code points.

I'll try a small example with GuessEncoding() and maybe try to go to UTF8.

lucamar

Re: ASCII Char set Questions
« Reply #11 on: May 17, 2019, 09:01:49 am »
I'll try a small example with GuessEncoding() and maybe try to go to UTF8.

Don't bother. I just tested (which I should have done earlier) and there isn't a single string in APT.DAT that isn't UTF8, at least as far as GuessEncoding() can tell.

Your problem must be other than the file encoding.

One was Réunion. It shows in the listbox after loading from the text file as R??uion. It's the name of a country.

I checked those (two) lines: they are UTF8, as they should be.

How are you reading the file? That may be the problem. Reading a UTF8 file into a listbox shouldn't produce that result.

JLWest

Re: ASCII Char set Questions
« Reply #12 on: May 17, 2019, 07:53:50 pm »
I'll post the program on my GDrive along with the data file. But here is the procedure that reads the apt.dat file. I declared DataFile : TextFile; and read each line with Readln(DataFile, LineIn);
then I run it through a case statement to figure out whether I need to save the info to a listbox.

At the end I write the listbox to a file.

https://drive.google.com/open?id=1cT-hIDOrKvbUPVPy01a1X2Ui1OQLdRYG

https://drive.google.com/open?id=1xff9naSrqr0o7CV4O2NrgE9ZzobTeg8l


Code: Pascal
Procedure TForm1.BldListbox2WitAptDotDat(aFile : String);
  Var
   DataFile    : TextFile;
   Item        : String     = '';
   LineIn      : String     = '';
   TDI         : TSTYPE     = tNil;
   ID          : String[4]  = '';
   iHash       : Integer    = 0;
   Process     : Boolean    = False;
   ICAO        : String[8]  = '';
   ThisICAO    : String[8]  = 'Nil';
   SW100       : Boolean = False;
   Bit1        : String     = '';
   Bit2        : String     = '';
   FMT         : String     = '';
   Begin
    if Not FileExists(aFile) then begin exit; end;
    AssignFile(DataFile, aFile);
    try
     Reset(DataFile);
    while not eof(DataFile) do begin
      Readln(DataFile, LineIn);
      INC(iHash);
      LineIn := Trim(LineIn);
      ID := Copy2Space(LineIn);
      Process := CheckLine(LineIn);
      if Not Process then begin Continue; end;
      Bit1 := IntToStr(iHash);
      ICAO := ExtractWord(5,LineIn,[' ']);
      ICAO :=Trim(ICAO);

      Case ID of

       '1'   : begin
                TDI := tLand;
                SW100 := False;
                FMT := FormatRCD(ICAO, TDI, Bit1, LineIn);
                AddToListbox2(TDI,FMT);
                ThisICAO  := ICAO;
                Continue;
               end;

       '14'  : begin   {Tower Lat and Lon}
                TDI :=  T14RCDLat;
                Bit2 := ExtractWord(2,LineIn,[' ']);
                Bit1 := 'T14RCDLat ' + Bit2;
                FMT := FormatWorkLine(TDI, ThisICAO, Item, Bit1);
                AddToListbox2(TDI,FMT);
                TDI :=  T14RCDLon;
                Bit2 := ExtractWord(3,LineIn,[' ']);
                Bit1 := 'T14RCDLon ' + Bit2;
                FMT := FormatWorkLine(TDI, ThisICAO, Item, Bit1);
                AddToListbox2(TDI,FMT);
               end;

        '16'  : begin
                TDI := tSeaBase;
                ThisICAO  := ICAO;
                SW100 := False;
                FMT := FormatRCD(ICAO, TDI, Bit1, LineIn);
                AddToListbox2(TDI,FMT);
                Continue;
               end;

       '17'  : begin
                TDI := tHeliPort;
                SW100 := False;
                ThisICAO  := ICAO;
                FMT := FormatRCD(ICAO, TDI, Bit1, LineIn);
                AddToListbox2(TDI,FMT);
                Continue;
               end;

      '100'  : begin   {Runway Center Line}
                if SW100 then begin Continue; end;
                SW100 := True;
                Item := ExtractWord(10,LineIn,[' ']);
                TDI := t100RCDLat;
                Bit1 := 'T100RCDLat ' + Item;
                FMT := FormatWorkLine(TDI, ThisICAO, LineIn, Bit1);
                AddToListbox2(TDI,FMT);
                TDI := t100RCDLon;
                Item := ExtractWord(11,LineIn,[' ']);
                Bit1 := 'T100RCDLon ' + Item;
                FMT := FormatWorkLine(TDI, ThisICAO, LineIn, Bit1);
                AddToListbox2(TDI,FMT);
                end;

      '101'  : begin   {Waterway Center Line for Sea Planes}
                Item := ExtractWord(5,LineIn,[' ']);
                TDI := t101RCDLat;
                Bit1 := 'T101RCDLat ' + Item;
                FMT := FormatWorkLine(TDI, ThisICAO, LineIn, Bit1);
                AddToListbox2(TDI,FMT);
                TDI := t101RCDLon;
                Item := ExtractWord(6,LineIn,[' ']);
                Bit1 := 'T101RCDLon ' + Item;
                FMT := FormatWorkLine(TDI, ThisICAO, LineIn, Bit1);
                AddToListbox2(TDI,FMT);
                end;


      '102'  : begin
                TDI :=  tNil;
                Item := ExtractWord(3,LineIn,[' ']);
                TDI := T102RCDLat;
                Bit1 := 'T102RCDLat ' + Item;
                FMT := FormatWorkLine(TDI, ThisICAO, LineIn, Bit1);
                AddToListbox2(TDI,FMT);
                TDI := T102RCDLon;
                Item := ExtractWord(4,LineIn,[' ']);
                Bit1 := 'T102RCDLon ' + Item;
                FMT := FormatWorkLine(TDI, ThisICAO, LineIn, Bit1);
                AddToListbox2(TDI,FMT);
                end;

      '1302'  : begin
                TDI := tNil;
                TDI := GetTSType(LineIn);
                  Case TDI of
                   tRegion  : begin
                               Item := GetItem(TDI,LineIn);
                               if Item.IsEmpty then begin Continue; end;
                                  FMT := FormatWorkLine(TDI, ThisICAO, LineIn, '');
                                  AddToListbox2(TDI,FMT);
                              end;

                   tCity    : begin
                               Item := GetItem(TDI,LineIn);
                               if Item.IsEmpty then begin Continue; end;
                                  FMT := FormatWorkLine(TDI, ThisICAO, LineIn, '');
                                  AddToListbox2(TDI,FMT);
                              end;

                   tCountry : begin
                               Item := GetItem(TDI,LineIn);
                               if Item.IsEmpty then begin Continue; end;
                                  FMT := FormatWorkLine(TDI, ThisICAO, LineIn, '');
                                  AddToListbox2(TDI,FMT);
                              end;

                   tLat     : begin
                               Item := GetItem(TDI,LineIn);
                               if Item.IsEmpty then begin Continue; end;
                                  FMT := FormatWorkLine(TDI, ThisICAO, LineIn, '');
                                  AddToListbox2(TDI,FMT);
                              end;

                   tLon     : begin
                               Item := GetItem(TDI,LineIn);
                               if Item.IsEmpty then begin Continue; end;
                                  FMT := FormatWorkLine(TDI, ThisICAO, LineIn,'');
                                  AddToListbox2(TDI,FMT);
                              end;
                     tNil     : ;
                 end;
                end;{end 1302}
        end; {end of Case}

       Edit1.Text := IntToStr(iHash);
       Edit1.Text := Format('%.0n',[StrToFloat(Edit1.Text)]);
       Application.ProcessMessages;

    end;{end While Loop}
     CloseFile(DataFile);
    except  on E: EInOutError do  begin  ShowMessage('Error with: ' + aFile); end;
   end;
  end;

JLWest

Re: ASCII Char set Questions
« Reply #13 on: May 17, 2019, 08:17:16 pm »
Might have something working.

I have a program to generate test data. It just reads the apt.dat file loads some records into a listbox and then dumps the listbox to a file.

I modified it to get only the records I'm interested in.

I ran the program and I'm going thru the records now. So far there are no funny characters.
But I have a lot to go thru by hand before I know for sure.

Maybe?

 

lucamar

Re: ASCII Char set Questions
« Reply #14 on: May 17, 2019, 09:19:09 pm »
I'm going to make some tests on Windows.

It might be that it's something OS-specific; until now I've been testing on Linux, which uses UTF8 for almost everything.

Back in a jiffy!

ETA: OK, tested on Windows XP and it behaves exactly like in Linux: the strings are read as UTF8 so there should be no problems loading them to any Lazarus control.

I don't see where your problem is coming from. I'll have to investigate a little more ... when I have time.

For the record, this is the latest code I used for the test:
Code: Pascal
procedure TTestApp.DoRun;
var
  AFile: TextFile;
  ALine, Coding: String;
  Count: Int64;
  Encoding: TSystemCodePage;
begin
  Count := 0;
  AssignFile(AFile, 'apt.dat');
  try
    Reset(AFile);
    Encoding := GetTextCodePage(AFile);
    WriteLn('File CP: ', Encoding, ' = ', CodePageToCodePageName(Encoding));
    try
      while not EOF(AFile) do begin
        ReadLn(AFile, ALine);
        Inc(Count);
        //Coding := GuessEncoding(ALine);
        // StringCodePage reports the code page recorded for the string;
        // it does not inspect the bytes themselves.
        Encoding := StringCodePage(ALine);
        Coding := CodePageToCodePageName(Encoding);
        // Report only the lines that are not tagged as UTF8.
        if {Coding <> 'utf8'} Encoding <> CP_UTF8  then begin
          if Coding.IsEmpty then
            Write(Count, ': Unknown - ')
          else
            Write(Count, ': ', Coding, ' - ');
          Writeln('"', ALine, '"');
        end;
      end;
    finally
      CloseFile(AFile);
    end;
  except
    on e: Exception do WriteLn(e.ToString);
  end;
  Terminate;
end;
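
As a cross-check that looks at the bytes rather than at the code page recorded for the string, a simplified validity test along these lines may help. It is only a sketch: the name IsValidUTF8 is hypothetical, and the check covers lead and continuation bytes but does not reject every overlong or surrogate sequence.

```pascal
program Utf8Check;

{$mode objfpc}{$H+}

// Returns True when S is structurally well-formed UTF8 (simplified check).
function IsValidUTF8(const S: String): Boolean;
var
  i, n, Extra: Integer;
  b: Byte;
begin
  Result := False;
  n := Length(S);
  i := 1;
  while i <= n do
  begin
    b := Ord(S[i]);
    // Decide how many continuation bytes must follow this lead byte.
    if b < $80 then Extra := 0
    else if (b >= $C2) and (b <= $DF) then Extra := 1
    else if (b >= $E0) and (b <= $EF) then Extra := 2
    else if (b >= $F0) and (b <= $F4) then Extra := 3
    else Exit;  // stray continuation byte or invalid lead byte
    Inc(i);
    while Extra > 0 do
    begin
      // Every continuation byte must look like 10xxxxxx.
      if (i > n) or ((Ord(S[i]) and $C0) <> $80) then Exit;
      Inc(i);
      Dec(Extra);
    end;
  end;
  Result := True;
end;

begin
  WriteLn(IsValidUTF8('R'#$C3#$A9'union'));  // UTF8 "é": prints TRUE
  WriteLn(IsValidUTF8('R'#$E9'union'));      // CP-1252 "é": prints FALSE
end.
```

Running each line of the file through a check like this would show whether any line really contains bytes from a single-byte code page.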
« Last Edit: May 17, 2019, 10:20:08 pm by lucamar »