Author Topic: Character Conversions  (Read 10538 times)

bytebites

  • Hero Member
  • *****
  • Posts: 632
Re: Character Conversions
« Reply #30 on: September 22, 2019, 10:12:57 am »
Code: Pascal
function toascii(s: string): string;
type
  // Code page 20127 is 7-bit US-ASCII.
  USASCIIString = type ansistring(20127);
begin
  // The typecast converts s to US-ASCII.
  Result := USASCIIString(s);
end;

from stackoverflow

munair

  • Hero Member
  • *****
  • Posts: 798
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: Character Conversions
« Reply #31 on: September 22, 2019, 11:26:43 am »
Unicode systems are generally utf16, but that is not really the problem.

That would be Windows only, then. Unix-based OSes and networks (the internet) primarily use UTF-8. Quote from Wikipedia:
Quote
UTF-16 is used internally by systems such as Windows, Java and JavaScript. It is also often used for plain text and for word-processing data files on Windows. It is rarely used for files on Unix/Linux or macOS. It never gained popularity on the web, where UTF-8 is dominant (and considered "the mandatory encoding for all [text]" by WHATWG[2]). UTF-16 is used by under 0.01% of web pages themselves.

That said, I wonder how much money and effort have been put into software development to support Unicode correctly. Problems already start with lower and upper case conversion, especially for non-Latin languages. The definition of graphemes is sometimes vague and can even lead to controversies:
Quote
Unicode, in intent, encodes the underlying characters—graphemes and grapheme-like units—rather than the variant glyphs (renderings) for such characters. In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs

The Unicode system has become so complex that the consortium's basic support library takes up more than 60 MB (sounds like a tiny OS in its own right). The only languages that are still guaranteed to render and convert correctly are those whose writing systems are covered by basic ASCII, such as Dutch and English. These are also the most efficient, as they use 1 byte per character in UTF-8, which is not a trivial advantage for network communications.
keep it simple

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Character Conversions
« Reply #32 on: September 22, 2019, 07:15:54 pm »
@JLWest

No, you are doing nothing wrong. I made a mistake in the replace function. It was too late last night ...

Attached is the corrected version of utf8toAscii.

Winni

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11382
  • FPC developer.
Re: Character Conversions
« Reply #33 on: September 22, 2019, 07:27:22 pm »
Unicode systems are generally utf16, but that is not really the problem.

That would be Windows only, then. Unix-based OSes and networks (the internet) primarily use UTF-8. Quote from Wikipedia:
Quote
UTF-16 is used internally by systems such as Windows, Java and JavaScript. It is also often used for plain text and for word-processing data files on Windows. It is rarely used for files on Unix/Linux or macOS. It never gained popularity on the web, where UTF-8 is dominant (and considered "the mandatory encoding for all [text]" by WHATWG[2]). UTF-16 is used by under 0.01% of web pages themselves.

Java, Mono and Qt also exist on *nix, and are afaik primarily UTF-16. Document encoding is something totally different from API encoding, so it is less relevant here.

Quote
That said, I wonder how much money and effort have been put into software development to support Unicode correctly. Problems already start with lower and upper case conversion, especially for non-Latin languages. The definition of graphemes is sometimes vague and can even lead to controversies:
Quote
Unicode, in intent, encodes the underlying characters—graphemes and grapheme-like units—rather than the variant glyphs (renderings) for such characters. In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs

The Unicode system has become so complex that the consortium's basic support library takes up more than 60 MB (sounds like a tiny OS in its own right). The only languages that are still guaranteed to render and convert correctly are those whose writing systems are covered by basic ASCII, such as Dutch and English. These are also the most efficient, as they use 1 byte per character in UTF-8, which is not a trivial advantage for network communications.

Windows has all this in well-defined APIs. *nix has iconv, and that is about it; exactly how it works and which encodings it knows is less evident.

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Character Conversions
« Reply #34 on: September 22, 2019, 07:28:27 pm »
@bytebites

Strange idea but just tested:

all utf8 chars are printed as ?

No, it does not work.

Winni

JLWest

  • Hero Member
  • *****
  • Posts: 1293
Re: Character Conversions
« Reply #35 on: September 22, 2019, 09:18:46 pm »
I just tested it on a large file and it worked perfectly.

This is great.
FPC 3.2.0, Lazarus IDE v2.0.4
 Windows 10 Pro 32-GB
 Intel i7 770K CPU 4.2GHz 32702MB Ram
GeForce GTX 1080 Graphics - 8 Gig
4.1 TB

Thaddy

  • Hero Member
  • *****
  • Posts: 14197
  • Probably until I exterminate Putin.
Re: Character Conversions
« Reply #36 on: September 23, 2019, 07:03:55 am »
@bytebites

Strange idea but just tested:

all utf8 chars are printed as ?

No, it does not work.

Winni
Depends. Here's a demo: you are partially right and JLWest is wrong.
Code: Pascal
{$mode delphi}{$H+}
// This may fool you into thinking it always works!
// Make sure you prepare a file in an Ansi encoding
// that supports French. Then it doesn't work every time.
// It doesn't work at all for non-Western code pages.
const strings:string =
'TFFA (DSD) - La Désirade Airport - Beauséjour - La Désirade'+LineEnding+
'TFFB (BBR) - Baillif Airport - Baillif - Basse-Terre'+LineEnding+
'TFFC (SFC) - Saint-François Airport - Saint-François - Grande-Terre'+LineEnding+
'TFFM (GBJ) - Marie-Galante Airport - Grand-Bourg - Marie-Galante'+LineEnding+
'TFFR (PTP) - Pointe-à-Pitre - Le Raizet Airport - Pointe-à-Pitre - Grande-Terre'+LineEnding+
'TFFS (LSS) - Les Saintes Airport - Terre-de-Haut - Les Saintes';

function toascii(s: string): string;
type
  USASCIIString = type ansistring(20127);
begin
  Result := USASCIIString(s);
end;

begin
  writeln(ToAscii(Strings));
end.

If you run this, it works, but the test is flawed because I relied on the capabilities of the editor.
If you run the same code on text read from a file, the ???? start to appear. (Try a Lithuanian encoding, windows-1257, which is still Western but with some twists in the diacritics, see Marco's remark, or KOI-8, with which many forum users are familiar.)

It is still a very useful function, but not perfect. But it is actually short and pretty concise, which I like, except for Lithuanian.... so I have to read my wife's letters by guessing the question marks... ::)
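
The file case described above can be sketched roughly as follows. This is only a sketch: it assumes the file really is encoded as windows-1257, and the names CP1257String and FileToAscii are made up for illustration (they are not part of any code posted in this thread). Reading the bytes into a string type declared with code page 1257 lets the conversion to code page 20127 start from the right source encoding instead of whatever the editor or the system default happens to be. Whether an accented letter comes out as a base letter or as '?' still depends on the conversion routines of the platform.

```pascal
program FileToAsciiSketch;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // code page conversion support on Unix
  Classes, SysUtils;

type
  CP1257String  = type AnsiString(1257);   // assumed encoding of the input file
  USASCIIString = type AnsiString(20127);  // 7-bit US-ASCII

// Hypothetical helper: loads a whole file as windows-1257 and converts it.
function FileToAscii(const FileName: string): string;
var
  fs: TFileStream;
  src: CP1257String;
  ascii: USASCIIString;
begin
  src := '';
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // Copy the raw bytes; src is declared as 1257, so they are treated as such.
    SetLength(src, fs.Size);
    if Length(src) > 0 then
      fs.ReadBuffer(src[1], Length(src));
  finally
    fs.Free;
  end;
  ascii := src;     // conversion from code page 1257 to 20127 happens here
  Result := ascii;  // plain ASCII survives the conversion to the default string
end;

begin
  if ParamCount < 1 then
  begin
    WriteLn('usage: FileToAsciiSketch <file>');
    Halt(1);
  end;
  WriteLn(FileToAscii(ParamStr(1)));
end.
```

For a different source encoding, the declared code page of the source string type would have to change accordingly.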
« Last Edit: September 23, 2019, 07:15:31 am by Thaddy »
Specialize a type, not a var.

JLWest

  • Hero Member
  • *****
  • Posts: 1293
Re: Character Conversions
« Reply #37 on: September 23, 2019, 04:00:08 pm »
@Thaddy

I'm only now testing bytebites' function. The one that I said works was winni's function. I should get to howardpc's code sometime later today or tomorrow.

I don't expect any of these functions to convert everything. With 11 million lines of text files, 90% would be fantastic.

It would be nice to be able to add characters that come back as ?.

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Character Conversions
« Reply #38 on: September 23, 2019, 04:57:15 pm »
@thaddy

Your code does not work for me! Not when writing to the console, not when writing to a file, not when using ShowMessage. The result is always the same: the French special characters are returned as '?'.

Linux, gtk2, KDE Plasma, Lazarus 2.04, fpc 3.01

@JLWest

Yes, you can extend the constant array ar in the unit utf8toAsciiConvert. Every entry looks like this:
Code: Pascal
(u: '€'; a: 'EUR'),

u is the UTF-8 input string, a is the ASCII output string.

Don't forget to then change the constant length in the array header!
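
The attachment itself is not reproduced in this thread, so the following is only a guess at the shape of such a table and of the replace loop around it. The record type TMapEntry, the function MapToAscii and every entry except the € one are assumptions made for illustration. The source file is assumed to be saved as UTF-8 without a BOM, so that the literals keep their UTF-8 bytes.

```pascal
program MapTableSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  // Hypothetical entry layout, modelled on the (u: ...; a: ...) line above.
  TMapEntry = record
    u: string;  // UTF-8 input sequence
    a: string;  // ASCII replacement
  end;

const
  // When adding an entry, the upper bound in the array header must grow too.
  ar: array[0..3] of TMapEntry = (
    (u: '€'; a: 'EUR'),
    (u: 'é'; a: 'e'),
    (u: 'à'; a: 'a'),
    (u: 'ç'; a: 'c')
  );

// Replaces every listed UTF-8 sequence by its ASCII counterpart.
// Sequences that are not in the table are left untouched.
function MapToAscii(const s: string): string;
var
  i: Integer;
begin
  Result := s;
  for i := Low(ar) to High(ar) do
    Result := StringReplace(Result, ar[i].u, ar[i].a, [rfReplaceAll]);
end;

begin
  WriteLn(MapToAscii('Saint-François, Pointe-à-Pitre, 5 €'));
end.
```

A table like this only ever covers what has been entered, which is why characters missing from it pass through unchanged until an entry is added.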

Winni

Thaddy

  • Hero Member
  • *****
  • Posts: 14197
  • Probably until I exterminate Putin.
Re: Character Conversions
« Reply #39 on: September 23, 2019, 06:43:59 pm »
@thaddy
Your code does not work for me! Not when writing to the console, not when writing to a file, not when using ShowMessage. The result is always the same: the French special characters are returned as '?'.
That was the intention of my example......

JLWest

  • Hero Member
  • *****
  • Posts: 1293
Re: Character Conversions
« Reply #40 on: September 23, 2019, 06:53:07 pm »
@Thaddy

"so I have to read my wife's letters by guessing the question marks."

No, you don't. You don't even have to open the letters, just send her another thousand.


winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Character Conversions
« Reply #41 on: September 23, 2019, 06:58:58 pm »
@ Thaddy: Why?

There was no FreeAndNil in my code!


JLWest

  • Hero Member
  • *****
  • Posts: 1293
Re: Character Conversions
« Reply #42 on: September 25, 2019, 11:02:19 pm »
    function toascii(s: string): string;
    type
      USASCIIString = type ansistring(20127);
    begin
      Result := USASCIIString(s);
    end;
     
I ran 24 files through the function above.

The program produced 227 files (one per country).
Here is a list of the countries that did not translate:

Argentina.txt, Brazil.txt, Canada.txt, FaroeIslands.txt, Germany.txt, Iceland.txt, Italy.txt
Maldives.txt, Moldova.txt, Romania.txt, Thailand.txt, Tonga.txt, Uruguay.txt, Vietnam.txt

The 227 files contain thousands of lines of data consisting of airport names, cities, states and countries.
They cover about 90 to 95 percent of the world's airports.

I don't know which character sets it failed on. For example, which character set does Argentina use?

Actually, I think that little one-line function did fantastically well.
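
One way to narrow down which lines trip up a conversion is to scan each input file for bytes outside 7-bit ASCII before converting. A rough sketch, not taken from any of the posted units; CountNonAsciiBytes and the program name are invented here:

```pascal
program NonAsciiScanSketch;
{$mode objfpc}{$H+}

// Returns the number of bytes in s that lie outside 7-bit ASCII.
function CountNonAsciiBytes(const s: RawByteString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if Ord(s[i]) > 127 then
      Inc(Result);
end;

var
  f: TextFile;
  line: string;
  lineNo, n: Integer;
begin
  if ParamCount < 1 then
  begin
    WriteLn('usage: NonAsciiScanSketch <file>');
    Halt(1);
  end;
  AssignFile(f, ParamStr(1));
  Reset(f);
  lineNo := 0;
  while not Eof(f) do
  begin
    ReadLn(f, line);
    Inc(lineNo);
    n := CountNonAsciiBytes(line);
    // Report only the lines that contain something beyond plain ASCII.
    if n > 0 then
      WriteLn('line ', lineNo, ': ', n, ' non-ASCII byte(s)');
  end;
  CloseFile(f);
end.
```

Running it on a file such as Argentina.txt would list the lines worth inspecting by hand; it counts bytes, not characters, so a single UTF-8 character shows up as two or more bytes.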

@Thaddy

The text you refer to is the list of airports in Guadeloupe. This is what I have in TFile.txt, and it is what gets put through the translation. As you can see, there are no non-ASCII characters.

|*|Guadeloupe|3|City|Guadeloupe.txt|X|
    TFFA (DSD)|La Desirade Airport|Beausejour|La Desirade
    TFFB (BBR)|Baillif Airport|Baillif|Basse-Terre
    TFFC (SFC)|Saint-Francois Airport|Saint-Francois|Grande-Terre
    TFFM (GBJ)|Marie-Galante Airport|Grand-Bourg|Marie-Galante
    TFFR (PTP)|Pointe-a-Pitre|Le Raizet Airport|Pointe-a-Pitre

This is what was copied from https://en.wikipedia.org/wiki/List_of_airports_by_ICAO_code:_T
The Guadeloupe entries definitely have non-ASCII characters.

Also see airport category and list.

    TFFA (DSD) – La Désirade Airport – Beauséjour, La Désirade
    TFFB (BBR) – Baillif Airport – Baillif, Basse-Terre
    TFFC (SFC) – Saint-François Airport – Saint-François, Grande-Terre
    TFFM (GBJ) – Marie-Galante Airport – Grand-Bourg, Marie-Galante
    TFFR (PTP) – Pointe-à-Pitre - Le Raizet Airport – Pointe-à-Pitre, Grande-Terre
    TFFS (LSS) – Les Saintes Airport – Terre-de-Haut, Les Saintes

This is what is in my Guadeloupe.txt file. I can't explain it.

|TFFA|DSD|La Desirade Airport|Beausejour|La Desirade|Guadeloupe|
|TFFB|BBR|Baillif Airport|Baillif|Basse-Terre|Guadeloupe|
|TFFC|SFC|Saint-Francois Airport|Saint-Francois|Grande-Terre|Guadeloupe|
|TFFM|GBJ|Marie-Galante Airport|Grand-Bourg|Marie-Galante|Guadeloupe|
|TFFR|PTP|Pointe-a-Pitre|Le Raizet Airport|Pointe-a-Pitre|Guadeloupe|

So now I will test winni's function, but only on the following files: Argentina.txt, Brazil.txt, Canada.txt, FaroeIslands.txt, Germany.txt, Iceland.txt, Italy.txt, Maldives.txt, Moldova.txt, Romania.txt, Thailand.txt, Tonga.txt, Uruguay.txt, Vietnam.txt.

JLWest

  • Hero Member
  • *****
  • Posts: 1293
Re: Character Conversions
« Reply #43 on: September 28, 2019, 04:13:03 am »
Just in the interest of completeness: I have finished testing howardpc's translation program. It was a program, and I changed it into a unit so I could use it with any program as needed.

None of the three translation units could translate the Vietnam file. howardpc's code and winni's were very close; maybe howardpc's was a little better. Right now the program does not save the translated file; the result is in a listbox. I haven't decided on naming conventions.

If anyone is interested, I could post the program with some test data.

Send me a message or add a reply.
 

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Character Conversions
« Reply #44 on: September 28, 2019, 01:44:22 pm »
Hi!

As I wrote, my code is for the UTF-8 characters of the European languages.

One reason is that I know nothing about Chinese or Sanskrit.
The second reason is the nearly "endless" UTF-8 table. What is necessary?
Do we need Cherokee?

Winni

 
