Author Topic: ASCII characters Questions  (Read 4984 times)

JLWest

  • Hero Member
  • *****
  • Posts: 1293
ASCII characters Questions
« on: May 11, 2019, 02:50:54 pm »
The word I'm trying to display as ASCII codes is 'Afrânio'.

The demo program gives me:
'A' = 65     OK
'f' = 102    OK
'r' = 114    OK
'├' = 195    4th character: I have no idea what this is
' ' = 162    5th character: shows as a blank, but its value is 162
'n' = 110    shown as the 6th character, but 'n' is 5th in 'Afrânio'
'i' = 105    shown as the 7th character, but 'i' is 6th in 'Afrânio'
'o' = 111    shown as the 8th character, but 'o' is 7th in 'Afrânio'

In the line "Lgth := Length(RCD);" FPC says there are 8 characters. I count 7 visually.
There is something going on that I can't figure out.
 


Code: Pascal
unit Unit1;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, FileUtil, Forms, Controls, Graphics, Dialogs, StdCtrls;

type

  { TForm1 }

  TForm1 = class(TForm)
    btnName: TButton;
    Edit1: TEdit;
    Edit2: TEdit;
    Edit3: TEdit;
    Edit4: TEdit;
    Edit5: TEdit;
    procedure btnNameClick(Sender: TObject);
    procedure Convert;
    procedure FormCreate(Sender: TObject);
  private
  public
  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

procedure TForm1.FormCreate(Sender: TObject);
var
  S: String = 'Afrânio';
begin
  Edit3.Text := S;
  Edit1.Text := '';
  Edit2.Text := '';
  Edit4.Text := '';
  Edit5.Text := '';
end;

procedure TForm1.Convert;
var
  idx: Integer;
  RCD: String[10] = 'Afrânio';  // 7 letters, but 8 bytes when the source file is UTF-8
  i: Integer;
  Lgth: Integer;
begin
  Lgth := Length(RCD);           // Length counts bytes, not letters
  for idx := 1 to Lgth do
  begin
    Edit4.Text := RCD[idx];      // a single byte, not necessarily a whole character
    Edit5.Text := IntToStr(idx); // position of that byte in the string
    i := Ord(RCD[idx]);
    Edit1.Text := IntToStr(i);   // numeric value of that byte
    ShowMessage('');             // pause so each value can be read
  end;
  Edit1.Text := '';
  Edit2.Text := '';
  Edit4.Text := '';
  Edit5.Text := '';
end;

procedure TForm1.btnNameClick(Sender: TObject);
begin
  Convert;
end;

end.
FPC 3.2.0, Lazarus IDE v2.0.4
 Windows 10 Pro 32-GB
 Intel i7 770K CPU 4.2GHz 32702MB Ram
GeForce GTX 1080 Graphics - 8 Gig
4.1 TB

Birger52

  • Sr. Member
  • ****
  • Posts: 309
Re: ASCII characters Questions
« Reply #1 on: May 11, 2019, 03:04:04 pm »
â is 226 ASCII

But your string is UTF-8, not ASCII.
In UTF-8, â is C3 A2 (195 162): two bytes. (https://www.i18nqa.com/debug/utf8-debug.html)
So the (byte) length of your string is 8, not 7.
195 is Ã and 162 is ¢ in extended ASCII (https://www.rapidtables.com/code/text/ascii-table.html)

You probably need to use some other type than string...
And past that, I'm afraid you need to look to someone else for explanations/solutions.
Not that I won't - I do not know.
;)

Google "lazarus character sets" for instance...
« Last Edit: May 11, 2019, 03:19:23 pm by Birger52 »
Lazarus 2.0.8 FPC 3.0.4
Win7 64bit
Playing and learning - strictly for my own pleasure.

JLWest

  • Hero Member
  • *****
  • Posts: 1293
Re: ASCII characters Questions
« Reply #2 on: May 11, 2019, 03:42:31 pm »
ASCII 226  Alt 226 = 'Γ'

I really don't understand this.

Zoran

  • Hero Member
  • *****
  • Posts: 1831
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: ASCII characters Questions
« Reply #3 on: May 11, 2019, 05:20:52 pm »
ASCII 226  Alt 226 = 'Γ'

I really don't understand this.

First, note that one byte (8 bits) cannot represent more than 256 (2^8) different characters.

There are different character encodings. Some of them use only one byte per character (or even less, as ASCII does, see below), and some use more, so that more characters can be encoded.
Some encodings have variable length; the popular UTF-8 encoding is an example.

So, some encodings can represent the letter â, and some cannot.

The ASCII standard is a 7-bit encoding, so it has only 128 codes. It contains the English upper- and lower-case Latin letters A-Z and a-z, the digits 0-9, standard punctuation characters, and control characters.
ASCII does not contain the letter â.
Note that the fact that my text here contains the letter â means that this forum uses another encoding, not plain ASCII.

Historically, it was a good solution for the English language: Americans created it, and it was all they needed.

However, as our world (still) uses other languages, and most of these languages use more characters, ASCII was not a solution for them.

Then the ANSI encodings came. ANSI is not one encoding; it is a family of one-byte encodings.
The idea is to use the fact that ASCII is a 7-bit encoding while computer memory is organized in bytes (8 bits), so each ASCII character stored in one byte has a leading zero bit. Setting the leading bit to 1 extends the encoding with 128 more characters while keeping compatibility with plain ASCII.

Still, it is not possible to fit all the characters the world uses into one byte, so ASCII got several extensions. One of them covers the west European Latin languages (Portuguese, Spanish, French, German, Swedish, etc.): there you can find the German letters Ä, Ö, Ü, ß, the Spanish ñ, and so on. All of these have values in the upper range (128-255), and the lower range (0-127) stays compatible with ASCII, which is why I call it an ASCII extension.

This ANSI west European encoding (CP-1252) has the letter â encoded as 226.

However, the other European languages still cannot fit in this encoding: the upper range (128-255) is not large enough to hold the east European characters as well.

So another ANSI encoding covers the east European Latin languages (Polish, Slovenian, etc.). Another was added for Cyrillic languages, one for Greek, one for Arabic, etc.

Each of these ANSI standards has a lower range (0-127) compatible with ASCII, but the upper range (128-255) holds different characters.

So, using ANSI extensions can be enough if you don't need characters from different languages that are not covered by a single one of these standards. But you cannot write the Latin letters Č (found in cp1250, but not in cp1252) and ñ (found in cp1252, but not in cp1250) in one text with any of these encodings.
The fact that you can see both of them in the previous sentence means that this forum does not use any of these ANSI encodings!

Also, there are languages in this world that have more than 128 characters and surely cannot be covered by a one-byte ASCII extension.

Then UCS2 was invented. It is a two-byte encoding: each character is represented with two bytes.
It is compatible with old ASCII in this sense: the first 128 characters (range 0-127) are the same as the ASCII characters, so they have zeros in the first 9 bits, followed by the 7-bit ASCII code.
The idea was that it should be enough for everybody.
Most languages fit there, but if you write German, for instance, your text file encoded in UCS2 requires twice the storage of the same text saved in an ANSI (cp1252) encoded file.

Then the brilliant idea came: UTF8, a variable-length encoding. ASCII characters take one byte, and all other characters take more (most European upper-range characters from the ANSI encodings take two bytes, while many Far Eastern characters take three and some characters even four).
For example, the German text mentioned before, now saved in a UTF8 encoded file, will be almost the same size as the ANSI encoded one: only the special German characters will take two bytes, while most of the characters in the text (letters a-z, digits, standard punctuation) will take just one byte.

There are more beautiful things about UTF8 -- read this: http://wiki.freepascal.org/UTF8_strings_and_characters

Lazarus uses UTF8.
In UTF8, the letter â, which you need is encoded with two bytes -- 195 and 162.
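
For anyone wondering where 195 and 162 come from, here is a minimal console sketch of the two-byte UTF8 rule (lead byte 110xxxxx, trail byte 10xxxxxx). The procedure name EncodeUtf8TwoByte is made up for this example and only covers code points U+0080..U+07FF; it is not a library routine.

```pascal
program EncodeTwoByte;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of FPC or Lazarus): encodes a code point
// in the range U+0080..U+07FF as two UTF-8 bytes.
procedure EncodeUtf8TwoByte(CodePoint: Cardinal; out Lead, Trail: Byte);
begin
  Lead  := Byte($C0 or (CodePoint shr 6));   // 110xxxxx: the upper 5 bits
  Trail := Byte($80 or (CodePoint and $3F)); // 10xxxxxx: the lower 6 bits
end;

var
  Lead, Trail: Byte;
begin
  EncodeUtf8TwoByte($E2, Lead, Trail);  // U+00E2 (decimal 226) is the letter â
  WriteLn(Lead, ' ', Trail);            // prints: 195 162
end.
```

So 226 is the code point of â (and also its byte value in CP-1252), while 195 and 162 are the two bytes that carry that code point in UTF8.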

Just to be complete: the two-byte encoding UCS2 was simply not enough for all the characters our world needs. It is therefore deprecated now and has been superseded by the UTF16 encoding standard. I'm not going further into this; I hope I helped.

lucamar

  • Hero Member
  • *****
  • Posts: 4219
Re: ASCII characters Questions
« Reply #4 on: May 11, 2019, 05:32:02 pm »
I really don't understand this.

It's quite easy: the string is UTF8 encoded, so any character beyond plain ASCII [0..127] (the Basic Latin block of Unicode) will be encoded in 2 to 4 bytes. Length(String) counts bytes, so it gives you the number of characters plus the one extra byte in the encoding of "â". UTF8Length() (in unit LazUTF8) will give you the number of characters.
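
To see the difference between bytes and characters without any Lazarus units, here is a small console sketch. CodePointCount is a hypothetical helper written for this post; it assumes the input is valid UTF8 and simply skips continuation bytes. In a Lazarus project UTF8Length from LazUTF8 is the routine to use instead.

```pascal
program CountCodePoints;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points by skipping UTF-8 continuation
// bytes (bit pattern 10xxxxxx). Assumes S holds valid UTF-8.
function CodePointCount(const S: String): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: String;
begin
  // 'Afrânio' written with explicit bytes (â = #$C3#$A2), so the result
  // does not depend on how the source file is saved.
  S := 'Afr'#$C3#$A2'nio';
  WriteLn(Length(S));          // prints 8: the number of bytes
  WriteLn(CodePointCount(S));  // prints 7: the number of code points
end.
```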

â is 226 ASCII

No, it isn't. It may be #226 in some so-called extended-ASCII encodings or in some Windows code page(s), but plain ASCII is a seven-bit code: it defines only characters #0..#127.

One must be precise in these matters or chaos ensues ;)
« Last Edit: May 11, 2019, 05:40:09 pm by lucamar »
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

VTwin

  • Hero Member
  • *****
  • Posts: 1215
  • Former Turbo Pascal 3 user
Re: ASCII characters Questions
« Reply #5 on: May 11, 2019, 05:44:31 pm »
ASCII 226  Alt 226 = 'Γ'

I really don't understand this.

Already said, but that is for extended-ASCII, not UTF-8.

In short, your program tries to convert 195 and 162 (one character) to two characters that don't exist (extended-ASCII  collision).

http://iconoun.com/articles/collisions/
« Last Edit: May 11, 2019, 05:46:57 pm by VTwin »
“Talk is cheap. Show me the code.” -Linus Torvalds

Free Pascal Compiler 3.2.2
macOS 12.1: Lazarus 2.2.6 (64 bit Cocoa M1)
Ubuntu 18.04.3: Lazarus 2.2.6 (64 bit on VBox)
Windows 7 Pro SP1: Lazarus 2.2.6 (64 bit on VBox)

JLWest

  • Hero Member
  • *****
  • Posts: 1293
Re: ASCII characters Questions
« Reply #6 on: May 11, 2019, 05:54:38 pm »
Here is what I'm trying to do but can't figure out yet, although I'm getting closer.

I have a table of cities and one of countries, all told about 40,000 entries.

Some are UTF8 and some ASCII I guess. I need a function that, when given a string like 'öäüèéàCUT', will return the following in ASCII: 'oaueeaCUT'.

As far as I can determine there isn't a function like that in FPC (I'm surprised), so I suppose one has to write one.






lucamar

  • Hero Member
  • *****
  • Posts: 4219
Re: ASCII characters Questions
« Reply #7 on: May 11, 2019, 06:07:12 pm »
Some are UTF8 and some ASCII I guess.

Rather think of it as all being UTF8. What seems to be ASCII is really an UTF8 string where all characters fall in the set [#32..#127].

And you're right: AFAICT there is no conversion function for what you want. I looked for it some time ago (to ASCIIfy filenames) and could find nothing so I built my own to translate the characters most common around here (Spain), forgetting about cyrillic, greek, etc. I keep adding chars to it whenever I encounter one I don't have :D
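
A minimal sketch of what such a hand-built table could look like (this is not lucamar's actual function; the names TMapEntry, Map and Asciify are invented for the example). It assumes the input is UTF8, lists only a dozen Latin letters, and maps ß to 'ss', which is a common convention but my own choice here. The letters are written as explicit UTF8 bytes so the program does not depend on the encoding of the source file.

```pascal
program AsciifyDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  // Hypothetical table entry: one accented letter and its ASCII stand-in.
  TMapEntry = record
    Utf8: String;   // UTF-8 byte sequence of the accented letter
    Plain: String;  // ASCII replacement
  end;

const
  // Deliberately incomplete: extend it as new characters turn up.
  Map: array[0..11] of TMapEntry = (
    (Utf8: #$C3#$A0; Plain: 'a'),   // à
    (Utf8: #$C3#$A2; Plain: 'a'),   // â
    (Utf8: #$C3#$A4; Plain: 'a'),   // ä
    (Utf8: #$C3#$A8; Plain: 'e'),   // è
    (Utf8: #$C3#$A9; Plain: 'e'),   // é
    (Utf8: #$C3#$B1; Plain: 'n'),   // ñ
    (Utf8: #$C3#$B6; Plain: 'o'),   // ö
    (Utf8: #$C3#$BC; Plain: 'u'),   // ü
    (Utf8: #$C3#$84; Plain: 'A'),   // Ä
    (Utf8: #$C3#$96; Plain: 'O'),   // Ö
    (Utf8: #$C3#$9C; Plain: 'U'),   // Ü
    (Utf8: #$C3#$9F; Plain: 'ss')   // ß
  );

// Replaces every listed UTF-8 sequence with its ASCII stand-in.
// Characters that are not in the table are left untouched.
function Asciify(const S: String): String;
var
  i: Integer;
begin
  Result := S;
  for i := Low(Map) to High(Map) do
    Result := StringReplace(Result, Map[i].Utf8, Map[i].Plain, [rfReplaceAll]);
end;

begin
  // The bytes below spell 'öäüèéàCUT' in UTF-8.
  WriteLn(Asciify(#$C3#$B6#$C3#$A4#$C3#$BC#$C3#$A8#$C3#$A9#$C3#$A0'CUT'));
  // prints: oaueeaCUT
end.
```

Replacing whole byte sequences is safe with UTF8 because a complete sequence can never match in the middle of another character. For 40,000 short names the repeated StringReplace calls should be fast enough; a single pass over the string would be the next step if it is not.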

jamie

  • Hero Member
  • *****
  • Posts: 6131
Re: ASCII characters Questions
« Reply #8 on: May 11, 2019, 06:11:14 pm »
If you are in windows...

the table is utf8 encoded, its  using Extend ASCII which is fine I guess..

 The function  WinCPToUTF8(String(#226)); displays your letter because it converts it to a utf8..

You can convert the whole string If you like but remember this, the string will not be a one to one index after that.

The only true wisdom is knowing you know nothing

JLWest

  • Hero Member
  • *****
  • Posts: 1293
Re: ASCII characters Questions
« Reply #9 on: May 11, 2019, 06:18:02 pm »
@Jamie

The function  WinCPToUTF8(String(#226)); displays your letter because it converts it to a utf8..

You can convert the whole string If you like but remember this, the string will not be a one to one index after that.

Is WinCPToUTF8 a Windows API, and if so, what do I need in my uses clause?

" but remember this, the string will not be a one to one index after that."
I don't really understand what you are saying here.

lucamar

  • Hero Member
  • *****
  • Posts: 4219
Re: ASCII characters Questions
« Reply #10 on: May 11, 2019, 06:21:52 pm »
the table is utf8 encoded, its  using Extend ASCII which is fine I guess..

No: character data encoded as UTF8 is Unicode.

Of course, you can convert it to any Windows Code Page, but first one must ascertain which code page will cause the least damage. Which is not as difficult as it sounds ... unless it's Russian text citing a Chinese philosopher citing a French novelist :)

" but remember this, the string will not be a one to one index after that."
I don't really understand what you are saying here.

I think he means that there isn't a byte-to-byte (or char-to-char) correspondence between the original UTF8 string and the ANSI one. Which is quite logical, since the double-byte UTF8 character will be converted to a single-byte ANSI one.
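
A small console sketch of that index shift. Utf8ToLatin1 is a hypothetical helper written only for this illustration: it handles code points up to U+00FF (where the single-byte value equals the code point, as it does for â = 226 in both ISO-8859-1 and CP-1252) and turns everything else into '?'. In a Lazarus project the real conversions are WinCPToUTF8 and UTF8ToWinCP from unit LazUTF8, which is part of Lazarus (LazUtils), not a Windows API.

```pascal
program IndexShift;
{$mode objfpc}{$H+}

// Hypothetical helper: converts UTF-8 to a one-byte-per-character string.
// Only code points up to U+00FF are handled; anything else becomes '?'.
function Utf8ToLatin1(const S: String): String;
var
  i: Integer;
  b: Byte;
begin
  Result := '';
  i := 1;
  while i <= Length(S) do
  begin
    b := Ord(S[i]);
    if b < $80 then
    begin
      Result := Result + S[i];  // plain ASCII is copied unchanged
      Inc(i);
    end
    else if ((b = $C2) or (b = $C3)) and (i < Length(S)) then
    begin
      // Two-byte sequence for U+0080..U+00FF: rebuild the code point.
      Result := Result + Chr(((b and $1F) shl 6) or (Ord(S[i + 1]) and $3F));
      Inc(i, 2);
    end
    else
    begin
      Result := Result + '?';   // outside the range this sketch supports
      Inc(i);
      while (i <= Length(S)) and ((Ord(S[i]) and $C0) = $80) do
        Inc(i);                 // skip the rest of the sequence
    end;
  end;
end;

var
  U, A: String;
begin
  U := 'Afr'#$C3#$A2'nio';             // 'Afrânio' as UTF-8 bytes
  A := Utf8ToLatin1(U);
  WriteLn(Length(U), ' ', Length(A));  // prints: 8 7
  WriteLn(Ord(A[4]));                  // prints: 226 (â as one byte)
  WriteLn(Ord(U[6]), ' ', Ord(A[5]));  // prints: 110 110 ('n' moved from index 6 to 5)
end.
```

Every character after the â sits one position earlier in the converted string, so an index computed on one form cannot be used on the other.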
« Last Edit: May 11, 2019, 06:24:25 pm by lucamar »

Zoran

  • Hero Member
  • *****
  • Posts: 1831
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: ASCII characters Questions
« Reply #11 on: May 11, 2019, 06:26:45 pm »
Rather think of it as all being UTF8. What seems to be ASCII is really an UTF8 string where all characters fall in the set [#32..#127].

No, I don't think so. I believe that when he says ASCII, he means ANSI.

jamie

  • Hero Member
  • *****
  • Posts: 6131
Re: ASCII characters Questions
« Reply #12 on: May 11, 2019, 06:42:39 pm »
He has strings from a file that uses the Extended ASCII letter sets. They are in the range 128..255.

In order for him to display the letter as it should look, he needs to generate a UTF8 string.

But this is the issue: as soon as he starts manipulating this data as UTF8 strings, that value will become a lost value and thus be displayed as a ? or some other unexpected letter.

JLWest

  • Hero Member
  • *****
  • Posts: 1293
Re: ASCII characters Questions
« Reply #13 on: May 11, 2019, 06:44:03 pm »
Maybe I'm saying this wrong or something.

I would like to convert UTF8 strings to ASCII (American Standard Code for Information Interchange).

S : String = 'ÄÖÜß   ñâ'
C : String = '';
 
So if I called:      C  := ConvertUTF8ToASCII(S : String) : string;

I would get:        C = 'AOUB na'

Zoran

  • Hero Member
  • *****
  • Posts: 1831
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: ASCII characters Questions
« Reply #14 on: May 11, 2019, 06:54:02 pm »
So if I called:      C  := ConvertUTF8ToASCII(S : String) : string;

I would get:        C = 'AOUB na'

No, you won't.
But Lucamar says he has some ASCIIfy function, he might share it with you:
And you're right: AFAICT there is no conversion function for what you want. I looked for it some time ago (to ASCIIfy filenames) and could find nothing so I built my own to translate the characters most common around here (Spain), forgetting about cyrillic, greek, etc. I keep adding chars to it whenever I encounter one I don't have :D

 
