Lazarus

Free Pascal => Beginners => Topic started by: JLWest on August 18, 2019, 12:10:38 am

Title: Extended ASCII Chars Ord Value Questions
Post by: JLWest on August 18, 2019, 12:10:38 am
I create the ASCII character set in a listbox using the following code. But when I try to convert the characters some don't convert back to the same integer value.


Code: Pascal
unit Unit1;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, Controls, Graphics, Dialogs,
  StdCtrls, StrUtils, LazUTF8;

type

  { TForm1 }

  TForm1 = class(TForm)
    Edit1: TEdit;
    Edit2: TEdit;
    Label1: TLabel;
    ListBox1: TListBox;

    procedure FormCreate(Sender: TObject);
    procedure ListBox1Click(Sender: TObject);

  private

  public

  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

{ TForm1 }

procedure TForm1.ListBox1Click(Sender: TObject);
var
  i    : Integer = -1;
  Bit1 : String;
  Item : String;
begin
  i := ListBox1.ItemIndex;
  if (i = -1) or (i = 0) then begin Exit; end;
  Item := Listbox1.Items[i];

  Bit1 := Copy2SpaceDel(item);
  Item := Trim(Item);
  Bit1 := Copy2SpaceDel(item);
  Item := Trim(Item);
  Bit1 := Copy2SpaceDel(item);
  Item := Trim(Item);

  Label1.Caption := Item;
  Edit1.Text := Item;

  Item := IntToStr(Ord(Item[1]));
  Edit2.Text := Item;
  i := i;
end;

procedure TForm1.FormCreate(Sender: TObject);
var
  i: Integer;
begin
  ListBox1.Items.Add('Ascii ' + IntToStr(32) + ' = ' + 'Space'  );
  for i := 33 to 255 do begin
    ListBox1.Items.Add('Ascii ' + IntToStr(i) + ' =    ' + WinCPToUTF8(String(Chr(i)))  );
  end;
end;

end.
Title: Re: Extended ASCII Chars Ord Value Questions
Post by: jamie on August 18, 2019, 12:59:07 am
I think you need to use the CP850ToUTF8 function instead.

It could also be CP437ToUTF8.

If you are trying for the old IBM / DOS sets, I believe those are a good starting point.
Title: Re: Extended ASCII Chars Ord Value Questions
Post by: winni on August 18, 2019, 01:12:08 am
Yes, follow jamie's hints.

To make it clear: you are working with UTF-8, which is the Lazarus standard. UTF-8 and ASCII are only the same in [0..127].

In [128..255] you get the "latin supplement" - look here:

https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)

The days of ASCII, ANSI and IBM8 are gone.

Winni
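
As a small illustration of winni's point, the sketch below (not from the thread) uses the RTL function UTF8Encode to show where UTF-8 stops matching single-byte ASCII. The helper name UTF8ByteCount is made up for this example.

```pascal
program Utf8Lengths;
{$mode objfpc}{$H+}

// Hypothetical helper: number of bytes UTF-8 needs for one code point
// from the Basic Multilingual Plane (surrogates excluded).
function UTF8ByteCount(CodePoint: Word): Integer;
var
  U: UnicodeString;
begin
  U := WideChar(CodePoint);          // one UTF-16 code unit
  Result := Length(UTF8Encode(U));   // length in bytes of the UTF-8 form
end;

begin
  WriteLn('U+0041: ', UTF8ByteCount($0041), ' byte(s)');  // 'A', plain ASCII
  WriteLn('U+007F: ', UTF8ByteCount($007F), ' byte(s)');  // last one-byte code point
  WriteLn('U+0080: ', UTF8ByteCount($0080), ' byte(s)');  // first two-byte code point
  WriteLn('U+00C1: ', UTF8ByteCount($00C1), ' byte(s)');  // 'Á', Latin-1 Supplement
end.
```

Everything up to U+007F is a single byte with the same value as in ASCII; from U+0080 on, each character takes at least two bytes, so Ord() on the first byte no longer gives the code point.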

Title: Re: Extended ASCII Chars Ord Value Questions
Post by: jamie on August 18, 2019, 01:16:03 am
Hey, I resent that or wait, resemble that  :o

Yes my forehead is shiny these days!
Title: Re: Extended ASCII Chars Ord Value Questions
Post by: jamie on August 18, 2019, 02:45:51 am
I am looking at the LConvEncoding file; it looks like there is a lot of work in there, with many case steps.

Wouldn't it be more efficient to use a 2-dimensional static array and do a quick scan on it?

Also, I notice many CP..To..XX functions call the same inner function, but I don't see any attempt at inlining. It would save on stack allocations and speed things up. I mean, you still need to call a function, but this is a double step instead of a single step.

It could also be done like Apple does: have a resource table in the bundle folder that is read per code page, which could be easily edited for corrections or additions.

Something to think about, I guess.
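
To make the table idea concrete, here is a minimal sketch of a static lookup array. It is not how LConvEncoding is implemented and it is directly indexed rather than scanned. It assumes Windows code page 1252, where only the bytes $80..$9F differ from the Unicode code point of the same value. The names CP1252High and CP1252ToCodePoint are invented for this example.

```pascal
program CP1252Table;
{$mode objfpc}{$H+}

const
  // Unicode code points for the CP1252 bytes $80..$9F.
  // $0000 marks a byte that CP1252 leaves undefined.
  CP1252High: array[$80..$9F] of Word = (
    $20AC, $0000, $201A, $0192, $201E, $2026, $2020, $2021,
    $02C6, $2030, $0160, $2039, $0152, $0000, $017D, $0000,
    $0000, $2018, $2019, $201C, $201D, $2022, $2013, $2014,
    $02DC, $2122, $0161, $203A, $0153, $0000, $017E, $0178);

// Hypothetical helper: map one CP1252 byte to its Unicode code point.
function CP1252ToCodePoint(B: Byte): Word;
begin
  if (B >= $80) and (B <= $9F) then
    Result := CP1252High[B]   // table lookup for the 32 special bytes
  else
    Result := B;              // all other bytes equal their code point
end;

begin
  WriteLn('$80 -> U+', HexStr(CP1252ToCodePoint($80), 4));  // euro sign
  WriteLn('$C1 -> U+', HexStr(CP1252ToCodePoint($C1), 4));  // 'Á'
end.
```

A table like this could just as well be loaded from a resource or data file at startup, which is the editable-per-code-page variant suggested above.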

Title: Re: Extended ASCII Chars Ord Value Questions
Post by: Handoko on August 18, 2019, 08:02:01 am
@JLWest

I think I have solved your issue.

Using my Character Map, these are what I found:
- ASCII #128 .. #191 will be mapped to #194 + C
- ASCII #192 .. #255 will be mapped to #195 + (C-64)
- ASCII #127 .. #160 are non-displayable characters (at least on my system)

My solution is to write 2 functions: ASCII2UTF8 and UTF82ASCII:

Code: Pascal
function ASCII2UTF8(C: Char): string;
begin
  Result := '';
  case C of
    #128..#191 : Result := chr(194) + C;
    #192..#255 : Result := chr(195) + chr(Ord(C)-64);
    else
      Result := C;
  end;
end;

function ASCII2UTF8(B: Byte): string;
begin
  Result := ASCII2UTF8(chr(B));
end;

function UTF82ASCII(const S: string): Char;
var
  C1, C2: Char;
begin
  Result := #0;
  if Length(S) <= 1 then
  begin
    if S = '' then Exit;
    Result := S[1];
    Exit;
  end;
  C1 := S[1];
  C2 := S[2];
  case C1 of
    #194 : Result := C2;
    #195 : Result := chr(Ord(C2)+64);
  end;
end;

My solution was only tested on Ubuntu Mate GTK2; it may or may not work on Windows. Also, it does not try to correctly remap the characters $7F..$A0, as they are not displayable on my system (see img2) and I have no clue how to map them.

Below is the whole source code:
Code: Pascal
unit Unit1;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, Controls, StdCtrls, StrUtils;

type

  { TForm1 }

  TForm1 = class(TForm)
    Edit1: TEdit;
    Edit2: TEdit;
    Label1: TLabel;
    ListBox1: TListBox;
    procedure FormCreate(Sender: TObject);
    procedure ListBox1Click(Sender: TObject);
  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

{ TForm1 }

function ASCII2UTF8(C: Char): string;
begin
  Result := '';
  case C of
    #128..#191 : Result := chr(194) + C;
    #192..#255 : Result := chr(195) + chr(Ord(C)-64);
    else
      Result := C;
  end;
end;

function ASCII2UTF8(B: Byte): string;
begin
  Result := ASCII2UTF8(chr(B));
end;

function UTF82ASCII(const S: string): Char;
var
  C1, C2: Char;
begin
  Result := #0;
  if Length(S) <= 1 then
  begin
    if S = '' then Exit;
    Result := S[1];
    Exit;
  end;
  C1 := S[1];
  C2 := S[2];
  case C1 of
    #194 : Result := C2;
    #195 : Result := chr(Ord(C2)+64);
  end;
end;

procedure TForm1.ListBox1Click(Sender: TObject);
var
  Item : string;
  Bit1 : string;
  i    : Integer = -1;
begin
  i := ListBox1.ItemIndex;
  if (i = -1) or (i = 0) then Exit;
  Item := Listbox1.Items[i];

  Bit1 := Copy2SpaceDel(item);
  Item := Trim(Item);
  Bit1 := Copy2SpaceDel(item);
  Item := Trim(Item);
  Bit1 := Copy2SpaceDel(item);
  Item := Trim(Item);
  Label1.Caption := Item;
  Edit1.Text     := Item;

  Edit2.Text := Ord(UTF82ASCII(Item)).ToString;
end;

procedure TForm1.FormCreate(Sender: TObject);
var
  i: Integer;
begin
  ListBox1.Items.Add('Ascii ' + IntToStr(32) + ' = ' + 'Space');
  for i := 33 to 255 do
    ListBox1.Items.Add('Ascii ' + IntToStr(i) + ' =    ' + ASCII2UTF8(i));
end;

end.
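
The #194/#195 prefixes in the mapping above are not arbitrary: they fall out of the two-byte UTF-8 bit layout (110xxxxx 10xxxxxx) when the byte value is treated as a Unicode code point, which is the Latin-1 (ISO 8859-1) interpretation. The sketch below shows the same mapping written with bit operations. The function names Latin1ToUTF8 and UTF8ToLatin1 are made up for this example, and, like the functions above, it does not handle code-page-specific characters such as those CP1252 places in $80..$9F.

```pascal
program Latin1Utf8;
{$mode objfpc}{$H+}

// Encode a Latin-1 byte (equal to Unicode code point 0..255) as UTF-8.
function Latin1ToUTF8(B: Byte): string;
begin
  if B < $80 then
    Result := Chr(B)                      // 0xxxxxxx: one byte, same as ASCII
  else
    Result := Chr($C0 or (B shr 6)) +     // 110xxxxx: gives $C2 (194) or $C3 (195)
              Chr($80 or (B and $3F));    // 10xxxxxx: low six bits of the value
end;

// Decode the first UTF-8 sequence in S back to a Latin-1 byte.
// Returns False if S is empty, malformed, or encodes something above U+00FF.
function UTF8ToLatin1(const S: string; out B: Byte): Boolean;
var
  Lead, Trail: Byte;
begin
  B := 0;
  Result := False;
  if S = '' then Exit;
  Lead := Ord(S[1]);
  if Lead < $80 then
  begin
    B := Lead;
    Exit(True);
  end;
  if (Length(S) < 2) or ((Lead <> $C2) and (Lead <> $C3)) then Exit;
  Trail := Ord(S[2]);
  if (Trail and $C0) <> $80 then Exit;    // not a continuation byte
  B := ((Lead and $03) shl 6) or (Trail and $3F);
  Result := True;
end;

var
  i: Integer;
  Decoded: Byte;
  Encoded: string;
begin
  // Every byte value should survive the round trip.
  for i := 0 to 255 do
  begin
    Encoded := Latin1ToUTF8(i);
    if not UTF8ToLatin1(Encoded, Decoded) or (Decoded <> i) then
      WriteLn('Round trip failed for ', i);
  end;
  Encoded := Latin1ToUTF8(193);
  WriteLn('193 -> $', HexStr(Ord(Encoded[1]), 2), ' $', HexStr(Ord(Encoded[2]), 2));
end.
```

For 128..191 the lead byte is $C2 and the trail byte equals the input; for 192..255 the lead byte is $C3 and the trail byte is the input minus 64, which matches the table found with the character map.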
Title: Re: Extended ASCII Chars Ord Value Questions
Post by: Thaddy on August 18, 2019, 08:21:44 am
Why not:
Code: Pascal
function cvAnsiToUni(const a:AnsiChar):UnicodeChar;inline;
begin
  Result := a; // compiler converts this.
end;

The unicodechar is assignment-compatible with utf8char, and this code will also work in console apps (it needs a Unicode terminal).
Title: Re: Extended ASCII Chars Ord Value Questions
Post by: Handoko on August 18, 2019, 08:34:52 am
I've just tested your suggestion. Unfortunately, cvAnsiToUni only works on standard ASCII characters. On extended ASCII characters, it shows a question mark symbol.
Title: Re: Extended ASCII Chars Ord Value Questions
Post by: Thaddy on August 18, 2019, 08:49:22 am
Unexpected. It should work.
Title: Re: Extended ASCII Chars Ord Value Questions
Post by: JLWest on August 18, 2019, 08:58:03 am
Quote from: jamie on August 18, 2019, 02:45:51 am
Could also do it like apple does, have a resource table in the bundle folder that it could read per code page and this could be easily edited for corrections or additions.

I thought a resource file would be the thing. Unfortunately, I can't figure out how to set one up or use it.
Title: Re: Extended ASCII Chars Ord Value Questions
Post by: Munair on August 18, 2019, 10:30:55 am
This article explains very well why UTF8 and Extended Ascii (128..255) collide.
https://iconoun.com/articles/collisions/
Title: Re: Extended ASCII Chars Ord Value Questions
Post by: Thaddy on August 18, 2019, 11:20:26 am
I have this:
Code: Pascal
uses iconvenc;

  function AnsiCharToUnicode(const a:ansichar;cp:string ='CP1250'):string;inline;
  begin
    Result:='';
    // should test for iconvert() = 0,
    // but if the conversion fails result is still empty
    iconvert(a,result,cp,'UTF-8');
  end;
Title: Re: Extended ASCII Chars Ord Value Questions
Post by: wp on August 18, 2019, 11:45:57 am
Quote from: JLWest on August 18, 2019, 12:10:38 am
But when I try to convert the characters some don't convert back to the same integer value.

You encode the ANSI characters with WinCPToUTF8 to populate the listbox, but when you want to get the numeric value back you do not call the inverse function, UTF8ToWinCP. This is how it works:

Code: Pascal
procedure TForm1.ListBox1Click(Sender: TObject);
var
  i : Integer = -1;
  Item : String;
  ch: Char;
  p: Integer;
begin
  i := ListBox1.ItemIndex;
  if (i = -1) then
    Exit;

  Item := Listbox1.Items[i];
  p := pos('=', Item);
  Item := Trim(Copy(Item, p+1, MaxInt));

  Label1.Caption := Item;
  Edit1.Text := Item;

  if Item = 'Space' then
    ch := #32
  else
    // UTF8ToWinCP converts the string "Item" to an Ansistring consisting here of 1 character only.
    // However, it cannot be applied to the function ord() because that requires a Char as argument.
    // Therefore, we extract the first (and only) character of the 1-character string "Item".
    ch := UTF8ToWinCP(Item)[1];
  Edit2.Text := IntToStr(Ord(ch));
end;
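
The same round trip can be tried outside a form. The sketch below is an illustration only: it assumes the Windows code page in question is 1252 and therefore uses the explicit CP1252ToUTF8 / UTF8ToCP1252 pair from the LConvEncoding unit instead of WinCPToUTF8 / UTF8ToWinCP, which follow the system code page. It needs the LazUtils package on the unit search path.

```pascal
program RoundTrip;
{$mode objfpc}{$H+}

uses
  LConvEncoding;  // part of LazUtils (assumed to be available)

var
  Ansi: RawByteString;
  Utf8Text: string;
begin
  Ansi := #$C1;                        // 'Á' as a single CP1252 byte (193)
  Utf8Text := CP1252ToUTF8(Ansi);      // what the listbox would receive
  WriteLn('UTF-8 length: ', Length(Utf8Text));

  Ansi := UTF8ToCP1252(Utf8Text);      // inverse conversion, whole string
  WriteLn('Ordinal: ', Ord(Ansi[1]));  // index only after converting back
end.
```

The point is the order of operations: convert the complete UTF-8 string back to the code page first, then take character 1 of the result.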
Title: Re: Extended ASCII Chars Ord Value Questions
Post by: JLWest on August 18, 2019, 05:59:46 pm
@wp
Yeah, I see. I wasn't aware there was a UTF8ToWinCP() function. The code I wrote (or didn't write) was copied from this site and put together.

I wasn't very sure this was going to work. What I was after was a function that I could pass a character to, and if it was an extended character it would return an ASCII one. Something like this:

function TForm1.CharacterSwap(ASTRING : String) : String;
 Var i : Integer ;
  Item : String[1];
 Begin
 ?
  Result := Item;
 end;


Question: what's with this?  UTF8ToWinCP(Item)[1];
Item is a string and CP is a character, so I assume you are passing the first character of Item as a parameter to
UTF8ToWinCP.

Why wouldn't it be written UTF8ToWinCP(Item[1]); ?



Title: Re: Extended ASCII Chars Ord Value Questions
Post by: jamie on August 18, 2019, 06:47:23 pm
The function accepts and returns a string.

In your case, "Item" is a string that represents a single character, so there is no need to index it for the parameter, nor should you.

The return type is also a string, but you are assigning it to a Char, which is only 1 byte. That is why the result is indexed: so that only a single character is returned.

Getting back to your project: it seems you may still be working on the same one as before. Are you really sure the extended set isn't the old 850/437 code page? I don't think 1251 supports all of those, but I could be wrong, been there before  %)
Title: Re: Extended ASCII Chars Ord Value Questions
Post by: wp on August 18, 2019, 08:05:50 pm
Quote from: JLWest on August 18, 2019, 05:59:46 pm
Question What's with this?  UTF8ToWinCP(Item)[1];
Item is a string and CP is a Character so I assume you are passing the first character of Item as a parameter to
UTF8ToWinCP.

Why wouldn't it be written UTF8ToWinCP(Item[1]); ?

"CP" here does not mean "character" but "codepage". The function UTF8ToWinCP converts a string from UTF8 encoding to the codepage used by your Windows installation. While each "character" is 1 byte in the codepage-based encoding, a "character" in UTF8 consists of up to 4 bytes, i.e. the concept of the char datatype is not applicable to UTF8-encoded strings.

For some reason you want to create the character map of the characters on your code page.

Since all Lazarus controls work with UTF8 we convert the codepage-based characters from the Windows codepage to UTF8 (WinCPToUTF8) and display them in a listbox. This way the codepage character 'Á' (ordinal value 193, or $C1) becomes the utf8 string #$C3#$81 (which is displayed as 'Á' in the Listbox) (use the Lazarus character map to verify these values!).

When the user clicks on a listbox item we want to display the ordinal value of the displayed utf8 "character". In order to determine the ordinal value we use the "ord()" function which gets a Pascal char as input parameter. But: the "character" selected in the listbox is not a Pascal char, but a UTF8 string. The string was created by the function WinCPToUTF8, therefore we apply the inverse function UTF8ToWinCP to convert the string from UTF8 to the system code-page: it takes a UTF8 string as input parameter and returns its code-page encoded ansistring counterpart. In above example, the input string would be #$C3#$81, and the output string would be #$C1. Although the output string consists only of a single character it is still a string and thus not accepted by the "ord()" function which wants a char variable. Therefore, we extract the first byte of the string which is the equivalent of a char variable - this happens by applying the "[1]" to the string. Since the string consists of only a single character nothing is lost when doing so.

Therefore, the entire determination of the ordinal value of the selected listbox item has to be done like this (here, step by step):
Code: Pascal
var
  s: String;
  ch: char;
  ordVal: Integer;
...
  // Item is the UTF8-equivalent string of a code-page character.
  s := UTF8ToWinCP(Item);   // s has the encoding of the code page
  ch := s[1];   // use only the 1st character of the code-page string as a Pascal char variable, well, it's the only character here
  ordVal := ord(ch);  // determine the ordinal value of this char variable
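
To see what the individual bytes of such a UTF8 string look like before any conversion, here is a small sketch using plain Free Pascal (no LCL). The helper name ByteValues is invented for this illustration, and the two byte sequences are the UTF8 encodings of 'Á' and 'É'.

```pascal
program FirstByte;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: list the byte values of a string, e.g. '195 129'.
function ByteValues(const S: string): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToStr(Ord(S[i]));
  end;
end;

var
  Item: string;
begin
  Item := #$C3#$81;            // UTF8 bytes of 'Á', as stored in the listbox item
  WriteLn(ByteValues(Item));   // both bytes, not the code-page value 193
  Item := #$C3#$89;            // UTF8 bytes of 'É'
  WriteLn(ByteValues(Item));   // same first byte as for 'Á'
end.
```

Both strings start with byte 195, so Item[1] on its own cannot tell the characters apart; only the complete sequence identifies the character.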

You are asking about some similar sequence:
Code: Pascal
UTF8ToWinCP(Item[1]);
There are two mistakes:
 
Title: Re: Extended ASCII Chars Ord Value Questions
Post by: JLWest on August 19, 2019, 12:16:34 am
Quote from: jamie on August 18, 2019, 06:47:23 pm
Getting back to your project, it seems that you may still be working on the same one you were before, are you really sure the extended set isn't the old 850/437 code page? I don't think 1251 supports all of those but I could be wrong, been there before  %)

Jamie, I'm not really that sure of anything when it comes to character sets, code pages and character conversions.
Title: Re: Extended ASCII Chars Ord Value Questions
Post by: JLWest on August 19, 2019, 12:24:22 am
@All

I'll have to play with this a bit to try and figure it out.

Thanks