
Author Topic: Extended ASCII gone wrong somewhere  (Read 3917 times)

stephanos

  • New Member
  • *
  • Posts: 14
Extended ASCII gone wrong somewhere
« on: August 19, 2021, 10:43:27 pm »
Dear All

I am using Free Pascal Lazarus 2.0.8.  I am writing a simple command line programme so as to test my understanding of and ability to use extended ASCII characters.  However, it has not gone smoothly.  I am using this table as a reference and for the most part it is accurate:
     https://theasciicode.com.ar/

Here is some code with comments about the output
Code: Pascal
program project1;
uses crt, SysUtils;
var
  IsItValid : string;   isItValid2 : ansistring;
  count, size : integer; A : AnsiChar;
begin
  count := 12; A := 'A';
  isItValid := 'BAILE DAì AMIZADE.mp3';
  writeln(IsItValid);             // output of alt + 141 is corrupted but looks like chr(195) ├
  writeln(chr(141));              // output ì as expected
  writeln(isItValid[2]);          // output A as expected
  writeln(isItValid[9]);          // output corrupted but looked like chr(195) ├
  writeln(A);                     // output A as expected
  writeln(count);                 // output 12 as expected
  writeln(IntToStr(count));       // output 12 as expected
  writeln(Ord(A));                // output 65 as expected
  writeln(Ord(isItValid[9]));     // output 195, not expected
  // so I made the string into an ansi string, though I am at the edge of my knowledge here
  isItValid2 := 'BAILE DAì AMIZADE.mp3';
  writeln(IsItValid2); readln;    // output corrupted but looked like chr(195) ├
end.

Alt + 141 is the lowercase letter i with an accent, except when it is placed in a string or an ansistring.  In either kind of string it becomes Alt + 195 ├.

My intention is to validate the file names of my mp3 files for extended ASCII characters.  My player does not read many of these characters, and if one appears in a file name that is written to a playlist file, the file will not play.  Validation will include writing the path/filename to a text file so that the file name can be changed and then used in a playlist file.
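
The check described above can be sketched roughly as follows. This is only an illustration: HasNonAscii is a made-up helper name, and it assumes the file name is held in an ordinary (byte-based) string, so that any byte above 127 signals a character outside plain ASCII. The accented letter is written as explicit byte values so the example does not depend on how the source file is saved.

```pascal
program CheckNames;
{$mode objfpc}{$H+}

// Hypothetical helper: True if any byte of FileName lies outside
// plain ASCII (0..127).
function HasNonAscii(const FileName: string): Boolean;
var
  i: Integer;
begin
  Result := False;
  for i := 1 to Length(FileName) do
    if Ord(FileName[i]) > 127 then
      Exit(True);
end;

begin
  // Plain ASCII name
  WriteLn(HasNonAscii('BAILE DA AMIZADE.mp3'));
  // Same name with two bytes above 127 ($C3 $AC) after 'DA'
  WriteLn(HasNonAscii('BAILE DA'#$C3#$AC' AMIZADE.mp3'));
end.
```

Because the test looks at bytes rather than characters, it gives the same answer whether the name is stored in a single-byte code page or in UTF-8.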

But things are not what they should be.  How can Alt + 141 become Alt + 195?

Any help, pitched at my low level of competence, much appreciated and needed.


Bart

  • Hero Member
  • *****
  • Posts: 4477
    • Bart en Mariska's Webstek
Re: Extended ASCII gone wrong somewhere
« Reply #1 on: August 19, 2021, 10:49:25 pm »
The Lazarus IDE stores everything in UTF8 encoding.
The type String in Lazarus is by default also UTF8.
So, the string contains more bytes than "characters", since the lowercase i with accent is made up of 2 bytes.
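
A small sketch that makes the two bytes visible. It assumes the string holds UTF-8, in which ì (U+00EC) is stored as the byte pair $C3 $AC, that is 195 and 172. The bytes are written out explicitly here so the result does not depend on the encoding of the source file.

```pascal
program ShowBytes;
{$mode objfpc}{$H+}
var
  s: string;
  i: Integer;
begin
  // 'BAILE DAì AMIZADE.mp3' with ì spelled as its two UTF-8 bytes
  s := 'BAILE DA'#$C3#$AC' AMIZADE.mp3';
  // Print the byte values around the accented letter
  for i := 7 to 11 do
    WriteLn(i, ': ', Ord(s[i]));
end.
```

Position 9 holds 195 and position 10 holds 172, which is why Ord(isItValid[9]) in the first post reports 195: indexing a string returns one byte, not one character.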

Bart

stephanos

  • New Member
  • *
  • Posts: 14
Re: Extended ASCII gone wrong somewhere
« Reply #2 on: August 19, 2021, 11:39:09 pm »
Greetings Bart

Thanks for the reply.  I do not fully understand it.

So what do I do about it?

Thanks and wait to hear

winni

  • Hero Member
  • *****
  • Posts: 2715
Re: Extended ASCII gone wrong somewhere
« Reply #3 on: August 20, 2021, 12:41:17 am »
Hi!

Get used to UTF8. It is now 30 years old.

A basic introduction at wikipedia:

https://en.wikipedia.org/wiki/UTF-8

UTF8 unites all different codepages - what you call "extended ASCII" - all together in one system. This cannot be done with one byte per character. A UTF8-Char is between 1 and 4 bytes long.  So it cannot be represented by the Pascal type "char" anymore, but it is now a string.
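
How the 1 to 4 byte length is encoded can be sketched like this. Utf8SeqLen is a hypothetical helper written for illustration, not a function from the RTL or from Lazarus. It only inspects the leading bits of the first byte and does not validate the bytes that follow.

```pascal
program SeqLen;
{$mode objfpc}{$H+}

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with LeadByte, or 0 if the byte cannot start a sequence.
function Utf8SeqLen(LeadByte: Byte): Integer;
begin
  if LeadByte < $80 then
    Result := 1                              // 0xxxxxxx: plain ASCII
  else if (LeadByte and $E0) = $C0 then
    Result := 2                              // 110xxxxx
  else if (LeadByte and $F0) = $E0 then
    Result := 3                              // 1110xxxx
  else if (LeadByte and $F8) = $F0 then
    Result := 4                              // 11110xxx
  else
    Result := 0;                             // continuation byte or invalid
end;

begin
  WriteLn(Utf8SeqLen(Ord('A')));  // ASCII letter
  WriteLn(Utf8SeqLen($C3));       // first byte of ì
  WriteLn(Utf8SeqLen($E2));       // first byte of many math symbols
  WriteLn(Utf8SeqLen($F0));       // first byte of a 4-byte sequence
  WriteLn(Utf8SeqLen($AC));       // second byte of ì, not a lead byte
end.
```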

Unicode support in Lazarus:

https://wiki.freepascal.org/Unicode_Support_in_Lazarus

Winni


lucamar

  • Hero Member
  • *****
  • Posts: 4219
Re: Extended ASCII gone wrong somewhere
« Reply #4 on: August 20, 2021, 08:25:50 am »
Alternatively:
  • Set a {$codepage XXX} in your source so that all the literal strings are treated as SBCS with that codepage (see the charset unit of the RTL for possible values of XXX)
  • Declare your strings as AnsiString(XXX) or RawByteString
See section 3.2.4 - Single-Byte String Types (specifically Code page conversions) of the Free Pascal Reference Guide for more info.
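
A minimal sketch of the second suggestion, assuming FPC 3.0 or later (where code-page-aware AnsiStrings exist). The type name TCP850String and the choice of code page 850 are only examples. StringCodePage from the System unit reports the code page a string currently carries.

```pascal
program CodePageTags;
{$mode objfpc}{$H+}
type
  // Example type: a single-byte string declared as code page 850
  TCP850String = type AnsiString(850);
var
  cp850: TCP850String;
  raw: RawByteString;
begin
  cp850 := 'mp3';
  // A RawByteString takes over whatever code page the source string
  // has, so no conversion is performed on this assignment.
  raw := cp850;
  WriteLn('declared 850 string: ', StringCodePage(cp850));
  WriteLn('RawByteString copy:  ', StringCodePage(raw));
end.
```

Both lines should report 850. Assigning the same data to a string declared with a different code page would instead trigger a conversion, which is the behaviour the Reference Guide section describes.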

Bart

  • Hero Member
  • *****
  • Posts: 4477
    • Bart en Mariska's Webstek
Re: Extended ASCII gone wrong somewhere
« Reply #5 on: August 20, 2021, 12:32:40 pm »
Quote from: stephanos
Thanks for the reply.  I do not fully understand it.

In UTF8 all plain ASCII (so up to #127) are stored as a single byte.
All other "characters" are stored as 2, 3 or 4 byte sequences.
This makes iteration through a UTF8 encoded string more complex than with old style single byte encodings (a.k.a. codepages).

The LazUTF8 unit from Lazarus has various functions to handle UTF8 encoded strings.
E.g. UTF8Length(): it returns the length in "utf8 characters", instead of the length in bytes (as Length() does): Utf8Length('Ä') is 1, whilst Length('Ä') is 2.
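
The difference can be tried with a few lines. This sketch assumes the project has the LazUtils package available (it provides the LazUTF8 unit), and it writes Ä as its UTF-8 bytes $C3 $84 so that the source file encoding does not matter.

```pascal
program LenDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8;  // part of the LazUtils package that ships with Lazarus
var
  s: string;
begin
  s := #$C3#$84;            // 'Ä' (U+00C4) as two UTF-8 bytes
  WriteLn(Length(s));       // length in bytes
  WriteLn(UTF8Length(s));   // length in UTF-8 codepoints
end.
```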

UTF8 encoded strings will be displayed as expected in any visual component of Lazarus.

When you write to the console, you have to understand that the console has a different codepage altogether. It can only display 256 different characters, and it treats strings as being single byte encoded.
So your "i with accent", which consists of 2 bytes, is treated as 2 separate chars, and how they look on the console is dependent on your codepage.
It will look different in my codepage (Dutch locale) than on e.g. a Windows with Russian locale settings.

As Lucamar suggested: you can fight the Lazarus system and declare your strings as being of a certain codepage and use RawByteString to prevent the compiler from doing unwanted codepage conversion. But in the long run, you had better go with the flow.

As you may know Delphi uses WideStrings (or UnicodeStrings) by default.
It is tempting to use that so you can iterate over a string as you used to do, assuming that a single "character" is defined in each single WideChar, but you would be wrong.
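
A short sketch of why that assumption fails. It uses U+1D11E (the musical G clef) as an example of a character outside the Basic Multilingual Plane. In UTF-16 it is stored as the surrogate pair $D834 $DD1E, so one visible character takes two WideChars.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}
var
  u: UnicodeString;
begin
  // Build U+1D11E from its two UTF-16 code units
  u := WideChar($D834);
  u := u + WideChar($DD1E);
  // One character on screen, but two WideChars in the string
  WriteLn(Length(u));
  // The first unit lies in the high-surrogate range $D800..$DBFF
  WriteLn((Ord(u[1]) >= $D800) and (Ord(u[1]) <= $DBFF));
end.
```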

Note: I write "character" where I mean the visual glyph on the screen we normally interpret as being a character (and this probably only holds for western languages, not for e.g. Farsi).
The term character is a bit fuzzy when it comes to Unicode.

Bart

winni

  • Hero Member
  • *****
  • Posts: 2715
Re: Extended ASCII gone wrong somewhere
« Reply #6 on: August 20, 2021, 02:53:47 pm »
Hi!

I made a little project for you that helps you to understand UTF8.

You can enter text in the edit field, and the stringgrid shows how many bytes each UTF8-char occupies.

For instance these math symbols all contain 3 bytes:

⟃⟆⟐⟒⟓⟦⟧⟬⟭

Project and screenshot attached

Winni

Alextp

  • Hero Member
  • *****
  • Posts: 1416
    • UVviewsoft
Re: Extended ASCII gone wrong somewhere
« Reply #7 on: August 20, 2021, 04:01:18 pm »
I considered Bart's post a useful one, so I posted it (with fixes) to
https://wiki.freepascal.org/String#String_type_in_Lazarus

stephanos

  • New Member
  • *
  • Posts: 14
Re: Extended ASCII gone wrong somewhere
« Reply #8 on: August 24, 2021, 01:33:49 am »
Dear All

Solved and this is how.  From the responses I learnt that displaying extended ASCII to a console is problematic.  As the eventual programme is graphical I did my experiment in a graphical environment.

I added LazUTF8 to ‘uses’.  Made a string variable called UTF8, created a string whose 8th character had ì in it (Alt + 141).  Then onClick, output the 8th character to a label.caption.  The output was correct.
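
For readers who want to repeat the experiment, here is a rough console-only sketch of the same idea. The string contents are invented for illustration. UTF8Copy from LazUTF8 counts in UTF-8 characters rather than bytes, so asking for 1 character at position 8 returns both bytes of ì. In a form, the result would be assigned to something like Label1.Caption (a hypothetical control name) instead of being measured.

```pascal
program NthChar;
{$mode objfpc}{$H+}
uses
  LazUTF8;  // part of the LazUtils package that ships with Lazarus
var
  s, ch: string;
begin
  // Seven ASCII letters, then ì as its UTF-8 bytes, then more ASCII
  s := 'ABCDEFG'#$C3#$AC'XYZ';
  // Take 1 character starting at character position 8
  ch := UTF8Copy(s, 8, 1);
  WriteLn(Length(ch));   // number of bytes in the extracted character
  WriteLn(Ord(ch[1]));   // first byte
  WriteLn(Ord(ch[2]));   // second byte
end.
```

By contrast, s[8] on its own would return only the first of the two bytes, which is the corruption seen in the console test.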

Being able to see winni’s code was useful, as was Bart’s comment that the console has a different codepage altogether and treats strings as being single byte encoded.

Thanks also to everyone else and sorry most of the other posts did not mean anything to me

 
