
Author Topic: Extended ASCII use - 2  (Read 12251 times)

raymond

  • New member
  • *
  • Posts: 7
Extended ASCII use - 2
« on: January 06, 2022, 03:28:40 pm »
In the UTF8 code set, 62 (55%) of the characters are 'European' characters!
(fpc for FreeDOS + code set 850 was nirvana.)
Does anybody KNOW how to manipulate strings/arrays of 'European' characters?
Proven examples, please. I would faint with gratitude. Many thanks.


JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: Extended ASCII use - 2
« Reply #1 on: January 06, 2022, 03:53:40 pm »
What do you mean? UTF-8 encoding supports the whole of Unicode.
If by 'European' characters you mean 7-bit ASCII, then it gets easy, because UTF-8 is compatible with 7-bit ASCII.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Extended ASCII use - 2
« Reply #2 on: January 06, 2022, 05:05:39 pm »
Use unit LazUTF8.

If you are working on a terminal/console app, you need to add the LazUtils package, which contains LazUTF8. You do that in:
  Project - Project Inspector
    Add - Add New Requirement
      Type LazU and choose LazUtils

What you call a character is actually more of a string.

Use UTF8Copy, UTF8Insert, etc.


engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Extended ASCII use - 2
« Reply #3 on: January 06, 2022, 05:14:02 pm »
Code: Pascal
program Project1;

{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}{$IFDEF UseCThreads}
  cthreads,
  {$ENDIF}{$ENDIF}
  Classes
  ,LazUTF8
  { you can add units after this };

var
  s:string;
  s1,s2:string;
begin
  s := 'ÄÇ';
  WriteLn(s);
  WriteLn(Length(s));//===> 4
  WriteLn(UTF8Length(s));//===> 2
  s1:=UTF8Copy(s,1,1);
  WriteLn(s1); // Ä
  s2:=UTF8Copy(s,2,1);
  WriteLn(s2); // Ç
  UTF8Insert(s2,s,1); // s is ÇÄÇ
  WriteLn(s);
  UTF8Delete(s,2,1);  // s is ÇÇ
  WriteLn(s);
  ReadLn;
end.

Going by your previous post: add the cwstring unit if you are using Linux.
« Last Edit: January 06, 2022, 05:17:40 pm by engkin »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Extended ASCII use - 2
« Reply #4 on: January 06, 2022, 05:25:37 pm »
Quote from: raymond
In the UTF8 code set, 62 (55%) of the characters are 'European' characters!

Not sure where you got that.

UTF8 is an ASCII-compatible encoding of Unicode. A Unicode codepoint can take up to 4 bytes when encoded as UTF8.

A is one byte and is ASCII compatible.
Ä is two bytes and is not compatible with ASCII.

A "character" can use more than one codepoint.

The same "character" could be represented with different codepoints.
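
To make the bytes / codepoints / "character" distinction concrete, here is a minimal sketch in plain FPC (no Lazarus units needed). CountCodePoints and Report are hypothetical helpers written for this illustration, not RTL functions, and the input is assumed to be valid UTF8 given as raw bytes so that the result does not depend on the encoding of the source file.

```pascal
program Utf8Counts;

{$mode objfpc}{$H+}

// Hypothetical helper (not part of the RTL): counts codepoints by skipping
// UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx.
// Assumes the input is valid UTF-8; no validation is done.
function CountCodePoints(const Bytes: array of Byte): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to High(Bytes) do
    if (Bytes[i] and $C0) <> $80 then
      Inc(Result);
end;

// Hypothetical helper: prints byte count and codepoint count for one sample.
procedure Report(const Name: string; const Bytes: array of Byte);
begin
  WriteLn(Name, ': ', Length(Bytes), ' byte(s), ',
          CountCodePoints(Bytes), ' codepoint(s)');
end;

const
  PlainA:      array[0..0] of Byte = ($41);           // 'A'
  Precomposed: array[0..1] of Byte = ($C3, $84);      // U+00C4, 'Ä' as one codepoint
  Decomposed:  array[0..2] of Byte = ($41, $CC, $88); // 'A' + U+0308 combining diaeresis

begin
  Report('A            ', PlainA);      // 1 byte(s), 1 codepoint(s)
  Report('Ä precomposed', Precomposed); // 2 byte(s), 1 codepoint(s)
  Report('Ä decomposed ', Decomposed);  // 3 byte(s), 2 codepoint(s)
end.
```

The last two samples render as the same visible character, yet differ in both byte count and codepoint count.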

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: Extended ASCII use - 2
« Reply #5 on: January 06, 2022, 05:26:11 pm »
Yes. The sample code applies to any UTF-8 text. I didn't quite understand what 'European' characters meant.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Extended ASCII use - 2
« Reply #6 on: January 06, 2022, 05:28:39 pm »
Me neither, but I looked at his other post.

raymond

  • New member
  • *
  • Posts: 7
Re: Extended ASCII use - 2
« Reply #7 on: January 11, 2022, 02:52:51 pm »
While I'm evaluating the replies, I would say:
ASCII means 'American Standard Code ..' and has 128 values.
An accented character only occurs in a European language and features in the
'Extended' part of ASCII (in the UTF8 code set).
My interest is in using fpc without Lazarus, as was possible with fpc and FreeDOS and code set IBM 850, which covers all the 'European' characters.
UTF8 is a single-byte code set and should be usable as such.
In fpc, can one use UTF8 as single bytes in arrays?

tetrastes

  • Sr. Member
  • ****
  • Posts: 473
Re: Extended ASCII use - 2
« Reply #8 on: January 11, 2022, 03:20:11 pm »
UTF8 is NOT a single-byte code set. It uses one byte only for the first 128 (ASCII) characters. For all others it uses 2 to 4 bytes.
So if you want what you call the "'extended' part of ASCII" to be one byte, change your system codepage to IBM850, and fpc will work as you want. ::)
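
As a rough illustration of the variable width, the sketch below walks a byte array and reads each sequence's length from its lead byte. It is plain FPC; SeqLen is a hypothetical helper written for this example, and it only inspects the lead byte (it does not check that the continuation bytes are present or valid).

```pascal
program Utf8Walk;

{$mode objfpc}{$H+}

// Hypothetical helper (not part of the RTL): length in bytes of the UTF-8
// sequence that starts with Lead. Returns 0 if Lead is a continuation byte
// or not a valid lead byte.
function SeqLen(Lead: Byte): Integer;
begin
  if Lead < $80 then
    Result := 1                    // 0xxxxxxx: plain ASCII
  else if (Lead and $E0) = $C0 then
    Result := 2                    // 110xxxxx
  else if (Lead and $F0) = $E0 then
    Result := 3                    // 1110xxxx
  else if (Lead and $F8) = $F0 then
    Result := 4                    // 11110xxx
  else
    Result := 0;
end;

const
  // 'A', U+00C4 (Ä), U+20AC (euro sign), U+1F600 (an emoji), as raw UTF-8 bytes.
  Sample: array[0..9] of Byte =
    ($41, $C3, $84, $E2, $82, $AC, $F0, $9F, $98, $80);

var
  i, n: Integer;
begin
  i := 0;
  while i <= High(Sample) do
  begin
    n := SeqLen(Sample[i]);
    if n = 0 then
      Break;                       // malformed input: stop instead of guessing
    WriteLn('offset ', i, ': ', n, ' byte(s)');
    Inc(i, n);
  end;
  // Prints lengths 1, 2, 3 and 4 at offsets 0, 1, 3 and 6.
end.
```

So a UTF8 text can be held in an array of bytes, but one array element is not one character.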

Thaddy

  • Hero Member
  • *****
  • Posts: 14211
  • Probably until I exterminate Putin.
Re: Extended ASCII use - 2
« Reply #9 on: January 11, 2022, 03:53:39 pm »
Quote from: raymond
UTF8 is a single-byte code set and should be usable as such.
In fpc, can one use UTF8 as single bytes in arrays?

No. UTF8 is NOT a single-byte code set!
Indeed, CP_UTF8 can use up to 4 bytes per char, so for e.g. arrays and database fields you will have to reserve 4 bytes per char! This is often overlooked.
So the answer is no. You cannot treat UTF8 as single-byte.
« Last Edit: January 11, 2022, 03:57:19 pm by Thaddy »
Specialize a type, not a var.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: Extended ASCII use - 2
« Reply #10 on: January 11, 2022, 05:25:13 pm »



C̵̢̦̖̭̯̮̦̼̺̮̮̺̳̹̰̤̆̿̈́̓͆̄͌̉̂̏̑̌͋͝ŗ̴̡̻̫̻͖͍̙̥̬͔̩̖̰̱̈̌́̈́̀͛͋͌̈́͂̿̃͊̿͑̀̚à̴̢̦͓͕͍̘̰̯͔͕̈́͋̈̃̀̒̐͛͜ͅz̸̢̧̛̦̘̥̥̩͓͍̖͇͔͍̦͚̤͒̐̅̄͊͋̍͑́͒̅̎͝͝ỳ̸̨̫̳̤͕̬͈̹̭̜̠̻̹̅ ̵̢̨̧͉̥̹̜͔̬̬̞́̽̉̓̇̅̾͝ŝ̸͔̲̹̄̚ͅt̸̡̯͎̳̩̪͍͈̔̍̈́͘͜ũ̶̞͕̠̺̈́́̌͑̐̿̔̌͒͋̋͌̕͠f̸̡͚̀̐̉̾̋̑̕͠͠f̷̛̩͔̰̄̊͗̋͗̆̅̌͒̄̒͐͛͐͝,̸̢̮͙̜̬͍̱͈͙̟̳̬̑͗̾́̋̄̈́͜ͅ ̸̟͈͎̘͕̙̺͖̙̝̺̬͑ȩ̷̧̢͉͍͖͚̗̻̳͎̽͛͗̀̽͂͌̅͌͊̈̈̋̈́͘͘̚h̵͖̘͇̦̞̪̓̈̈́̄̆͗̅̆̓̿̒̄̂͋̓̚?̷̟̪͍̜̳̹͖̯͊̏̑͠͝



16 chars, 676 bytes.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: Extended ASCII use - 2
« Reply #11 on: January 11, 2022, 05:26:16 pm »
@raymond, I suggest you forget IBM 850 and all other 'Extended ASCII' codepages. They lead to all kinds of agony when data is exchanged.
Just convert your data to Unicode. It solved such problems decades ago.

[Edit] Crazy stuff, eh?
Try to do that with Extended ASCII codepages!
« Last Edit: January 11, 2022, 05:30:30 pm by JuhaManninen »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Extended ASCII use - 2
« Reply #12 on: January 11, 2022, 07:58:15 pm »
@raymond,

Quote from: raymond
My interest is in using fpc without Lazarus, as was possible with fpc and FreeDOS and code set IBM 850, which covers all the 'European' characters.

1. Make sure to tell the compiler which codepage/encoding your source code uses. For instance, if a file is saved in the codepage you want, IBM 850, then add this to the top of the file:
Code: Pascal
program project1;

{$mode objfpc}{$H+}
{$Codepage cp850}//<--- Tell the compiler about the codepage used in this file

uses

2. Make sure your DOS session is using the same codepage by calling chcp (e.g. chcp 850).

What output do you get for this test project:
Code: Pascal
program project1;

{$mode objfpc}{$H+}
{$Codepage cp850}

var
  s:array[0..3] of String;
  ss:string;
  i:byte;
  c:char;
begin
  WriteLn('DefaultSystemCodePage: ',DefaultSystemCodePage);
  WriteLn('TextRec(Output).CodePage: ',TextRec(Output).CodePage);
  ss:='ÄÇýÝ';
  WriteLn(ss);

  //CP850: #$80..#$FF
  s[0]:='ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø׃';
  s[1]:='áíóúñѪº¿®¬½¼¡«»░▒▓│┤ÁÂÀ©╣║╗╝¢¥┐';
  s[2]:='└┴┬├─┼ãÃ╚╔╩╦╠═╬¤ðÐÊËÈıÍÎÏ┘┌█▄¦Ì▀';
  s[3]:='ÓßÔÒõÕµþÞÚÛÙýݯ´­±‗¾¶§÷¸°¨·¹³²■';
  for i:=$80 to $FF do
  begin
    if (i mod 32)=0 then
    begin
      WriteLn();
      WriteLn();
      WriteLn(s[i div 32 - 4]);
    end;
    if i in [$07,$08,$09,$0A,$0D] then
      c:=' '
    else
      c:=char(i);
    Write(c);
  end;
end.

tetrastes

  • Sr. Member
  • ****
  • Posts: 473
Re: Extended ASCII use - 2
« Reply #13 on: January 11, 2022, 08:48:48 pm »
Code: Pascal
  for i:=$80 to $FF do
  begin
    if (i mod 32)=0 then
    begin
      WriteLn();
      WriteLn();
      WriteLn(s[i div 32 - 4]);
    end;
    if i in [$07,$08,$09,$0A,$0D] then
      c:=' '
    else
      c:=char(i);   // ! This is invalid in utf8 and leads to runtime error 101
    Write(c);
  end;

Thaddy

  • Hero Member
  • *****
  • Posts: 14211
  • Probably until I exterminate Putin.
Re: Extended ASCII use - 2
« Reply #14 on: January 11, 2022, 09:01:35 pm »
Note that if you do not use Lazarus (as you stated) but only FPC, you can use {$mode delphiunicode}, and Char will then be UnicodeChar. But that is by no means single-byte: mostly double-byte (the UCS-2 part) and possibly, again, 4 bytes.
This mode will make your life easier, though.
Technically this example covers UCS-2, the predecessor of today's UTF-16; a 4-byte UnicodeChar is not implemented as such:
Code: Pascal
program size;
{$mode delphiunicode}
begin
  writeln(SizeOf(Char));
end.
In an ideal world it should return 4, imho.  >:(
OTOH this was defined before the Unicode extensions in the standard, and fpc can handle the 4-byte case as well, just like Delphi.
That does mean you still have to allow for 4 bytes per character (a surrogate pair), not the two that my example prints.
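
A small sketch of that 4-byte case, assuming FPC 3.x with {$mode delphiunicode}: the string is built from numeric code unit values rather than a literal, so the result does not depend on the source file encoding. U+1F600 is just an arbitrary example of a codepoint outside the UCS-2 range.

```pascal
program SurrogatePair;

{$mode delphiunicode}

var
  s: string; // UnicodeString in this mode: a sequence of 16-bit code units
begin
  // U+1F600 does not fit in one 16-bit unit, so UTF-16 stores it as the
  // surrogate pair D83D DE00.
  SetLength(s, 2);
  s[1] := Char($D83D);
  s[2] := Char($DE00);

  WriteLn('SizeOf(Char): ', SizeOf(Char));                  // 2
  WriteLn('Length(s):    ', Length(s));                     // 2 code units
  WriteLn('Bytes used:   ', Length(s) * SizeOf(Char));      // 4, for one codepoint
end.
```

Length counts 16-bit code units here, not codepoints, which is why one codepoint reports a length of 2.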
« Last Edit: January 11, 2022, 09:57:55 pm by Thaddy »

 
