
Programming => General => Topic started by: dculp on November 13, 2019, 01:03:35 pm

Title: How to write all unicode chars?
Post by: dculp on November 13, 2019, 01:03:35 pm

I'm trying to write all unicode chars in a particular character set. The following code works for the first 255 chars. After that it just starts recycling the same chars even though i continues to increase. I'm pretty sure the problem is with chr() that expects a byte argument but I'm not sure what to use instead.

Running under Windows 7+

Thanks,
Don C.

Code: Pascal

program Write_unicode_chars_20a;
{$mode objfpc}{$H+}{$apptype console}
 
uses
   crt, // for other purposes
   sysutils; // for inttohex
 
var
   i: integer;
 
begin
for i:= 1 to $2666 do
   begin
   writeln(i:5, inttohex(i,1):5, chr(i):5);
   if ((i mod 20) = 0) then readln; // pause
   end;
readln;
end.
 
 

Title: Re: How to write all unicode chars?
Post by: Thaddy on November 13, 2019, 01:20:23 pm

- chr() returns an AnsiChar (a single byte)
- A Unicode character needs more than that: in UTF-16 it takes 2 bytes at minimum and 4 bytes at maximum
- You need a console that supports unicode
So basically fix the above.

Title: Re: How to write all unicode chars?
Post by: dculp on November 13, 2019, 01:43:01 pm

Quote from: Thaddy on November 13, 2019, 01:20:23 pm

- You need a console that supports Unicode <== don't understand
So basically fix the above.

Suggestions on how to fix?

Title: Re: How to write all unicode chars?
Post by: Martin_fr on November 13, 2019, 01:44:33 pm

You can try widechar, and convert to utf8 (if you need utf8).
Or you need to create the byte-sequences yourself.

Btw, your code works at best for the first 128. After that it may print ansichars, but not unicode.
Unicode can be encoded in utf8, utf16 and others. For non-widestring types, it should be utf8.
In utf8 the codepoint 128 (U+0080) is encoded as 2 bytes: char($c2) + char($80)
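
To illustrate "create the byte-sequences yourself", here is a minimal sketch of a hand-written encoder. CodePointToUtf8 and BytesToHex are hypothetical helper names, not RTL or LazUtils routines; the sketch prints the bytes as hex so that the result does not depend on the console code page.

```pascal
program EncodeUtf8Sketch;
{$mode objfpc}{$H+}

uses
  SysUtils; // for IntToHex

// Hypothetical helper: encodes one Unicode scalar value as UTF-8 bytes.
// Surrogates ($D800..$DFFF) and values above $10FFFF yield an empty string.
function CodePointToUtf8(cp: Cardinal): RawByteString;
begin
  Result := '';
  if ((cp >= $D800) and (cp <= $DFFF)) or (cp > $10FFFF) then
    Exit;
  if cp < $80 then
  begin
    SetLength(Result, 1);                              // 0xxxxxxx
    Result[1] := AnsiChar(Byte(cp));
  end
  else if cp < $800 then
  begin
    SetLength(Result, 2);                              // 110yyyyy 10xxxxxx
    Result[1] := AnsiChar(Byte($C0 or (cp shr 6)));
    Result[2] := AnsiChar(Byte($80 or (cp and $3F)));
  end
  else if cp < $10000 then
  begin
    SetLength(Result, 3);                              // 1110zzzz 10yyyyyy 10xxxxxx
    Result[1] := AnsiChar(Byte($E0 or (cp shr 12)));
    Result[2] := AnsiChar(Byte($80 or ((cp shr 6) and $3F)));
    Result[3] := AnsiChar(Byte($80 or (cp and $3F)));
  end
  else
  begin
    SetLength(Result, 4);                              // 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    Result[1] := AnsiChar(Byte($F0 or (cp shr 18)));
    Result[2] := AnsiChar(Byte($80 or ((cp shr 12) and $3F)));
    Result[3] := AnsiChar(Byte($80 or ((cp shr 6) and $3F)));
    Result[4] := AnsiChar(Byte($80 or (cp and $3F)));
  end;
end;

// Hypothetical helper: shows the raw bytes as space-separated hex pairs.
function BytesToHex(const s: RawByteString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(s[i]), 2);
  end;
end;

begin
  WriteLn(BytesToHex(CodePointToUtf8($80)));   // prints C2 80
  WriteLn(BytesToHex(CodePointToUtf8($25BA))); // prints E2 96 BA
end.
```

Whether the encoded string then shows up as one glyph still depends on the terminal and font.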

Also that will give you codepoints. It will not give you chars.
Unicode U+0308 https://www.fileformat.info/info/unicode/char/0308/index.htm is not a char.
It modifies the previous codepoint, and together they become a new char.
You can add many such modifiers, so there is no simple loop to print them all...

And other codepoints are surrogates, and can not be used on their own.
And there are gaps too. Not all numbers are defined. (but may get defined in future).

Also there are control chars.
https://www.fileformat.info/info/unicode/char/200f/index.htm
Will not print a char. But the rest of the line (in the same writeln statement) will be written in right to left direction. (if your console supports this)

Package LazUtils unit LazUtf8 has UnicodeToUTF8()
This will get you the codepoint as string.
You still need the terminal to support utf8.
And you still need to deal with combining/surrogates/gaps...
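
The caveats above (control chars, combining marks, surrogates) could be handled with a crude filter before printing. This is only a sketch: IsStandaloneCandidate is a hypothetical name, the ranges below are a small hand-picked subset, and unassigned gaps are not detected at all (that would need data from the Unicode Character Database).

```pascal
program FilterCodePointsSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: returns False for code points that are known not to
// print as a character on their own. Deliberately incomplete.
function IsStandaloneCandidate(cp: Cardinal): Boolean;
begin
  Result := True;
  if cp > $10FFFF then
    Exit(False);                                  // outside the Unicode range
  if (cp < $20) or ((cp >= $7F) and (cp <= $9F)) then
    Exit(False);                                  // C0/C1 control codes
  if (cp >= $D800) and (cp <= $DFFF) then
    Exit(False);                                  // surrogates
  if (cp >= $0300) and (cp <= $036F) then
    Exit(False);                                  // combining diacritical marks, e.g. U+0308
  if (cp = $200E) or (cp = $200F) or
     ((cp >= $202A) and (cp <= $202E)) then
    Exit(False);                                  // directional controls, e.g. U+200F
end;

var
  i, kept: Integer;
begin
  kept := 0;
  for i := 1 to $2666 do
    if IsStandaloneCandidate(i) then
      Inc(kept);
  WriteLn('Candidates kept in 1..$2666: ', kept);
end.
```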

Title: Re: How to write all unicode chars?
Post by: dculp on November 13, 2019, 02:04:02 pm

OK, let me simplify. Let's say that I just want to write a single character -- e.g., a right-facing solid pointer (U+25BA according to Windows CharMap.exe for either Arial, Consolas, or Lucida Console fonts). How would I do this? (Short code would be helpful.)

Title: Re: How to write all unicode chars?
Post by: Martin_fr on November 13, 2019, 02:35:33 pm

If you are on windows (tested win10) and if your font supports that char....

Code: Pascal

program Project1;
{$codepage utf8}
uses windows;
begin
  SetConsoleOutputCP(CP_UTF8);
  writeln(#$e2#$96#$ba);
  readln;
end.
 

The byte sequence is here https://www.fileformat.info/info/unicode/char/25BA/index.htm
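
As an untested variation, the bytes could be derived from the code point instead of being typed in by hand. The sketch below assumes FPC 3.x, where SetTextCodePage is available and the Output file has its own code page; BmpToUtf8 is a hypothetical helper name, and it only handles code points below $10000 that are not surrogates.

```pascal
program ShowPointerSketch;
{$mode objfpc}{$H+}

{$IFDEF WINDOWS}
uses
  Windows; // for SetConsoleOutputCP
{$ENDIF}

// Hypothetical helper: UTF-8 encoding of a single BMP code point.
function BmpToUtf8(cp: Word): UTF8String;
var
  u: UnicodeString;
begin
  u := WideChar(cp);       // one UTF-16 code unit
  Result := UTF8Encode(u); // RTL conversion UTF-16 -> UTF-8
end;

begin
  {$IFDEF WINDOWS}
  SetConsoleOutputCP(CP_UTF8);
  {$ENDIF}
  // Assumption: marking Output as UTF-8 keeps the RTL from converting
  // the string to another code page on write.
  SetTextCodePage(Output, CP_UTF8);
  WriteLn(BmpToUtf8($25BA));
end.
```

As with the hard-coded version, the console font still has to contain the glyph.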

Title: Re: How to write all unicode chars?
Post by: winni on November 13, 2019, 06:21:40 pm

Hi!

You can't show the UTF8-chars with a single linear loop like you do.

Because of the definition there are "holes" which are illegal values.

Have a look at the bit pattern of the UTF8-Code and the related hex values:

Code: Text

(* Conversion table: scalar value <-> UTF-8 bytes =-=-=-=-=-=-=-=-=-=-=-=
 
   Scalar Value (bits)         UTF-8 encoding (bits)
                              First Byte  Second Byte Third Byte  Fourth Byte
A 0xxxxxxx                    0xxxxxxx
B 00000yyy yyxxxxxx           110yyyyy    10xxxxxx
C zzzzyyyy yyxxxxxx           1110zzzz    10yyyyyy    10xxxxxx
D 000uuuuu zzzzyyyy yyxxxxxx  11110uuu    10uuzzzz    10yyyyyy    10xxxxxx
 
-=-=-=-=-=--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*)

This means
* If you get a byte with a leading bit=0, this is a One-Byte-UTF8-Code
* else: the starting bits of the first byte tell you the length of the code:
- 110 = 2 bytes
- 1110 = 3 bytes
- 11110 = 4 bytes

So as you can see from the pattern above there is never a leading zero bit at the second, third or fourth byte. A leading zero bit can only be at the start of a One-Byte-Char.

So you see: a lot of "holes" in the utf8 space.
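
The rules above can be sketched as a small function that classifies a byte by its leading bits. Utf8SequenceLength is a hypothetical name, and the check is deliberately loose: a few byte values that can never start a valid sequence (such as $C0, $C1 and $F5 and above) are not rejected here.

```pascal
program LeadByteSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: length of the UTF-8 sequence announced by a first byte.
// Returns 0 for a byte that cannot be a first byte (10xxxxxx or 11111xxx).
function Utf8SequenceLength(lead: Byte): Integer;
begin
  if (lead and $80) = $00 then
    Result := 1              // 0xxxxxxx
  else if (lead and $E0) = $C0 then
    Result := 2              // 110xxxxx
  else if (lead and $F0) = $E0 then
    Result := 3              // 1110xxxx
  else if (lead and $F8) = $F0 then
    Result := 4              // 11110xxx
  else
    Result := 0;             // continuation byte or invalid
end;

begin
  WriteLn(Utf8SequenceLength($41)); // prints 1
  WriteLn(Utf8SequenceLength($E2)); // prints 3 (first byte of U+25BA)
  WriteLn(Utf8SequenceLength($96)); // prints 0 (continuation byte)
end.
```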

A page where all UTF8-Codeblocks are listed is:

https://www.utf8-chartable.de/unicode-utf8-table.pl

Don't overlook the combobox with all blocks - third row from top in the box.

Winni

Title: Re: How to write all unicode chars?
Post by: winni on November 13, 2019, 10:29:45 pm

Hi!

To avoid terminal trouble I made a small procedure to show "all" of the UTF8-chars between
1 and $2666.

For speed reasons - I think - the illegal characters are not sorted out by the Lazarus UTF8 routines but show the substitute glyph like
" ".

Here we go:

Code: Pascal

uses ..... LazUTF8;
 
procedure TForm1.Button1Click(Sender: TObject);
var
  i: integer;
  hex: string;
  utf8, tmp: string;
  s: string = '';
begin
  for i := 1 to $2666 do
  begin
    utf8 := UnicodeToUTF8(i);   // code point -> UTF-8 encoded string
    hex := IntToHex(i, 4);
    tmp := IntToStr(i) + ' / ' + 'U+' + hex + ' / ' + utf8;
    s := s + tmp + LineEnding;
    if i mod 20 = 0 then        // show 20 lines per dialog
    begin
      ShowMessage(s);
      s := '';
    end; // mod
  end; // for
  if s <> '' then               // show the rest ($2666 is not a multiple of 20)
    ShowMessage(s);
end;
 

The forum software is too clever and deletes the substitute glyph ...

Winni

Title: Re: How to write all unicode chars?
Post by: dculp on November 13, 2019, 10:36:19 pm

Quote from: Martin_fr on November 13, 2019, 02:35:33 pm

Code: Pascal
program Project1;
{$codepage utf8}
uses windows;
begin
SetConsoleOutputCP(CP_UTF8);
writeln(#$e2#$96#$ba);
readln;
end.

Under Windows 10 your code gives the desired right filled pointer for either Lucida Console or raster fonts.

For general purposes (outside of this demo) I need the crt unit. However, if crt is used then your code gives three individual characters (shown in the attached image). I assume that these are associated with the individual #$e2, #$96, and #$ba. I'm not sure why the writeln is "corrupted" by crt, since your code runs fine without the crt unit. In any case, is there a way to output this char to the screen without using a writeln? Other thoughts? (Note - console app)

Title: Re: How to write all unicode chars?
Post by: Bart on November 13, 2019, 10:38:54 pm

IIRC the crt unit resets the console codepage upon every write it does.

Bart

Title: Re: How to write all unicode chars?
Post by: winni on November 13, 2019, 11:28:06 pm

Hi!

CRT and writeln are not a good team under these circumstances.

But

Code: Pascal

procedure ttySendStr(const s:string); 

is just sending out the chars = bytes.

And when your whole action is done then send a

Code: Pascal

ttyFlushOutput;

to send the last chars from the buffer.

Winni

Title: Re: How to write all unicode chars?
Post by: dculp on November 14, 2019, 06:41:47 pm

winni --

Your suggested procedures are in the unix crt.pp. I'm running under Windows 7+. I couldn't find these procedures in the windows crt.pp. I searched the windows crt.pp for something similar but didn't see anything, even among those procedures that are only in the implementation section. Perhaps I have overlooked something? Any suggestions?

Title: Re: How to write all unicode chars?
Post by: winni on November 14, 2019, 08:35:25 pm

Hmm

Sorry, I did not realize that you need a win solution.

What is needed from the CRT unit??
Perhaps we can find a way to get around CRT.

Winni

Title: Re: How to write all unicode chars?
Post by: dculp on November 15, 2019, 02:54:35 am

Quote from: Bart on November 13, 2019, 10:38:54 pm

IIRC the crt unit resets the console codepage upon every write it does.

Bart

Is there any way to circumvent this? Then I could set my own codepages as needed (e.g., before and after writes). (I couldn't find anything when going through the crt.pp unit but perhaps I don't know what to look for.)

Title: Re: How to write all unicode chars?
Post by: winni on November 15, 2019, 03:34:43 am

Hi!

That's the reason why I'm asking:

What do you need so urgently from CRT??
gotoxy, ClrScr or other procedures?

I want to find out how we can get around the CRT unit.

Winni

Title: Re: How to write all unicode chars?
Post by: Martin_fr on November 15, 2019, 04:15:38 am

Maybe a look at the beginning of the crt unit can help ....

Code: Pascal

procedure SetSafeCPSwitching(Switching:Boolean);
procedure SetUseACP(ACP:Boolean);
 

Code: Pascal

    UseACP        : Boolean; (* True means using active process codepage for
                                console output, False means use the original
                                setting (usually OEM codepage). *)
    SafeCPSwitching : Boolean; (* True in combination with UseACP means that
                                  the console codepage will be set on every
                                  output, False means that the console codepage
                                  will only be set on Initialization and
                                  Finalization *)
 

I haven't done any tests, but those global vars sound like they are all you need. Though no idea if they have side effects.

Title: Re: How to write all unicode chars?
Post by: lucamar on November 15, 2019, 11:45:00 am

Quote from: dculp on November 15, 2019, 02:54:35 am

Quote from: Bart on November 13, 2019, 10:38:54 pm
IIRC the crt unit resets the console codepage upon every write it does.

Is there any way to circumvent this? Then I could set my own codepages as needed (e.g., before and after writes).

This bit of code should do the trick:

Code: Pascal

Assign(Input, ''); Reset(Input);
Assign(Output, ''); Rewrite(Output);

Any Read/Write from/to console should then go through the "normal" channels rather than CRT's replacements.

Note, though, that I've not tested it.

Title: Re: How to write all unicode chars?
Post by: dculp on November 15, 2019, 12:00:19 pm

Martin_fr --

This seems like exactly what I need.

After searching through all files on my Lazarus distribution (2.0.2, FPC 3.0.4), I couldn't find either SetSafeCPSwitching or SetUseACP. However, I did find them here (https://github.com/graemeg/freepascal/blob/master/packages/rtl-console/src/win/crt.pp) and a discussion here (https://bugs.freepascal.org/view.php?id=32558). The discussion at the last link covers the exact problem that I was ultimately trying to resolve (writing extended ASCII chars under Windows 10). I have not read through the entire discussion, but I see that it was resolved on 2017-12-14 and is not scheduled for release until v.3.2.0 (so it is not in the current release 3.0.4).

I'll try this revised crt.pp.

Title: Re: How to write all unicode chars?
Post by: Martin_fr on November 15, 2019, 02:31:11 pm

Quote from: dculp on November 15, 2019, 12:00:19 pm

After searching through all files on my Lazarus distribution (2.0.2, FPC 3.0.4), I couldn't find either SetSafeCPSwitching or SetUseACP.

They are part of fpc. If you used the official installer, then that is in a subfolder of your install.

Anyway, you can always use code navigation.
"uses crt;"
ctrl left mouse click on crt.

Title: Re: How to write all unicode chars?
Post by: lucamar on November 15, 2019, 02:46:32 pm

Quote from: Martin_fr on November 15, 2019, 02:31:11 pm

Quote from: dculp on November 15, 2019, 12:00:19 pm
After searching through all files on my Lazarus distribution (2.0.2, FPC 3.0.4), I couldn't find either SetSafeCPSwitching or SetUseACP.

They are part of fpc. If you used the official installer, then that is in a subfolder of your install.

Not in FPC 3.0.4; they'll be in 3.2

Title: Re: How to write all unicode chars?
Post by: dculp on November 15, 2019, 03:02:34 pm

Quote

They are part of fpc. If you used the official installer, then that is in a subfolder of your install.

Yes, I can find the crt.pp unit in my Lazarus folder path. However, my crt.pp unit has neither SetSafeCPSwitching nor SetUseACP (see short intro code comparisons below). I had downloaded Lazarus 2.0.2 (FPC 3.0.4; Windows 32 bits version) from the Lazarus website and then installed directly from the download.

My crt.pp --

Code: Pascal

unit crt;
 
interface
 
{$i crth.inc}
 
procedure Window32(X1,Y1,X2,Y2: DWord);
procedure GotoXY32(X,Y: DWord);
function WhereX32: DWord;
function WhereY32: DWord;
 
implementation
 
uses
  windows;
 

Updated crt.pp from https://github.com/graemeg/freepascal/blob/master/packages/rtl-console/src/win/crt.pp --

Code: Pascal

unit crt;
 
interface
 
{$i crth.inc}
 
procedure SetSafeCPSwitching(Switching:Boolean);
procedure SetUseACP(ACP:Boolean);
procedure Window32(X1,Y1,X2,Y2: DWord);
procedure GotoXY32(X,Y: DWord);
function WhereX32: DWord;
function WhereY32: DWord;
 
implementation
 
{$DEFINE FPC_CRT_CTRLC_TREATED_AS_KEY}
(* Treatment of Ctrl-C as a regular key ensured during initialization (SetupConsoleInput). *)
 
uses
  windows;
 

Title: Re: How to write all unicode chars?
Post by: dculp on November 15, 2019, 06:13:48 pm

Quote from: lucamar on November 15, 2019, 02:46:32 pm

Quote from: Martin_fr on November 15, 2019, 02:31:11 pm
Quote from: dculp on November 15, 2019, 12:00:19 pm
After searching through all files on my Lazarus distribution (2.0.2, FPC 3.0.4), I couldn't find either SetSafeCPSwitching or SetUseACP.

They are part of fpc. If you used the official installer, then that is in a subfolder of your install.

Not in FPC 3.0.4; they'll be in 3.2

Is there any way to use the new crt unit? I tried just copying it to D:\Lazarus_32bit_2.0.x\fpc\3.0.4\source\packages\rtl-console\src\win but compiling a simple test program (below) gave an error "Identifier not found" for both SetUseACP and SetSafeCPSwitching.

I then tried deleting crt.ppu in D:\Lazarus_32bit_2.0.x\fpc\3.0.4\units\i386-win32\rtl-console but then got the error "Write_ASCII_extended_chars_10a.pas(11,1) Fatal: Cannot find crt used by Write_ASCII_extended_chars_10a. Make sure all ppu files of a package are in its output directory. ppu in wrong directory=D:\Lazarus_32bit_2.0.x\fpc\3.0.4\units\i386-win32\rtl-console\crt.ppu..".

I had assumed that the required crt.ppu would be created during compilation of the test program but I couldn't find it anywhere.

How can I get the correct crt.ppu for the new crt.pp? Can it just be uploaded (if it exists) and copied to the required folder? What else would I need to do?

Any idea when FPC 3.2 will be released? (Since this issue with the extended ASCII chars was resolved almost two years ago, I had hoped that the fix would have been incorporated by now.)

Code: Pascal

program Write_ASCII_extended_chars_10a;
uses
   crt;
 
(* Also tried explicit path -
uses
   crt in 'D:\Lazarus_32bit_2.0.x\fpc\3.0.4\source\packages\rtl-console\src\win\';
*)
 
begin
clrscr; // just to test that the crt is actually being used
SetUseACP(true);
SetSafeCPSwitching(false);
end.
 

Title: Re: How to write all unicode chars?
Post by: Martin_fr on November 15, 2019, 06:49:52 pm

First of all, sorry I did not realize I was looking at my 3.0.2 install...

The proper way: install 3.2.

The "maybe works" way: copy the crt.pp file into your project.
However, if you use any other fpc-unit, or fpc/laz-package that uses crt, then that will not work.

I don't know if anything else in the rtl uses crt. If so, then you need to install 3.2. Or do a custom build of the entire 3.0.4

Title: Re: How to write all unicode chars?
Post by: dculp on November 15, 2019, 07:08:52 pm

Martin_fr --

I was able to compile by adding crt.pp to the Project Inspector files (see image) and then adding the path to crth.inc (required included file for crt.pp) in the project paths. This gave crt.ppu in subfolder lib\i386-win32. Perhaps this is what you have suggested.

Where can I find v.3.2 for Windows 32bit? I didn't see it on the Lazarus site. (Note - I don't know anything about compiling FPC if this is required.)

Title: Re: How to write all unicode chars?
Post by: Martin_fr on November 15, 2019, 09:42:03 pm

FPC 3.2 is still beta. It has not yet been released. And it is still changing.

A download of a recent beta build can be found here: https://sourceforge.net/projects/lazarus-snapshots/files/
Go for lazarus-2.0.6-62131-fpc-3.2.0-beta-43271 (43271 is the svn revision of fpc svn fixes branch)
You can install it as "secondary install" => check the checkbox, choose a new install dir, and for config specify an empty folder (can be a sub-folder of the new install dir)

But you should be fine with the copied file.
Path to inc is ok. Alternatively, copy the inc file too.

This works, because you deleted the original crt.ppu.
If you want to keep this, you can rename the copy (including the unit name in the source), and use the unit by its new name. (should work / not tested)