Lazarus

Title: How to write all unicode chars?
Post by: dculp on November 13, 2019, 01:03:35 pm
I'm trying to write all unicode chars in a particular character set. The following code works for the first 255 chars. After that it just starts recycling the same chars even though i continues to increase. I'm pretty sure the problem is with chr(), which expects a byte argument, but I'm not sure what to use instead.

Running under Windows 7+

Thanks,
Don C.

Code: Pascal
program Write_unicode_chars_20a;
{$mode objfpc}{$H+}{$apptype console}

uses
   crt, // for other purposes
   sysutils; // for inttohex

var
   i: integer;

begin
for i:= 1 to $2666 do
   begin
   writeln(i:5, inttohex(i,1):5, chr(i):5);
   if ((i mod 20) = 0) then readln; // pause
   end;
readln;
end.
Title: Re: How to write all unicode chars?
Post by: Thaddy on November 13, 2019, 01:20:23 pm
- chr() returns an AnsiChar, which is a single byte
- A Unicode character in UTF-16 is at a minimum 2 bytes wide and at a maximum 4 bytes (a surrogate pair)
- You need a console that supports Unicode
So basically fix the above.
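
For illustration, here is a minimal sketch of why the original loop starts repeating itself. It assumes the behaviour described in the question: an AnsiChar holds one byte, so only the low byte of the loop counter reaches the screen. The program name is made up for this example.

```pascal
program ChrWrapSketch;
{$mode objfpc}{$H+}

var
  i: Integer;
begin
  // 65, 321 and 577 differ by multiples of 256, so they share the same low byte.
  i := 65;
  while i <= 577 do
  begin
    // (i and $FF) is the low byte of i, which is all a one-byte AnsiChar can hold.
    WriteLn(i:5, ' -> low byte ', (i and $FF):3, ' -> ', Chr(i and $FF));
    Inc(i, 256);
  end;
end.
```

All three lines end in the letter A, even though i keeps growing.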
Title: Re: How to write all unicode chars?
Post by: dculp on November 13, 2019, 01:43:01 pm
Quote from: Thaddy
- You need a console that supports Unicode <== don't understand
So basically fix the above.

Suggestions on how to fix?

Title: Re: How to write all unicode chars?
Post by: Martin_fr on November 13, 2019, 01:44:33 pm
You can try widechar, and convert to utf8 (if you need utf8). 
Or you need to create the byte-sequences yourself.

Btw, your code works at best for the first 128. After that it may print ansichars, but not unicode.
Unicode can be encoded in utf8, utf16 and others. For a non-widestring, it should be utf8.
In utf8 the unicode codepoint 128 is encoded as 2 bytes: char($c2) + char($80)

Also that will give you codepoints. It will not give you chars.
Unicode U+0308 https://www.fileformat.info/info/unicode/char/0308/index.htm is not a char.
It modifies the previous codepoint, and together they become a new char.
You can add many such modifiers, so there is no simple loop to print them all...

And other codepoints are surrogates, and can not be used on their own.
And there are gaps too. Not all numbers are defined. (but may get defined in future).

Also there are control chars.
https://www.fileformat.info/info/unicode/char/200f/index.htm
This one will not print a char, but the rest of the line (in the same writeln statement) will be written in right-to-left direction (if your console supports this).


Package LazUtils unit LazUtf8 has UnicodeToUTF8()
This will get you the codepoint as string.
You still need the terminal to support utf8.
And you still need to deal with combining/surrogates/gaps...
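
To make the combining/surrogate/control point concrete, here is a rough sketch of a filter that a print loop could call before writing a codepoint. IsStandaloneCandidate is a hypothetical helper, not part of LazUTF8, and it is far from complete: it only knows the few ranges mentioned above and cannot detect unassigned codepoints or combining marks outside the U+0300..U+036F block.

```pascal
program CodePointFilterSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: returns True when a codepoint is at least a
// plausible candidate for printing on its own.
function IsStandaloneCandidate(cp: Cardinal): Boolean;
begin
  Result := True;
  if cp < $20 then Exit(False);                        // C0 control chars
  if (cp >= $7F) and (cp <= $9F) then Exit(False);     // DEL and C1 control chars
  if (cp >= $0300) and (cp <= $036F) then Exit(False); // Combining Diacritical Marks block
  if (cp = $200E) or (cp = $200F) then Exit(False);    // left-to-right / right-to-left marks
  if (cp >= $D800) and (cp <= $DFFF) then Exit(False); // UTF-16 surrogates
end;

begin
  WriteLn('U+0041: ', IsStandaloneCandidate($0041));
  WriteLn('U+0308: ', IsStandaloneCandidate($0308));
  WriteLn('U+200F: ', IsStandaloneCandidate($200F));
  WriteLn('U+D800: ', IsStandaloneCandidate($D800));
  WriteLn('U+25BA: ', IsStandaloneCandidate($25BA));
end.
```

Only U+0041 and U+25BA pass; the other three are rejected. A real solution would need the Unicode character database rather than hard-coded ranges.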
Title: Re: How to write all unicode chars?
Post by: dculp on November 13, 2019, 02:04:02 pm
OK, let me simplify. Let's say that I just want to write a single character -- e.g., a right-facing solid pointer (U+25BA according to Windows CharMap.exe for either Arial, Consolas, or Lucida Console fonts). How would I do this? (Short code would be helpful.)
Title: Re: How to write all unicode chars?
Post by: Martin_fr on November 13, 2019, 02:35:33 pm
If you are on windows (tested win10) and if your font supports that char....

Code: Pascal
program Project1;
{$codepage utf8}
uses windows;
begin
  SetConsoleOutputCP(CP_UTF8);
  writeln(#$e2#$96#$ba);
  readln;
end.

The byte sequence is here https://www.fileformat.info/info/unicode/char/25BA/index.htm
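
The same three bytes can also be derived in code instead of looked up. A minimal sketch, assuming FPC 3.x and the RTL's UTF8Encode (the program name is made up); it prints the bytes as hex so the result does not depend on the console codepage:

```pascal
program Utf8EncodeSketch;
{$mode objfpc}{$H+}

uses
  sysutils; // for IntToHex

var
  u: UnicodeString;
  s: UTF8String;
  i: Integer;
begin
  u := WideChar($25BA);   // one UTF-16 code unit: U+25BA
  s := UTF8Encode(u);     // RTL conversion from UTF-16 to UTF-8
  for i := 1 to Length(s) do
    Write(IntToHex(Ord(s[i]), 2), ' ');
  WriteLn;
end.
```

This prints E2 96 BA, which matches the literal #$e2#$96#$ba used above.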
Title: Re: How to write all unicode chars?
Post by: winni on November 13, 2019, 06:21:40 pm
Hi!

You can't show the UTF8-chars with a single linear loop like you do.

Because of the definition there are "holes" which are illegal values.

Have a look at  the bit pattern of the UTF8-Code and the related hex values:

Code: Text
(* Conversion table: Unicode scalar value <-> UTF8 bytes =-=-=-=-=-=-=-=-=-=

  Scalar Value (bits)           UTF8 bytes (bits)
                                First Byte  Second Byte Third Byte  Fourth Byte
A 0xxxxxxx                      0xxxxxxx
B 00000yyy yyxxxxxx             110yyyyy    10xxxxxx
C zzzzyyyy yyxxxxxx             1110zzzz    10yyyyyy    10xxxxxx
D 000uuuuu zzzzyyyy yyxxxxxx    11110uuu    10uuzzzz    10yyyyyy    10xxxxxx

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*)

This means
* If you get a byte with a leading bit = 0, it is a one-byte UTF8 code
* else: the leading bits of the first byte tell you the length of the code:
 - 110 = 2 bytes
 - 1110 = 3 bytes
 - 11110 = 4 bytes

So as you can see from the pattern above, the second, third and fourth bytes never start with a zero bit; they always start with 10. A leading zero bit can only occur in a one-byte char.

So you see: there are a lot of "holes" in the UTF8 space, i.e. byte sequences that are not valid.
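
Rows A to D of the table translate fairly directly into code. The following is a sketch only: EncodeUtf8 and HexBytes are hypothetical helpers written for this example, not Lazarus or FPC routines. The output is shown as hex so it does not depend on the console.

```pascal
program Utf8TableSketch;
{$mode objfpc}{$H+}

uses
  sysutils; // for IntToHex

// Hypothetical helper: encodes one Unicode scalar value as UTF8 bytes.
// Returns '' for surrogates and for values above U+10FFFF.
function EncodeUtf8(cp: Cardinal): string;
begin
  Result := '';
  if (cp >= $D800) and (cp <= $DFFF) then
    Exit;                                  // surrogates are not scalar values
  if cp <= $7F then                        // row A: 0xxxxxxx
    Result := Chr(cp)
  else if cp <= $7FF then                  // row B: 110yyyyy 10xxxxxx
    Result := Chr($C0 or (cp shr 6)) +
              Chr($80 or (cp and $3F))
  else if cp <= $FFFF then                 // row C: 1110zzzz 10yyyyyy 10xxxxxx
    Result := Chr($E0 or (cp shr 12)) +
              Chr($80 or ((cp shr 6) and $3F)) +
              Chr($80 or (cp and $3F))
  else if cp <= $10FFFF then               // row D: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    Result := Chr($F0 or (cp shr 18)) +
              Chr($80 or ((cp shr 12) and $3F)) +
              Chr($80 or ((cp shr 6) and $3F)) +
              Chr($80 or (cp and $3F));
end;

// Hypothetical helper: formats the bytes of s as two-digit hex values.
function HexBytes(const s: string): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
    Result := Result + IntToHex(Ord(s[i]), 2) + ' ';
end;

begin
  WriteLn('U+0041  -> ', HexBytes(EncodeUtf8($0041)));  // row A
  WriteLn('U+0080  -> ', HexBytes(EncodeUtf8($0080)));  // row B
  WriteLn('U+25BA  -> ', HexBytes(EncodeUtf8($25BA)));  // row C
  WriteLn('U+1F600 -> ', HexBytes(EncodeUtf8($1F600))); // row D
end.
```

The four lines show 41, C2 80, E2 96 BA and F0 9F 98 80.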

A page where all UTF8-Codeblocks are listed is:

https://www.utf8-chartable.de/unicode-utf8-table.pl

Don't overlook the combobox with all blocks - third row from top in the box.

Winni

Title: Re: How to write all unicode chars?
Post by: winni on November 13, 2019, 10:29:45 pm
Hi!

To avoid terminal trouble I made a small procedure to show "all" of the UTF8-chars between 1 and $2666.

For speed reasons, I think, the Lazarus UTF8 routines do not filter out the illegal characters; they show up as the substitute glyph, like
"  ".

Here we go:
Code: Pascal
uses ..... LazUTF8;

procedure TForm1.Button1Click(Sender: TObject);
var
  i: integer;
  hex: string;
  utf8, tmp: string;
  s: string = '';
begin
  for i := 1 to $2666 do
  begin
    utf8 := UnicodeToUTF8(i);   // codepoint -> UTF8 encoded string
    hex := IntToHex(i, 4);
    tmp := IntToStr(i) + ' / ' + 'U+' + hex + ' / ' + utf8;
    s := s + tmp + LineEnding;
    if i mod 20 = 0 then        // show 20 lines per message box
    begin
      ShowMessage(s);
      s := '';
    end; // mod
  end; // for
  if s <> '' then               // show the last, partial batch
    ShowMessage(s);
end;

The forum software is too clever and deletes the substitute glyph ...

Winni
Title: Re: How to write all unicode chars?
Post by: dculp on November 13, 2019, 10:36:19 pm
Quote from: Martin_fr
Code: Pascal
program Project1;
{$codepage utf8}
uses windows;
begin
  SetConsoleOutputCP(CP_UTF8);
  writeln(#$e2#$96#$ba);
  readln;
end.

Under Windows 10 your code gives the desired right filled pointer for either Lucida Console or raster fonts.

For general purposes (outside of this demo) I need the crt unit. However, if crt is used then your code gives three individual characters (shown in the attached image). I assume that these are associated with the individual #$e2, #$96, and #$ba. I'm not sure why the writeln is "corrupted" by the crt since your code runs fine without the crt unit. In any case, is there a way to output this char to the screen without using a writeln? Other thoughts? (Note - console app)

Title: Re: How to write all unicode chars?
Post by: Bart on November 13, 2019, 10:38:54 pm
IIRC the crt unit resets the console codepage upon every write it does.

Bart
Title: Re: How to write all unicode chars?
Post by: winni on November 13, 2019, 11:28:06 pm
Hi!

CRT and writeln are not a good team under these circumstances.

But

Code: Pascal
procedure ttySendStr(const s:string);

is just sending out the chars = bytes.

And when your whole action is done, then send a

Code: Pascal
ttyFlushOutput;

to send the last chars from the buffer.

Winni
Title: Re: How to write all unicode chars?
Post by: dculp on November 14, 2019, 06:41:47 pm
winni --

Your suggested procedures are in the unix crt.pp. I'm running under Windows 7+. I couldn't find these procedures in the windows crt.pp. I searched the windows crt.pp for something similar but didn't see anything, even among those procedures that are only in the implementation section. Perhaps I have overlooked something? Any suggestions?
Title: Re: How to write all unicode chars?
Post by: winni on November 14, 2019, 08:35:25 pm
Hmm

Sorry, I did not realize that you need a win solution.

What is needed from the CRT unit??
Perhaps we can find a way to get around CRT.

Winni
Title: Re: How to write all unicode chars?
Post by: dculp on November 15, 2019, 02:54:35 am
Quote from: Bart
IIRC the crt unit resets the console codepage upon every write it does.

Bart

Is there any way to circumvent this? Then I could set my own codepages as needed (e.g., before and after writes). (I couldn't find anything when going through the crt.pp unit but perhaps I don't know what to look for.)

Title: Re: How to write all unicode chars?
Post by: winni on November 15, 2019, 03:34:43 am
Hi!

That's the reason why I'm asking:

What do you need so urgently from CRT??
gotoxy, ClrScr or other procedures?

I want to find out how we can get around  the CRT unit.

Winni
Title: Re: How to write all unicode chars?
Post by: Martin_fr on November 15, 2019, 04:15:38 am
Maybe a look at the beginning of the crt unit can help ....
Code: Pascal
procedure SetSafeCPSwitching(Switching:Boolean);
procedure SetUseACP(ACP:Boolean);

Code: Pascal
    UseACP        : Boolean; (* True means using active process codepage for
                                console output, False means use the original
                                setting (usually OEM codepage). *)
    SafeCPSwitching : Boolean; (* True in combination with UseACP means that
                                  the console codepage will be set on every
                                  output, False means that the console codepage
                                  will only be set on Initialization and
                                  Finalization *)

I haven't done any tests, but those global vars sound like they are all you need. Though no idea if they have side effects.
Title: Re: How to write all unicode chars?
Post by: lucamar on November 15, 2019, 11:45:00 am
Quote from: Bart
IIRC the crt unit resets the console codepage upon every write it does.

Quote from: dculp
Is there any way to circumvent this? Then I could set my own codepages as needed (e.g., before and after writes).

This bit of code should do the trick:

Code: Pascal
Assign(Input, ''); Reset(Input);
Assign(Output, ''); Rewrite(Output);

Any Read/Write from/to console should then go through the "normal" channels rather than CRT's replacements.

Note, though, that I've not tested it.
Title: Re: How to write all unicode chars?
Post by: dculp on November 15, 2019, 12:00:19 pm
Martin_fr --

This seems like exactly what I need.

After searching through all files on my Lazarus distribution (2.0.2, FPC 3.0.4), I couldn't find either SetSafeCPSwitching or SetUseACP. However, I did find them here (https://github.com/graemeg/freepascal/blob/master/packages/rtl-console/src/win/crt.pp) and a discussion here (https://bugs.freepascal.org/view.php?id=32558). The discussion from the last link is the exact problem that I was ultimately trying to resolve (writing extended ASCII chars under Windows 10). I have not read through the entire discussion but I see that it was resolved on 2017-12-14 but not scheduled for release until v.3.2.0 (so not in the current release 3.0.4).

I'll try this revised crt.pp.
Title: Re: How to write all unicode chars?
Post by: Martin_fr on November 15, 2019, 02:31:11 pm
Quote from: dculp
After searching through all files on my Lazarus distribution (2.0.2, FPC 3.0.4), I couldn't find either SetSafeCPSwitching or SetUseACP.

They are part of fpc. If you used the official installer, then that is in a subfolder of your install.

Anyway, you can always use code navigation.
"uses crt;"
ctrl left mouse click on crt.
Title: Re: How to write all unicode chars?
Post by: lucamar on November 15, 2019, 02:46:32 pm
Quote from: dculp
After searching through all files on my Lazarus distribution (2.0.2, FPC 3.0.4), I couldn't find either SetSafeCPSwitching or SetUseACP.

Quote from: Martin_fr
They are part of fpc. If you used the official installer, then that is in a subfolder of your install.

Not in FP 3.0.4; they'll be in 3.2
Title: Re: How to write all unicode chars?
Post by: dculp on November 15, 2019, 03:02:34 pm
Quote
They are part of fpc. If you used the official installer, then that is in a subfolder of your install.

Yes, I can find the crt.pp unit in my Lazarus folder path. However, my crt.pp unit has neither SetSafeCPSwitching nor SetUseACP (see short intro code comparisons below). I had downloaded Lazarus 2.0.2 (FPC 3.0.4; Windows 32 bits version) from the Lazarus website and then installed directly from the download.

My crt.pp --
Code: Pascal
unit crt;

interface

{$i crth.inc}

procedure Window32(X1,Y1,X2,Y2: DWord);
procedure GotoXY32(X,Y: DWord);
function WhereX32: DWord;
function WhereY32: DWord;

implementation

uses
  windows;

Updated crt.pp from https://github.com/graemeg/freepascal/blob/master/packages/rtl-console/src/win/crt.pp --
Code: Pascal
unit crt;

interface

{$i crth.inc}

procedure SetSafeCPSwitching(Switching:Boolean);
procedure SetUseACP(ACP:Boolean);
procedure Window32(X1,Y1,X2,Y2: DWord);
procedure GotoXY32(X,Y: DWord);
function WhereX32: DWord;
function WhereY32: DWord;

implementation

{$DEFINE FPC_CRT_CTRLC_TREATED_AS_KEY}
(* Treatment of Ctrl-C as a regular key ensured during initialization (SetupConsoleInput). *)

uses
  windows;

Title: Re: How to write all unicode chars?
Post by: dculp on November 15, 2019, 06:13:48 pm
Quote from: dculp
After searching through all files on my Lazarus distribution (2.0.2, FPC 3.0.4), I couldn't find either SetSafeCPSwitching or SetUseACP.

Quote from: Martin_fr
They are part of fpc. If you used the official installer, then that is in a subfolder of your install.

Quote from: lucamar
Not in FP 3.0.4; they'll be in 3.2

Is there any way to use the new crt unit? I tried just copying it to D:\Lazarus_32bit_2.0.x\fpc\3.0.4\source\packages\rtl-console\src\win but compiling a simple test program (below) gave an error "Identifier not found" for both SetUseACP and SetSafeCPSwitching.

I then tried deleting crt.ppu in D:\Lazarus_32bit_2.0.x\fpc\3.0.4\units\i386-win32\rtl-console but then got the error "Write_ASCII_extended_chars_10a.pas(11,1) Fatal: Cannot find crt used by Write_ASCII_extended_chars_10a. Make sure all ppu files of a package are in its output directory. ppu in wrong directory=D:\Lazarus_32bit_2.0.x\fpc\3.0.4\units\i386-win32\rtl-console\crt.ppu..".

I had assumed that the required crt.ppu would be created during compilation of the test program but I couldn't find it anywhere.

How can I get the correct crt.ppu for the new crt.pp? Can it just be uploaded (if it exists) and copied to the required folder? What else would I need to do?

Any idea on when FP 3.2 will be released? (Since this issue with the extended ASCII chars was resolved almost two years ago, I might have hoped that the fix would have been incorporated by now.)

Code: Pascal
program Write_ASCII_extended_chars_10a;
uses
   crt;

(* Also tried explicit path -
uses
   crt in 'D:\Lazarus_32bit_2.0.x\fpc\3.0.4\source\packages\rtl-console\src\win\';
*)

begin
clrscr; // just to test that the crt is actually being used
SetUseACP(true);
SetSafeCPSwitching(false);
end.

Title: Re: How to write all unicode chars?
Post by: Martin_fr on November 15, 2019, 06:49:52 pm
First of all, sorry I did not realize I was looking at my 3.0.2 install...

The proper way: install 3.2.

The "maybe works" way: copy the crt.pp file into your project.
However, if you use any other fpc unit, or fpc/laz package, that uses crt, then that will not work.

I don't know if anything else in the rtl uses crt. If so, then you need to install 3.2, or do a custom build of the entire 3.0.4.


Title: Re: How to write all unicode chars?
Post by: dculp on November 15, 2019, 07:08:52 pm
Martin_fr --

I was able to compile by adding crt.pp to the Project Inspector files (see image) and then adding the path to crth.inc (required included file for crt.pp) in the project paths. This gave crt.ppu in subfolder lib\i386-win32. Perhaps this is what you have suggested.

Where can I find v.3.2 for Windows 32bit? I didn't see it on the Lazarus site. (Note - I don't know anything about compiling FPC if this is required.)
Title: Re: How to write all unicode chars?
Post by: Martin_fr on November 15, 2019, 09:42:03 pm
FPC 3.2 is still beta. It has not yet been released. And it is still changing.

A download of a recent beta build can be found here: https://sourceforge.net/projects/lazarus-snapshots/files/
Go for  lazarus-2.0.6-62131-fpc-3.2.0-beta-43271  (43271 is the svn revision of fpc svn fixes branch)
You can install it as "secondary install" => check the checkbox, choose a new install dir,  and for config specify an empty folder (can be a sub-folder of the new install dir)



But you should be fine with the copied file.
The path to the inc is OK. Alternatively, copy the inc too.

This works, because you deleted the original crt.ppu.
If you want to keep this, you can rename the copy (including the unit name in the source), and use the unit by its new name. (should work / not tested)