
Author Topic: How to write all unicode chars?  (Read 1061 times)

dculp

  • Full Member
  • ***
  • Posts: 111
How to write all unicode chars?
« on: November 13, 2019, 01:03:35 pm »
I'm trying to write all unicode chars in a particular character set. The following code works for the first 255 chars. After that it just starts recycling the same chars even though i continues to increase. I'm pretty sure the problem is with chr(), which expects a byte argument, but I'm not sure what to use instead.

Running under Windows 7+

Thanks,
Don C.

Code: Pascal
program Write_unicode_chars_20a;
{$mode objfpc}{$H+}{$apptype console}

uses
   crt, // for other purposes
   sysutils; // for inttohex

var
   i: integer;

begin
for i:= 1 to $2666 do
   begin
   writeln(i:5, inttohex(i,1):5, chr(i):5);
   if ((i mod 20) = 0) then readln; // pause
   end;
readln;
end.
« Last Edit: November 13, 2019, 01:45:49 pm by dculp »

Thaddy

  • Hero Member
  • *****
  • Posts: 9309
Re: How to write all unicode chars?
« Reply #1 on: November 13, 2019, 01:20:23 pm »
- chr() returns an AnsiChar, i.e. a single byte
- In UTF-16, a Unicode character is at a minimum 2 bytes wide and at a maximum 4 bytes
- You need a console that supports unicode
So basically fix the above.
also related to equus asinus.

dculp

  • Full Member
  • ***
  • Posts: 111
Re: How to write all unicode chars?
« Reply #2 on: November 13, 2019, 01:43:01 pm »
- You need a console that supports Unicode <== don't understand
So basically fix the above.

Suggestions on how to fix?


Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 5801
    • wiki
Re: How to write all unicode chars?
« Reply #3 on: November 13, 2019, 01:44:33 pm »
You can try widechar, and convert to utf8 (if you need utf8). 
Or you need to create the byte-sequences yourself.

Btw, your code works at best for the first 128 values (plain ASCII). After that it may print ansichars, but not unicode.
Unicode can be encoded as utf8, utf16 and others. For non-widestring types, it should be utf8.
In utf8, code point 128 (U+0080) is encoded as 2 bytes: char($c2) + char($80)

Also that will give you codepoints. It will not give you chars.
Unicode U+0308 https://www.fileformat.info/info/unicode/char/0308/index.htm is not a char.
It modifies the previous codepoint, and together they become a new char.
You can add many such modifiers, so there is no simple loop to print them all...

And other codepoints are surrogates, and can not be used on their own.
And there are gaps too. Not all numbers are defined. (but may get defined in future).

Also there are control chars.
https://www.fileformat.info/info/unicode/char/200f/index.htm
Will not print a char. But the rest of the line (in the same writeln statement) will be written in right to left direction. (if your console supports this)
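
A rough filter for the cases listed above could look like the following sketch. The helper name IsStandaloneCandidate and the exact ranges chosen are assumptions for illustration: it only covers the C0/C1 controls, the Combining Diacritical Marks block (U+0300..U+036F), a few bidi controls such as U+200F, and the surrogates. It is not exhaustive, and it cannot detect the gaps (unassigned code points), which would need the Unicode character database.

```pascal
program SkipNonPrintable;
{$mode objfpc}{$H+}

// Hypothetical helper: True if cp is a candidate for printing on its own.
// Not exhaustive: other combining marks and unassigned code points pass through.
function IsStandaloneCandidate(cp: LongWord): Boolean;
begin
  Result := True;
  if cp < $20 then Exit(False);                         // C0 control chars
  if (cp >= $7F) and (cp <= $9F) then Exit(False);      // DEL and C1 control chars
  if (cp >= $0300) and (cp <= $036F) then Exit(False);  // Combining Diacritical Marks block
  if (cp = $200E) or (cp = $200F) then Exit(False);     // left-to-right / right-to-left mark
  if (cp >= $202A) and (cp <= $202E) then Exit(False);  // bidi embedding/override controls
  if (cp >= $D800) and (cp <= $DFFF) then Exit(False);  // surrogates
  if cp > $10FFFF then Exit(False);                     // beyond the Unicode range
end;

var
  cp, kept: LongWord;

begin
  kept := 0;
  for cp := 1 to $2666 do
    if IsStandaloneCandidate(cp) then
      Inc(kept);
  writeln('kept ', kept, ' of ', $2666, ' code points');

  writeln(IsStandaloneCandidate($0308)); // FALSE: combining diaeresis
  writeln(IsStandaloneCandidate($200F)); // FALSE: right-to-left mark
  writeln(IsStandaloneCandidate($25BA)); // TRUE
end.
```

The program prints only ASCII, so it does not depend on the console's code page.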


Package LazUtils unit LazUtf8 has UnicodeToUTF8()
This will get you the codepoint as string.
You still need the terminal to support utf8.
And you still need to deal with combining/surrogates/gaps...
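
The widechar route mentioned at the top of this post can be sketched with the RTL function UTF8Encode. This is a minimal sketch: the helper name Utf8HexOf is made up for illustration, and it only handles single UTF-16 code units (no surrogate pairs). It prints the resulting bytes as hex rather than the character itself, so it works regardless of what the terminal supports.

```pascal
program Utf8EncodeDemo;
{$mode objfpc}{$H+}

uses
  sysutils; // for IntToHex

// Hypothetical helper: UTF-8 bytes of one BMP code point as a hex dump.
function Utf8HexOf(cp: Word): string;
var
  us: UnicodeString;
  u8: UTF8String;
  i, b: Integer;
begin
  us := WideChar(cp);    // one UTF-16 code unit
  u8 := UTF8Encode(us);  // RTL conversion from UTF-16 to UTF-8
  Result := '';
  for i := 1 to Length(u8) do
  begin
    b := Ord(u8[i]);     // the i-th encoded byte
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(b, 2);
  end;
end;

begin
  writeln('U+0080 -> ', Utf8HexOf($0080)); // C2 80
  writeln('U+25BA -> ', Utf8HexOf($25BA)); // E2 96 BA
end.
```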
« Last Edit: November 13, 2019, 01:55:01 pm by Martin_fr »

dculp

  • Full Member
  • ***
  • Posts: 111
Re: How to write all unicode chars?
« Reply #4 on: November 13, 2019, 02:04:02 pm »
OK, let me simplify. Let's say that I just want to write a single character -- e.g., a right-facing solid pointer (U+25BA according to Windows CharMap.exe for either Arial, Consolas, or Lucida Console fonts). How would I do this? (Short code would be helpful.)

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 5801
    • wiki
Re: How to write all unicode chars?
« Reply #5 on: November 13, 2019, 02:35:33 pm »
If you are on windows (tested win10) and if your font supports that char....

Code: Pascal
program Project1;
{$codepage utf8}
uses windows;
begin
  SetConsoleOutputCP(CP_UTF8); // switch the console output to UTF-8
  writeln(#$e2#$96#$ba);       // the three UTF-8 bytes of U+25BA
  readln;
end.

The byte sequence is here https://www.fileformat.info/info/unicode/char/25BA/index.htm

winni

  • Hero Member
  • *****
  • Posts: 609
Re: How to write all unicode chars?
« Reply #6 on: November 13, 2019, 06:21:40 pm »
Hi!

You can't show the UTF8-chars with a single linear loop like you do.

Because of the definition there are "holes" which are illegal values.

Have a look at  the bit pattern of the UTF8-Code and the related hex values:

Code: Text
(* UTF-8 bit distribution: code point bits -> encoded bytes =-=-=-=-=-=-=-=

    Scalar value (code point)     UTF-8 encoding
    (binary)                      First Byte  Second Byte Third Byte  Fourth Byte
A   0xxxxxxx                      0xxxxxxx
B   00000yyy yyxxxxxx             110yyyyy    10xxxxxx
C   zzzzyyyy yyxxxxxx             1110zzzz    10yyyyyy    10xxxxxx
D   000uuuuu zzzzyyyy yyxxxxxx    11110uuu    10uuzzzz    10yyyyyy    10xxxxxx

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*)

This means
* If you get a byte with a leading bit = 0, this is a one-byte UTF-8 code
* else: the leading bits of the first byte tell you the length of the code:
 - 110 = 2 bytes
 - 1110 = 3 bytes
 - 11110 = 4 bytes

So as you can see from the pattern above there  is never a leading zero bit at the second, third or fourth byte. A leading zero bit can only be at the start of a One-Byte-Char.

So you see: there are a lot of "holes" in the UTF-8 byte space, i.e. byte sequences that are not valid UTF-8.
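
Rows A to D of the bit table above translate fairly directly into shifts and masks. The following is a sketch under my own assumptions, not code from any library: EncodeUtf8, HexDump and TUtf8Bytes are hypothetical names, and the encoder returns an empty array for surrogates and for values above $10FFFF. The bytes go into a byte array and are printed as hex, so no string code page conversion gets in the way.

```pascal
program ManualUtf8;
{$mode objfpc}{$H+}

uses
  sysutils; // for IntToHex

type
  TUtf8Bytes = array of Byte;

// Hypothetical helper: encode one scalar value following rows A-D of the table.
function EncodeUtf8(cp: LongWord): TUtf8Bytes;
begin
  if cp <= $7F then                                  // row A: 1 byte
  begin
    SetLength(Result, 1);
    Result[0] := Byte(cp);
  end
  else if cp <= $7FF then                            // row B: 2 bytes
  begin
    SetLength(Result, 2);
    Result[0] := Byte($C0 or (cp shr 6));
    Result[1] := Byte($80 or (cp and $3F));
  end
  else if (cp >= $D800) and (cp <= $DFFF) then       // surrogates are not scalar values
    SetLength(Result, 0)
  else if cp <= $FFFF then                           // row C: 3 bytes
  begin
    SetLength(Result, 3);
    Result[0] := Byte($E0 or (cp shr 12));
    Result[1] := Byte($80 or ((cp shr 6) and $3F));
    Result[2] := Byte($80 or (cp and $3F));
  end
  else if cp <= $10FFFF then                         // row D: 4 bytes
  begin
    SetLength(Result, 4);
    Result[0] := Byte($F0 or (cp shr 18));
    Result[1] := Byte($80 or ((cp shr 12) and $3F));
    Result[2] := Byte($80 or ((cp shr 6) and $3F));
    Result[3] := Byte($80 or (cp and $3F));
  end
  else
    SetLength(Result, 0);                            // outside the Unicode range
end;

// Hypothetical helper: format the bytes as space-separated hex.
function HexDump(const b: TUtf8Bytes): string;
var
  i: Integer;
begin
  Result := '';
  for i := 0 to High(b) do
  begin
    if i > 0 then
      Result := Result + ' ';
    Result := Result + IntToHex(Integer(b[i]), 2);
  end;
end;

begin
  writeln('U+0041  -> ', HexDump(EncodeUtf8($41)));    // 41
  writeln('U+0080  -> ', HexDump(EncodeUtf8($80)));    // C2 80
  writeln('U+25BA  -> ', HexDump(EncodeUtf8($25BA)));  // E2 96 BA
  writeln('U+1F600 -> ', HexDump(EncodeUtf8($1F600))); // F0 9F 98 80
end.
```

Note that the loop variable in the original question runs over code points, not over bytes, so every value from 1 to $2666 does have a well-formed encoding here. The holes described above concern byte sequences.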

A page where all UTF8-Codeblocks are listed is:

https://www.utf8-chartable.de/unicode-utf8-table.pl

Don't overlook the combobox with all blocks - third row from top in the box.

Winni

« Last Edit: November 13, 2019, 06:47:53 pm by winni »

winni

  • Hero Member
  • *****
  • Posts: 609
Re: How to write all unicode chars?
« Reply #7 on: November 13, 2019, 10:29:45 pm »
Hi!

To avoid terminal trouble I made a small procedure to show "all" of the UTF8-chars between
1 and $2666.

For speed reasons - I think - the illegal characters are not filtered out by the Lazarus UTF8 routines; they show up as the substitute glyph, like
"  ".

Here we go:
Code: Pascal
uses ..... LazUTF8;

procedure TForm1.Button1Click(Sender: TObject);
var
  i: integer;
  hex: string;
  utf8, tmp: string;
  s: string = '';
begin
  for i := 1 to $2666 do
  begin
    utf8 := UnicodeToUTF8(i);   // code point -> UTF-8 encoded string
    hex := IntToHex(i, 4);
    tmp := IntToStr(i) + ' / ' + 'U+' + hex + ' / ' + utf8;
    s := s + tmp + LineEnding;
    if i mod 20 = 0 then
    begin
      ShowMessage(s);           // show one batch of 20 lines
      s := '';
    end; // mod
  end; // for
  if s <> '' then
    ShowMessage(s);             // show the last, partial batch
end;


The forum software is too clever and deletes the substitute glyph ...

Winni

dculp

  • Full Member
  • ***
  • Posts: 111
Re: How to write all unicode chars?
« Reply #8 on: November 13, 2019, 10:36:19 pm »
Code: Pascal
program Project1;
{$codepage utf8}
uses windows;
begin
  SetConsoleOutputCP(CP_UTF8); // switch the console output to UTF-8
  writeln(#$e2#$96#$ba);       // the three UTF-8 bytes of U+25BA
  readln;
end.

Under Windows 10 your code gives the desired right filled pointer for either Lucida Console or raster fonts.

For general purposes (outside of this demo) I need the crt unit. However, if crt is used then your code gives three individual characters (shown in the attached image). I assume that these are associated with the individual #$e2, #$96, and #$ba. I'm not sure why the writeln is "corrupted" by the crt since your code runs fine without the crt unit. In any case, is there a way to output this char to the screen without using a writeln? Other thoughts? (Note - console app)


Bart

  • Hero Member
  • *****
  • Posts: 3549
    • Bart en Mariska's Webstek
Re: How to write all unicode chars?
« Reply #9 on: November 13, 2019, 10:38:54 pm »
IIRC the crt unit resets the console codepage upon every write it does.
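
If that is the cause, one possible way around it on Windows is to skip writeln and the code page altogether and hand UTF-16 text straight to the console with the WinAPI call WriteConsoleW. This is only a sketch: the helper name ConsoleWriteW is made up, it is Windows-only, it has not been tested together with the crt unit, and WriteConsoleW fails when the output is redirected to a file. The font must still contain the glyph.

```pascal
program WriteConsoleDemo;
{$mode objfpc}{$H+}
{$apptype console}

uses
  windows;

// Hypothetical helper: write a UTF-16 string directly to the console,
// bypassing writeln and the console output code page.
procedure ConsoleWriteW(const s: UnicodeString);
var
  written: DWORD;
begin
  written := 0;
  if Length(s) > 0 then
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), PWideChar(s),
                  Length(s), written, nil);
end;

begin
  ConsoleWriteW(WideChar($25BA)); // U+25BA, the right-facing solid pointer
  ConsoleWriteW(#13#10);          // line break
  readln;
end.
```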

Bart

winni

  • Hero Member
  • *****
  • Posts: 609
Re: How to write all unicode chars?
« Reply #10 on: November 13, 2019, 11:28:06 pm »
Hi!

CRT and writeln are not a good team under these circumstances.

But

Code: Pascal
procedure ttySendStr(const s:string);

is just sending out the chars = bytes.

And when your whole action is done, send a

Code: Pascal
ttyFlushOutput;

to send the last chars from the buffer.

Winni

dculp

  • Full Member
  • ***
  • Posts: 111
Re: How to write all unicode chars?
« Reply #11 on: November 14, 2019, 06:41:47 pm »
winni --

Your suggested procedures are in the unix crt.pp. I'm running under Windows 7+. I couldn't find these procedures in the windows crt.pp. I searched windows crt.pp for something similar but didn't see anything, even among those procedures that are only in the implementation section. Perhaps I have overlooked something? Any suggestions?

winni

  • Hero Member
  • *****
  • Posts: 609
Re: How to write all unicode chars?
« Reply #12 on: November 14, 2019, 08:35:25 pm »
Hmm

Sorry, I did not realize that you need a Windows solution.

What is needed from the CRT unit??
Perhaps we can find a way to get around CRT.

Winni

dculp

  • Full Member
  • ***
  • Posts: 111
Re: How to write all unicode chars?
« Reply #13 on: November 15, 2019, 02:54:35 am »
Quote from: Bart
IIRC the crt unit resets the console codepage upon every write it does.

Bart

Is there any way to circumvent this? Then I could set my own codepages as needed (e.g., before and after writes). (I couldn't find anything when going through the crt.pp unit but perhaps I don't know what to look for.)


winni

  • Hero Member
  • *****
  • Posts: 609
Re: How to write all unicode chars?
« Reply #14 on: November 15, 2019, 03:34:43 am »
Hi!

That's the reason why I'm asking:

What do you need so urgently from CRT??
gotoxy, ClrScr or other procedures?

I want to find out how we can get around  the CRT unit.

Winni