Author Topic: [Solved] Defining Unicode Character Constants  (Read 3976 times)

ArminLinder

  • Sr. Member
  • ****
  • Posts: 316
  • Keep it simple.
[Solved] Defining Unicode Character Constants
« on: January 07, 2024, 08:18:33 pm »
Hi all,

I programmed a little 3-2-1 countdown. To display encircled numbers I have used the Windows "Wingdings" font, which has encircled numbers 0 .. 9. From the Windows "Character Map" tool I get the character codes $80 ... $89. How do I use these to initialize a string which I can pass to GUI controls to display the characters?

Please see the attachment. As you can see in the screenshot, I did somehow succeed by copying the characters from the Character Map tool via the Windows clipboard. The resulting source code is not very readable. The characters work, but the cryptic string contents seem to drive the Lazarus editor nuts: if I go through the string using the cursor keys, the characters are "unsteady", meaning that the displayed string contents change unexpectedly depending on the cursor position. I am afraid that the Lazarus IDE also has problems displaying those characters properly, and there may be more problems coming, e.g. when using the JEDI code formatter. Better to use numbers to initialize them, I think.

Unfortunately, I could not figure out how to define character constants like this for multi-byte character formats. Shoving the character codes into WideChar or WideString (see the commented line below, I passed that to WideString) gives completely different characters in the result string.

Anyone who can help?

Thanks, Armin
« Last Edit: January 12, 2024, 11:54:39 am by ArminLinder »
Lazarus 3.3.2 on Windows 7,10,11, Debian 10.8 "Buster", macOS Catalina, macOS BigSur, VMWare Workstation 15, Raspberry Pi

Thaddy

  • Hero Member
  • *****
  • Posts: 16327
  • Censorship about opinions does not belong here.
Re: Defining Unicode Character Constant
« Reply #1 on: January 08, 2024, 10:31:55 am »
Since your problem is only with string literals, simply add {$CODEPAGE UTF8} near the top of your program file (lpr).
There is nothing wrong with being blunt. At a minimum it is also honest.

wp

  • Hero Member
  • *****
  • Posts: 12516
Re: Defining Unicode Character Constant
« Reply #2 on: January 08, 2024, 01:07:25 pm »
I could not find the WingDings font on my system, but found DingBats which contains circled numbers as well. I can confirm that the source editor seems to have issues with cursor placement when such "exotic" UTF-8 characters are used. This is what I do:
  • Declare an empty string constant with a descriptive name for the UTF-8 character, e.g. WHITE_ONE = '';
  • Move the cursor in the declaration between the two quotes
  • Open the Lazarus character map, search for the UTF-8 page containing the character
  • Click on the found character - this copies the character into the editor at the cursor position between the quotes.
  • I agree that at this moment only half of the character is displayed in the editor, but once I move the cursor away from the declaration, everything looks correct.
  • If it is impractical to use a constant for each character, I try to avoid editing the combined string. It is better to rewrite the string from the start.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11982
  • FPC developer.
Re: Defining Unicode Character Constant
« Reply #3 on: January 08, 2024, 01:26:08 pm »
There is an old bug report still open for this: https://gitlab.com/freepascal.org/lazarus/lazarus/-/issues/29071

Back then I was creating test programs for the unicode support of FPC 3.0 :-)

nanobit

  • Full Member
  • ***
  • Posts: 165
Re: Defining Unicode Character Constant
« Reply #4 on: January 08, 2024, 02:01:38 pm »
Quote from: ArminLinder
I programmed a little 3-2-1 countdown. To display encircled numbers I have used the Windows "Wingdings" font, which has encircled numbers 0 .. 9. From the Windows "Character Map" tool I get the character codes $80 ... $89. How do I use these to initialize a string which I can pass to GUI controls to display the characters?

Wingdings should be avoided, because it is not a Unicode-compliant font; it uses charset = SYMBOL_CHARSET.
For example, Wingdings will show character "J" as smiley.

There is a way to use Unicode text functions with Wingdings:
via the Private Use Area: $F000 + bytecode. But the main problem remains:
The same string will not work with a different font, because Private Use Area glyphs are specific to the font.
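
A minimal console sketch of the $F000 + bytecode rule. The program name and the helper SymbolToPua are hypothetical names made up for this illustration, not part of any library:

```pascal
program PuaDemo;

{$mode objfpc}{$H+}

// Hypothetical helper: maps a symbol-font byte code to the
// Private Use Area code point $F000 + byte.
function SymbolToPua(ByteCode: Byte): WideChar;
begin
  Result := WideChar($F000 + ByteCode);
end;

var
  S: UnicodeString;
  B: Byte;
begin
  S := '';
  // Wingdings circled digits 0..9 sit at byte codes $80..$89
  for B := $80 to $89 do
    S := S + SymbolToPua(B);

  WriteLn(Length(S));                     // number of UTF-16 code units: 10
  WriteLn(HexStr(Ord(S[1]), 4));          // first code unit: F080
  WriteLn(HexStr(Ord(S[Length(S)]), 4));  // last code unit: F089
end.
```

The resulting string only shows circled digits when it is rendered with a symbol font such as Wingdings; with any other font the same code points are undefined.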
« Last Edit: January 08, 2024, 03:52:03 pm by nanobit »

ArminLinder

  • Sr. Member
  • ****
  • Posts: 316
  • Keep it simple.
Re: Defining Unicode Character Constant
« Reply #5 on: January 11, 2024, 02:12:15 pm »
Thanks, guys, Thaddy and Nanobit did it - together. What I actually had to do is:

Code: Pascal
unit Unit1;

{$mode objfpc}{$H+}
{$CODEPAGE UTF8}

interface

...

const
  FontName1 = 'Courier';
  FontName2 = 'Wingdings';
  // First attempt: characters pasted from the Character Map tool
  // TestText  = '€‚ƒ„…†‡ˆ‰';
  TestText1  = 'äöüÄÖÜßabcdef';
  // Wingdings byte codes $80..$89 mapped into the Private Use Area ($F000 + byte)
  TestText2  = #$F080#$F081#$F082#$F083#$F084#$F085#$F086#$F087#$F088#$F089;

...

... and with some more small modifications in the OnClick handlers (Button1 displays TestText1 in "Courier", Button2 displays TestText2 in "Wingdings") I get a perfect result.

Doing only one of the two, $CODEPAGE or $F0.., did not help; I needed both.

Thanks a lot!

Armin.
« Last Edit: January 12, 2024, 09:57:07 am by ArminLinder »

ArminLinder

  • Sr. Member
  • ****
  • Posts: 316
  • Keep it simple.
Re: Defining Unicode Character Constant
« Reply #6 on: January 12, 2024, 11:54:00 am »
Perhaps someone else ends up here and finds these findings helpful.

Looking deeper into the whole typographic mess, I came to a point where I found, as Nanobit was pointing out, that "Wingdings" does indeed not support Unicode character mappings, but is a leftover from the pre-Unicode era adhering to Microsoft's proprietary "Windows" ("ANSI"?) standard.

If so, why then do the characters display properly if I use $codepage UTF-8? And by the way, isn't UTF-8 a character encoding scheme and not a character mapping table (code page)? I assumed that $codepage UTF-8 could be a misnomer, meaning "expect character literals to be specified using UTF-8 encoding", having nothing to do with code pages.

This would not explain, however, why the encircled number symbols magically appear in the Unicode "Private Use Area", which is reserved for "user-specific" font customizations. What technology maps glyphs taken from the Wingdings font into the "PUA"? And how can one know which characters are contained in the PUA?

I found this document: https://scripts.sil.org/cms/scripts/page.php?id=VendorUseOfPUA&site_id=nrsi#14086407

which explains: it is a convention Microsoft introduced to make Windows/ANSI symbol fonts (which do not map to any Unicode block) accessible via Unicode strings. Wingdings is not a Unicode font, so the Private Use Area contained in the Unicode "Basic" plane (plane $00) is exploited. Microsoft has chosen to start the mapping at $F000, so the high byte of the 16-bit character code is $F0. The low byte is the offset the character had in the original Windows/ANSI table, which is, for the encircled zero symbol, $80, as displayed by the Windows Character Map tool.

The complete code for the character "encircled 0" is therefore plane $00, high byte $F0, low byte $80, or, simplified, $F080 (U+F080), the constant given by Nanobit.

For comparison: according to the Unicode specs, the encircled numbers are found from code point U+2460, circled digit one (UTF-8: $E2$91$A0), for fonts which are natively "Unicode" compliant. There is no guarantee, however, that a Unicode-compliant font contains all of the zillions of characters Unicode supports, and browsing the fonts my Windows 11 has built in I found no font which supports the encircled characters natively. But my search was quite superficial; I could have missed one.
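
To cross-check a code point against its UTF-8 byte sequence, a small console sketch like the following can be used. It assumes nothing beyond the System unit's UTF8Encode; the program and variable names are made up for the example:

```pascal
program CircledDigitBytes;

{$mode objfpc}{$H+}

var
  U: UnicodeString;
  Utf8: UTF8String;
  I: Integer;
begin
  U := WideChar($2460);      // U+2460 CIRCLED DIGIT ONE
  Utf8 := UTF8Encode(U);     // convert UTF-16 to UTF-8

  // Print every byte of the UTF-8 sequence in hex
  for I := 1 to Length(Utf8) do
    Write('$', HexStr(Ord(Utf8[I]), 2), ' ');
  WriteLn;                   // prints: $E2 $91 $A0
end.
```

Replacing $2460 with $F080 shows the UTF-8 form of the Private Use Area code point used for Wingdings.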

A built-in cross-platform font would definitely be my favourite pick, if there was one. There are lists of cross-platform fonts available (e.g. https://en.wikipedia.org/wiki/Open-source_Unicode_typefaces) but none of these seems to be included in any Windows distribution, and I am reluctant to force my end-users to mess with font distribution. So I go with Microsoft's workaround for the moment; the program is not intended to work outside the Windows world.

What a mess: mission finally accomplished for just one character, sacrificing cross-platform support, and there are thousands more. Tools to come to the rescue? Neither the Windows Character Map nor the Lazarus built-in "insert character" facility was of much help with this issue. It took me some time to finally find a tool which is able to display the whole thing: BabelMap (https://www.babelstone.co.uk/Software/Download). See screenshot below.

Mission completed. May this save someone some time in the future.

Armin.

« Last Edit: January 12, 2024, 12:07:16 pm by ArminLinder »

Thaddy

  • Hero Member
  • *****
  • Posts: 16327
  • Censorship about opinions does not belong here.
Re: [Solved] Defining Unicode Character Constants
« Reply #7 on: January 12, 2024, 12:57:41 pm »
The wingdings font was created for, ahum, Windows 1! (although not yet as truetype font format. That came with Windows 3.1)
With it you are able to create GUI like interfaces that are actually still plain DOS... I played a lot with it, back in the days, although I never had Windows 1.
See the screenshot, that GUI like interface is completely done with wingdings.
You can also see wingdings is a monospace font.
Also note such interfaces are severely code page restricted to cp850/cp437.
But there is a rudimentary canvas.
« Last Edit: January 12, 2024, 01:51:21 pm by Thaddy »

Thaddy

  • Hero Member
  • *****
  • Posts: 16327
  • Censorship about opinions does not belong here.
Re: Defining Unicode Character Constant
« Reply #8 on: January 12, 2024, 01:56:53 pm »
Quote from: ArminLinder
Thanks, guys, Thaddy and Nanobit did it - together. What I actually had to do is:

No, you should have declared the consts as typed consts.

ArminLinder

  • Sr. Member
  • ****
  • Posts: 316
  • Keep it simple.
Re: [Solved] Defining Unicode Character Constants
« Reply #9 on: January 12, 2024, 02:39:28 pm »
Thaddy,

Better?

Code: Pascal
const
  FontName1 = 'Courier';
  FontName2 = 'Wingdings';
  TestText1  = 'äöüÄÖÜßabcdef';
  TestText2 : UnicodeString = #$F080#$F081#$F082#$F083#$F084#$F085#$F086#$F087#$F088#$F089;

If yes, OK. As I see it, writing the constant untyped, as I did, leaves it to FPC which string type it actually assumes. Depending on compiler switches it may be Unicode or something else.
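
A small console sketch of what the typed constant pins down. This is an illustration only, not the LCL code from the project above; the identifiers are made up. The typed constant is always a UnicodeString (UTF-16), and an explicit UTF8Encode yields the UTF-8 bytes that LCL controls work with:

```pascal
program TypedConstDemo;

{$mode objfpc}{$H+}

const
  // Typed constant: always UTF-16, independent of compiler switches
  CircledDigits: UnicodeString = #$F080#$F081#$F082;

var
  Utf8Text: UTF8String;
begin
  // Explicit conversion to UTF-8
  Utf8Text := UTF8Encode(CircledDigits);

  WriteLn(Length(CircledDigits));  // 3 UTF-16 code units
  WriteLn(Length(Utf8Text));       // 9 bytes, 3 per code point
end.
```

The length difference (3 versus 9) is a quick way to see which representation a given string variable actually holds.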

I have my reasons why I usually don't write typed string constants, but that's a discussion for another day.

Regarding the screenshot ... wow, I thought I had seen all Windows desktops right from the beginning ... what GUI has THAT been? I saw those line-style icons for the system menu before, but was it indeed in Windows? Looks a bit like a text-shell of some kind?

[Edit: it was actually Windows 1.x, which didn't make it across the ocean in those days, at least not onto my desk :-) My Windows life began with Windows 3.x, and till now I thought that Windows 3.0 had actually been the first version.]

Armin.
« Last Edit: January 12, 2024, 06:34:11 pm by ArminLinder »

PascalDragon

  • Hero Member
  • *****
  • Posts: 5796
  • Compiler Developer
Re: Defining Unicode Character Constant
« Reply #10 on: January 12, 2024, 05:44:47 pm »
Quote from: ArminLinder
If so, why then do the characters display properly if I use $codepage UTF-8? And by the way, isn't UTF-8 a character encoding scheme and not a character mapping table (code page)? I assumed that $codepage UTF-8 could be a misnomer, meaning "expect character literals to be specified using UTF-8 encoding", having nothing to do with code pages.

It's named "code page", because Windows also considers it a code page (namely CP_UTF8), though there it's considered a so called “Multi Byte Code Page”.

Fun fact: UTF-16 is also considered a code page in Windows (CP_UTF16 and CP_UTF16BE).
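
For reference, the numeric code page identifiers behind those names can be printed with a tiny program. This sketch assumes the CP_UTF8, CP_UTF16 and CP_UTF16BE constants as declared in FPC's System unit:

```pascal
program CodePageIds;

{$mode objfpc}{$H+}

begin
  // Code page identifiers as declared in the System unit
  WriteLn('CP_UTF8    = ', CP_UTF8);     // 65001
  WriteLn('CP_UTF16   = ', CP_UTF16);    // 1200
  WriteLn('CP_UTF16BE = ', CP_UTF16BE);  // 1201
end.
```

These are the same numbers Windows uses as code page identifiers for UTF-8, UTF-16 little endian and UTF-16 big endian.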

Thaddy

  • Hero Member
  • *****
  • Posts: 16327
  • Censorship about opinions does not belong here.
Re: [Solved] Defining Unicode Character Constants
« Reply #11 on: January 12, 2024, 06:11:41 pm »
Isn't CP_UTF16BE just UCS2? The precursor of Unicode?
Strictly two bytes and without expanding codepoints to max 4 bytes?
That is not Unicode as we know it.
« Last Edit: January 12, 2024, 06:14:56 pm by Thaddy »

Zoran

  • Hero Member
  • *****
  • Posts: 1886
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: [Solved] Defining Unicode Character Constants
« Reply #12 on: January 12, 2024, 11:51:38 pm »
Quote from: Thaddy
Isn't CP_UTF16BE just UCS2? The precursor of Unicode?
Strictly two bytes and without expanding codepoints to max 4 bytes?
That is not Unicode as we know it.

I don't think so. Like UCS2 and unlike UTF8, UTF16 still suffers from endianness ambiguity. Although in UTF16 a character can be represented with more than one word (byte pair), a word can be either big or little endian.

UTF8 doesn't have this problem, as a UTF-8 encoded string is an array of 8-bit bytes, whereas a UTF-16 encoded string is an array of 16-bit words.

So, there are still two variants of UTF16 encoding - big and little endian.
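
A minimal sketch of the two byte orders, using the code unit $F080 from earlier in this thread as the example value (the program name is made up):

```pascal
program Utf16ByteOrder;

{$mode objfpc}{$H+}

var
  W: Word;
begin
  W := $F080;  // one UTF-16 code unit

  // Little endian stores the low byte first, big endian the high byte first
  WriteLn('UTF-16LE: $', HexStr(Lo(W), 2), ' $', HexStr(Hi(W), 2));  // $80 $F0
  WriteLn('UTF-16BE: $', HexStr(Hi(W), 2), ' $', HexStr(Lo(W), 2));  // $F0 $80
end.
```

The same code point in UTF-8 is always the byte sequence $EF $82 $80, whatever the platform's endianness.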
« Last Edit: January 12, 2024, 11:56:56 pm by Zoran »

PascalDragon

  • Hero Member
  • *****
  • Posts: 5796
  • Compiler Developer
Re: [Solved] Defining Unicode Character Constants
« Reply #13 on: January 14, 2024, 07:59:38 pm »
Quote from: Thaddy
Isn't CP_UTF16BE just UCS2? The precursor of Unicode?
Strictly two bytes and without expanding codepoints to max 4 bytes?
That is not Unicode as we know it.

It changed from UCS2 to full UTF-16 once Windows gained full UTF-16 support with Windows 2000.
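
The practical difference between UCS2 and full UTF-16 is the surrogate pair: a code point above U+FFFF takes two 16-bit code units. A short sketch of the standard calculation, with U+1F600 picked as an arbitrary example value and all identifiers made up for the illustration:

```pascal
program SurrogateDemo;

{$mode objfpc}{$H+}

const
  CodePoint = $1F600;  // arbitrary example outside the Basic Multilingual Plane

var
  V: LongWord;
  HighSur, LowSur: Word;
begin
  V := CodePoint - $10000;                // 20-bit offset above the BMP
  HighSur := Word($D800 + (V shr 10));    // upper 10 bits
  LowSur  := Word($DC00 + (V and $3FF));  // lower 10 bits

  WriteLn(HexStr(HighSur, 4), ' ', HexStr(LowSur, 4));  // D83D DE00
end.
```

UCS2 has no such pairs, so it cannot represent this code point at all; code points inside the BMP, such as U+F080, look identical in both.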

 
