Lazarus

Programming => General => Topic started by: Grahame Grieve on September 17, 2020, 02:28:57 pm

Title: Unicode Constants
Post by: Grahame Grieve on September 17, 2020, 02:28:57 pm
This line of code in delphi:

    dict.add('&ngE;', #$2267#$0338);

Adds to dict, which is a TDictionary<String, String>, the string pair &ngE; and '≧̸'.

However compiling the same code in $mode delphi using FPC results in adding the string pair &ngE; and '??'.  But this works for other unicode characters like:

    dict.add('&ne;', #$2260);

which is '≠' in both delphi and FPC.

A bonus question: I'm somewhat confused by this. My code is behaving as if the mode is delphiunicode, but it's only set to delphi. The project options syntax mode default is ObjFPC, so that's not it, and Use AnsiStrings is on. I suppose that's wrong, but why is simple unicode working? The documentation is confusing on this. Also, if my strings are unicode, what's a char? Is that unicode too?
Title: Re: Unicode Constants
Post by: lucamar on September 17, 2020, 02:35:53 pm
I'm not sure (and might be completely wrong) but: Can it be that the compiler is converting your "unicode" chars to UTF-8 and while it works OK for single chars it fails for composed ones (though it shouldn't ...)?
Title: Re: Unicode Constants
Post by: Grahame Grieve on September 17, 2020, 02:40:31 pm
Well, presumably. I was kind of hoping someone who understands this could tell me how to resolve this one.
Title: Re: Unicode Constants
Post by: Martin_fr on September 17, 2020, 03:21:15 pm
Conversion depends on the target: AnsiString or Utf8String.

AnsiString cannot hold all the Unicode chars, so some chars will fail while others will work.
UnicodeString should work.

I do not know how the param to "add" is declared. I also do not know if the constant will be converted directly to that param's type, or go via the default code page (which you can set somehow).

If it is declared as Utf8String, then maybe Utf8String(#$2267#$0338)

Or you can either insert calls to Utf16ToUtf8 or specify the utf8 directly

https://www.fileformat.info/info/unicode/char/2267/index.htm
https://www.fileformat.info/info/unicode/char/0338/index.htm

#$E2#$89#$A7 + #$CC#$B8
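
The byte values above can be checked with a short console program. This is only a minimal sketch and not part of the original post: the program and variable names are made up, and it assumes FPC 3.x, where UTF8Encode takes a UnicodeString and returns a RawByteString.

```pascal
program Utf8Bytes;
{$mode objfpc}{$H+}
uses
  SysUtils;  // IntToHex
var
  wide: UnicodeString;
  bytes: RawByteString;
  i: Integer;
begin
  // Assigning to a UnicodeString keeps the constant as UTF-16
  wide := #$2267#$0338;
  // Explicit UTF-16 -> UTF-8 conversion done by the RTL
  bytes := UTF8Encode(wide);
  // Print every byte of the UTF-8 result in hex
  for i := 1 to Length(bytes) do
    Write(IntToHex(Ord(bytes[i]), 2), ' ');
  WriteLn;  // should list E2 89 A7 CC B8
end.
```

Because the conversion is explicit, the result should not depend on the code page of the source file or of the system.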
Title: Re: Unicode Constants
Post by: Thaddy on September 17, 2020, 04:45:06 pm
> However compiling the same code in $mode delphi using FPC

The correct mode is {$mode delphiunicode}, which is the same as Delphi's 16 bit unicode.
Note that it is not very well supported by Lazarus (yet), since Lazarus is UTF8.
But in the correct mode the strings should be assignment compatible to a large extent.
Title: Re: Unicode Constants
Post by: Grahame Grieve on September 17, 2020, 10:08:10 pm
umm, I'm having trouble understanding this. I think this means that in 2020 there's still no way to actually write source that is unicode capable and consistent between FPC and delphi?

Because {$mode delphiunicode} means that my strings are not compatible with any system libraries.

Or have I misunderstood?
Title: Re: Unicode Constants
Post by: Bart on September 17, 2020, 10:20:00 pm
TDictionary<String, String> in Delphi is equivalent to TDictionary<UnicodeString, UnicodeString> in fpc?
(TDictionary<String, String> in fpc means TDictionary<AnsiString, AnsiString>)
Then define the constants explicitly as UnicodeString.

Just an untested suggestion.

Bart
Title: Re: Unicode Constants
Post by: Grahame Grieve on September 18, 2020, 01:20:11 am
Yes, well, declaring an intermediate parameter of UnicodeString did solve the problem. Then you can assign to String anyway. So I'm not convinced that it's not a bug, but the String situation is so messy I don't really know.
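
A minimal sketch of that workaround as a console program, with assumptions stated: the names `wide` and `narrow` are hypothetical, `narrow` stands in for the String parameter of dict.add, and the call to SetMultiByteConversionCodePage imitates what Lazarus is described as doing later in the thread (run-time conversion to UTF-8). On Unix the cwstring unit is pulled in so that run-time conversions have a widestring manager.

```pascal
program UnicodeConstWorkaround;
{$mode delphi}
uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager for run-time conversions
  SysUtils;                       // IntToHex
var
  wide: UnicodeString;
  narrow: String;                 // AnsiString in {$mode delphi}
  i: Integer;
begin
  // Assumption: make run-time conversions target UTF-8, as in a Lazarus application
  SetMultiByteConversionCodePage(CP_UTF8);
  // Intermediate UnicodeString: the constant is stored as UTF-16
  wide := #$2267#$0338;
  // The conversion to String now happens at run time
  narrow := wide;
  // Dump the bytes so the result can be compared with the UTF-8 bytes given earlier in the thread
  for i := 1 to Length(narrow) do
    Write(IntToHex(Ord(narrow[i]), 2), ' ');
  WriteLn;
end.
```

The compiler is likely to warn about the implicit conversion from UnicodeString to AnsiString; that conversion is the point of the sketch.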

There's a bug in the base FPC Json classes:

    {
      "a": "\u2267\u0338\n"
    }

will be read as something other than ≧̸, but the debugger support for unicode is so poor, and the FPCunit GUI crashes when I try to copy, that I can't figure out what it actually reads it as.
Title: Re: Unicode Constants
Post by: Martin_fr on September 18, 2020, 02:25:14 am
You can try to set the watch to "memory dump" (does NOT always work).
Add a string (not shortstring) as watch with typecast: ^byte(somestring)^
then go to the watch properties and select "memory dump"

For shortstring it is
  ^byte(@somestring[1])^


For WideString use ^word(somewidestring)^


In "FpDebug" you also need to set "repeat count"
Title: Re: Unicode Constants
Post by: Martin_fr on September 18, 2020, 02:28:40 am
> FPCunit GUI crashes when I try to copy,

Use the right mouse button, and copy from the context menu.
Title: Re: Unicode Constants
Post by: jamie on September 18, 2020, 03:31:51 am
    procedure TForm1.Button1Click(Sender: TObject);
    begin
      Canvas.Font.Size := 30;
      // JsonStringToString (unit fpjson) decodes the \uXXXX escapes
      Canvas.TextOut(0,0,JsonStringToString('\u2267 '+#9+'\u0338'));
    end;

This works. There are some strange happenings when the two of those are side by side, and it's not the JSON string function doing it.

 Actually, using an alternate first Unicode char with the u0338 still produces the error.

 So inserting the space and then a back space displays the correct output. You can also do this with the wide string functions directly, using Windows.TextoutW(Canvas.Handle,0,0,wideString(….')..), and it still displays the error.

 So I can't say where the real issue is here. It would appear that the chars need to be printed in singles.
Title: Re: Unicode Constants
Post by: jamie on September 18, 2020, 03:38:34 am
Here is an example of printing it in singles.

    procedure TForm1.Button1Click(Sender: TObject);
    begin
      Canvas.Font.Size := 30;
      // Draw the base character first
      Canvas.TextOut(0,0,JsonStringToString('\u2267'));
      // Then draw the combining character at the current pen position
      Canvas.TextOut(Canvas.PenPos.X,Canvas.PenPos.Y,JsonStringToString('\u0338'));
    end;

That also produces nice output.
Title: Re: Unicode Constants
Post by: kupferstecher on September 18, 2020, 12:22:58 pm
> The project options syntax mode default is ObjFPC, so that's not it, and Use AnsiStrings is on. I suppose that's wrong, but why is simple unicode working?

I'm not sure if this is clear or not: AnsiString doesn't mean the string is limited to the ANSI characters; it can also contain Unicode in the form of UTF-8. You shouldn't need WideString/UnicodeString or anything else to use Unicode characters, unless you work with libraries that use UTF-16.
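
A small sketch of that point, using made-up names and the UTF-8 byte values Martin_fr listed earlier in the thread: an AnsiString is just a sequence of bytes, so it can carry the UTF-8 encoding of both code points, and Length counts bytes rather than characters.

```pascal
program AnsiHoldsUtf8;
{$mode objfpc}{$H+}
var
  s: AnsiString;
  w: UnicodeString;
begin
  // UTF-8 bytes of U+2267 followed by U+0338, written out byte by byte
  s := #$E2#$89#$A7#$CC#$B8;
  WriteLn(Length(s));   // 5: Length of an AnsiString counts bytes
  // Decoding the same bytes as UTF-8 yields two UTF-16 code units
  w := UTF8Decode(s);
  WriteLn(Length(w));   // 2: one per code point here
end.
```

Both code points still form a single composed character on screen, so neither number matches what a user would count as "one character".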

> umm, I'm having trouble understanding this. I think this means that in 2020 there's still no way to actually write source that is unicode capable and consistent between FPC and delphi?

Why do you expect 100% Delphi-compatibility?

This works for me:

    Label1.Caption := UTF8Encode(#$2267) + UTF8Encode(#$0338);

But I have to change the label's font to a unicode one (e.g. "Arial Unicode MS"), because of the composed character.

> There are some strange happenings when the two of those are side by side, and it's not the JSON string function doing it.
>
> Actually, using an alternate first Unicode char with the u0338 still produces the error.

As I understand it, it's a composed character. The #$0338 is combined with #$2267 into one character on the display.
Title: Re: Unicode Constants
Post by: nanobit on September 18, 2020, 01:04:57 pm
> This line of code in delphi:
>
>     dict.add('&ngE;', #$2267#$0338);

If you have unicode constants in your source, you should declare {$codepage utf8} in your unit which helps with resolution at compile time. Your constant of widechars should ideally work without this, thus a bug report is appropriate.
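
A hedged sketch of how the directive could be tried out. The program name and variables are illustrative, and the effect on the two-character constant is taken from the explanations in this thread rather than verified here, so the program only dumps the bytes for inspection instead of asserting a result.

```pascal
program CodePageDirective;
{$mode delphi}
{$codepage utf8}   // declares the encoding of this source file as UTF-8
uses
  SysUtils;        // IntToHex
var
  s: String;       // AnsiString in {$mode delphi}
  i: Integer;
begin
  // Two-character constant assigned to an AnsiString
  s := #$2267#$0338;
  // Dump the resulting bytes; two '?' characters would show up as 3F 3F
  for i := 1 to Length(s) do
    Write(IntToHex(Ord(s[i]), 2), ' ');
  WriteLn;
end.
```

Removing the {$codepage utf8} line and recompiling gives a direct comparison of the two behaviours.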
Title: Re: Unicode Constants
Post by: PascalDragon on September 18, 2020, 03:04:13 pm
> This line of code in delphi:
>
>     dict.add('&ngE;', #$2267#$0338);
>
> Adds to dict, which is a TDictionary<String, String>, the string pair &ngE; and '≧̸'.
>
> However compiling the same code in $mode delphi using FPC results in adding the string pair &ngE; and '??'.  But this works for other unicode characters like:
>
>     dict.add('&ne;', #$2260);
>
> which is '≠' in both delphi and FPC.

The difference is that in the case of #$2267#$0338 the compiler does a compile time conversion to an AnsiString where the used encoding will be the encoding of the file (by default CP 1252, you can change this with the $CodePage directive). For the single character FPC will do a runtime conversion whereby the selected multibyte conversion codepage influences the result (in case of Lazarus that will be UTF-8).

If I remember correctly this is indeed how this behaves in newer versions of Delphi if the left side is indeed an AnsiString. As for Delphi String = UnicodeString you won't notice this, cause there won't be any conversion necessary.
Title: Re: Unicode Constants
Post by: Martin_fr on September 18, 2020, 03:18:42 pm
> The difference is that in the case of #$2267#$0338 the compiler does a compile time conversion
>
> For the single character FPC will do a runtime conversion

Why the different behaviour? Both are constant values? Both could be done at compile time?

Is this only because "What Delphi does"? How about mode ObjFpc?

Moreover (concluding from compiler warnings), if a char-constant is part of a constant expression that results in a string, then it appears to be converted at runtime?

    u := #$2267 + '';
    u := #$2267 + ansistring('');
    u := #$2267 + rawbytestring('');
    u := #$2267 + char(' ');

or is the added string/char first converted to widechar/string, then added, and the result converted back?
Title: Re: Unicode Constants
Post by: PascalDragon on September 18, 2020, 04:56:15 pm
>> The difference is that in the case of #$2267#$0338 the compiler does a compile time conversion
>>
>> For the single character FPC will do a runtime conversion
>
> Why the different behaviour? Both are constant values? Both could be done at compile time?
>
> Is this only because "What Delphi does"? How about mode ObjFpc?

It's "what Delphi does", because the whole "code page aware string" concept is lifted from and modeled after Delphi.

Though I just noticed the following comment inside the "UnicodeChar -> AnsiString" conversion code:

    // compiler has different codepage than a system running an application
    // to prevent wrong codepage and data loss we are converting unicode char
    // using a helper routine. This is not delphi compatible behavior.
    // Delphi converts UniocodeChar to ansistring at the compile time

This was added to fix issue 21195 (https://bugs.freepascal.org/view.php?id=21195).

> Moreover (concluding from compiler warnings), if a char-constant is part of a constant expression that results in a string, then it appears to be converted at runtime?
>
>     u := #$2267 + '';
>     u := #$2267 + ansistring('');
>     u := #$2267 + rawbytestring('');
>     u := #$2267 + char(' ');
>
> or is the added string/char first converted to widechar/string, then added, and the result converted back?

These are all constant strings like in the case of #$2267#$0338 and thus are handled at compile time.
Title: Re: Unicode Constants
Post by: Remy Lebeau on September 18, 2020, 08:27:18 pm
> If I remember correctly this is indeed how this behaves in newer versions of Delphi if the left side is indeed an AnsiString.

The behavior has nothing to do with the type used on the left side of the assignment.  The behavior is controlled by the {$HIGHCHARUNICODE} (http://docwiki.embarcadero.com/RADStudio/en/HIGHCHARUNICODE_directive_(Delphi)) directive instead:

> When HIGHCHARUNICODE is OFF:
>
>     All decimal #xxx n-digit literals are parsed as AnsiChar.
>     All hexadecimal #$xx 2-digit literals are parsed as AnsiChar.
>     All hexadecimal #$xxxx 4-digit literals are parsed as WideChar.
>
> When HIGHCHARUNICODE is ON:
>
>     All literals are parsed as WideChar.

> As for Delphi String = UnicodeString you won't notice this, cause there won't be any conversion necessary.

In FPC, String=AnsiString in {$mode Delphi}, and String=UnicodeString in {$mode DelphiUnicode} and {$modeswitch UnicodeStrings}.
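
The same switch also answers the earlier "what's a char?" question. A minimal sketch with a made-up program name, assuming an FPC version that supports {$mode delphiunicode}:

```pascal
program CharWidth;
{$mode delphiunicode}   // change to {$mode delphi} and recompile to compare
begin
  // Char follows String: a 2-byte WideChar when String is UnicodeString,
  // a 1-byte AnsiChar when String is AnsiString
  WriteLn('SizeOf(Char) = ', SizeOf(Char));
end.
```

With {$mode delphiunicode} this is expected to report 2, and with {$mode delphi} it is expected to report 1.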
Title: Re: Unicode Constants
Post by: Grahame Grieve on September 19, 2020, 12:23:01 am
> Why do you expect 100% Delphi-compatibility?

Well, I don't. Clearly there are areas where there isn't. But when it comes to something as fundamental as unicode, and as ubiquitous as string handling, then I do expect a clearly documented way to write code that delivers source code compatibility, yes.
Title: Re: Unicode Constants
Post by: PascalDragon on September 19, 2020, 04:13:46 pm
>> If I remember correctly this is indeed how this behaves in newer versions of Delphi if the left side is indeed an AnsiString.
>
> The behavior has nothing to do with the type used on the left side of the assignment.  The behavior is controlled by the {$HIGHCHARUNICODE} (http://docwiki.embarcadero.com/RADStudio/en/HIGHCHARUNICODE_directive_(Delphi)) directive instead:
>
>> When HIGHCHARUNICODE is OFF:
>>
>>     All decimal #xxx n-digit literals are parsed as AnsiChar.
>>     All hexadecimal #$xx 2-digit literals are parsed as AnsiChar.
>>     All hexadecimal #$xxxx 4-digit literals are parsed as WideChar.
>>
>> When HIGHCHARUNICODE is ON:
>>
>>     All literals are parsed as WideChar.

FPC does not support that switch. And please also see what I mentioned further down in my post after I looked at the compiler's code.

>> As for Delphi String = UnicodeString you won't notice this, cause there won't be any conversion necessary.
>
> In FPC, String=AnsiString in {$mode Delphi}, and String=UnicodeString in {$mode DelphiUnicode} and {$modeswitch UnicodeStrings}.

I know that FPC supports that modeswitch, but right now it's essentially a joke and useless. Yes, String might be set to UnicodeString then, but the whole RTL still uses AnsiString. So overriding any virtual methods of RTL classes requires you to explicitly use AnsiString instead of String, thus providing even less compatibility with Delphi code than the current solution does.

>> Why do you expect 100% Delphi-compatibility?
>
> Well, I don't. Clearly there are areas where there isn't. But when it comes to something as fundamental as unicode, and as ubiquitous as string handling, then I do expect a clearly documented way to write code that delivers source code compatibility, yes.


You're expecting wrong. What we do is document code so that it generates working code, not code that delivers source code compatibility to Delphi. And the Unicode related behavior is extensively documented here (https://wiki.freepascal.org/FPC_Unicode_support).