
Author Topic: ANSI Chars GTK2 SVN  (Read 18700 times)

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1890
ANSI Chars GTK2 SVN
« on: August 24, 2007, 11:26:50 am »
I'm not sure if this is a bug or if I'm missing something.

In 0.9.22 with GTK2, the following code displayed German umlauts correctly:

Code:
procedure TForm1.FormPaint(Sender: TObject);
var OText:String;
begin
  OText:= 'aou'+#228#246;
  Canvas.TextOut(10,10,UTF8Encode(OText));
end;  


This shows "aouäö" in the stable version.
In the SVN version (e.g. 11856M) it shows "aou??".

Bug or am I missing something?

EDIT:
Just found out that it works in the SVN version if I declare
var OText:WideString;

But
Canvas.TextOut(10,10,ANSIToUTF8(OText));
with OText declared as AnsiString doesn't.

Either way, the behaviour changed from 0.9.22 to SVN.

felipemdc

  • Administrator
  • Hero Member
  • *
  • Posts: 3541
RE: ANSI Chars GTK2 SVN
« Reply #1 on: August 24, 2007, 12:32:03 pm »
GTK2 requires the use of UTF-8.

I am sure that #228#246 doesn't represent äö in UTF-8, because each of these letters requires 2 bytes there.

So I would say that the gtk2 interface works more correctly on svn.

Use an external editor to input the characters instead of hard-coding their values, and don't forget to save the file without a BOM.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1890
Re: RE: ANSI Chars GTK2 SVN
« Reply #2 on: August 24, 2007, 02:07:33 pm »
Quote from: "sekel"

So I would say that the gtk2 interface works more correctly on svn.


Yes. As I said (see EDIT), it works like this (note WideString instead of String):

Code:
procedure TForm1.FormPaint(Sender: TObject);
var OText:WideString;
begin
  OText:= 'aou'+#228#246;
  Canvas.TextOut(10,10,UTF8Encode(OText));
end;  


It's ANSIToUTF8 that doesn't work, as far as I can see.
Unless I misunderstand what it is supposed to do.
But the ANSI Table http://www.mmvisual.de/Hilfe/BinTerm/T044.htm
should contain umlauts etc. so I expected the function to convert them to UTF-8.

But I can do now what I wanted. So it's not a problem for me.

Vincent Snijders

  • Administrator
  • Hero Member
  • *
  • Posts: 2661
    • My Lazarus wiki user page
RE: Re: RE: ANSI Chars GTK2 SVN
« Reply #3 on: August 24, 2007, 03:34:32 pm »
Some remarks:
* Only the values 0..127 (ASCII) are standardized. What "ANSI" means for 128..255 depends on the code page.
* I don't know what the compiler does with #228. It may depend on the current code page, or maybe it just assumes it to be a WideChar. I think this behavior also changed between FPC 2.0.4 and 2.3.1.

felipemdc

  • Administrator
  • Hero Member
  • *
  • Posts: 3541
Re: RE: ANSI Chars GTK2 SVN
« Reply #4 on: August 24, 2007, 04:23:45 pm »
Quote from: "theo"
It's the ANSIToUTF8 which does not afaics.
Unless I don't understand what it is supposed to do.
But the ANSI Table http://www.mmvisual.de/Hilfe/BinTerm/T044.htm
should contain umlauts etc. so I expected the function to convert them to UTF-8.


It is trickier than that.

Let's see this code here:

var
  OText:String;
begin
  OText:= 'aou'+#228#246;
  Canvas.TextOut(10,10,ANSIToUTF8(OText));

Supposing your source file doesn't have a UTF-8 BOM, the Free Pascal Compiler will interpret your file as being ISO encoded.

It will then convert the string being assigned to OText into UTF-16, and store that in your executable.

At run-time your executable will take this UTF-16 string and attempt to convert it to whatever encoding it thinks the operating system uses.

So you don't have any guarantee that you are receiving an ANSI string in OText to start with. What you type in the source code and what is stored in the executable are two different things.

There are many combinations of possibilities which would explain what is going on here.

For me this is just another example of how solving bug 9305 on Free Pascal would make things easier when dealing with encodings:

http://www.freepascal.org/mantis/view.php?id=9305

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1890
RE: Re: RE: ANSI Chars GTK2 SVN
« Reply #5 on: August 24, 2007, 04:57:17 pm »
I'm confused..... ;-)

So what am I supposed to do if I want to show ANSI Char #182 (¶) ?
I thought the ANSI Table is a language independent standard and going from 0..255, and is ASCII compatible from 0..127.

Afaics, this should be also compatible with Unicode 0080..00ff
http://jrgraphix.net/research/unicode_blocks.php?block=1

Please tell me if I'm wrong.

felipemdc

  • Administrator
  • Hero Member
  • *
  • Posts: 3541
Re: RE: Re: RE: ANSI Chars GTK2 SVN
« Reply #6 on: August 24, 2007, 06:58:03 pm »
Quote
.....Please tell me if I'm wrong.


You are right, but you are forgetting that what you type in the source code isn't what is stored in the executable.

Free Pascal supports only UTF-16 encoded strings in the executable. If you write your source code in some other encoding, it automatically converts it to UTF-16. To get anything else at run-time, a conversion is necessary.

There is currently no way to specify at compile time which encoding the string will be converted to; the runtime library chooses. On some operating systems it always converts to ISO, on others it tries to detect the system encoding, and so on. This also varies with the compiler version.

This would be solved by implementing the feature request I mentioned.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1890
RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
« Reply #7 on: August 24, 2007, 07:32:51 pm »
Humm, not sure if I really understand the problem you mentioned.

But would you agree that it is safe to do it like this to reliably show ¶ on GTK2?

procedure TForm1.FormPaint(Sender: TObject);
var OText:WideString;
begin
  OText:=WideChar($00B6);
  Canvas.TextOut(10,10,UTF8Encode(OText));
end;  

I can hardly imagine how this could be misinterpreted if the widgetset supports Unicode (or UTF-8, respectively).
Right or wrong?

felipemdc

  • Administrator
  • Hero Member
  • *
  • Posts: 3541
RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
« Reply #8 on: August 24, 2007, 09:22:17 pm »
The problems I described only apply if OText is an AnsiString, as in the first example.

They don't apply to this last code you posted, but here I would see another problem. You are manually setting a value to the widestring which isn't a valid widechar. I don't think this last code you posted will work.

You should put the UTF-16 value for the character you want on this statement:

OText:=WideChar($00B6);

The automatic conversion I was talking about should only apply to constant strings, and possibly (I don't know for sure) to constant strings mixed with pure chars.

There won't be automatic conversion if you set your own numeric value for a WideChar and set it to a WideString.

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1890
Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
« Reply #9 on: August 24, 2007, 10:09:33 pm »
Quote from: "sekel"
You are manually setting a value to the widestring which isn't a valid widechar. I don't think this last code you posted will work.


Yes, it works. I always test such simple code before posting ;-)
Why shouldn't this be a valid Widechar?

Even more interesting: This same last code shows the ¶ even on GTK1!!

Whereas this:

var txt:String;
begin
txt:= #182;
Canvas.TextOut(10,10,txt);  

shows only a question mark on GTK1.

My $LANG: de_DE.UTF-8

I'm more and more confused..... ;-)
Thanks for your time!

P.S. Talking about SVN 11856

P.S. 2 How would a UTF-16 Value look different?

felipemdc

  • Administrator
  • Hero Member
  • *
  • Posts: 3541
Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
« Reply #10 on: August 25, 2007, 12:14:09 am »
Quote from: "theo"
Why shouldn't this be a valid Widechar?


I don't have much experience with UTF-16, so this was just a guess, but I looked at Wikipedia and it seems that I was wrong.

http://en.wikipedia.org/wiki/UTF-16/UCS-2

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1890
RE: Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
« Reply #11 on: August 25, 2007, 11:39:38 am »
As I understand it, a WideChar is simply a double byte character and there's no magic about it ($0..$FFFF)
It represents one of these glyphs: http://jrgraphix.net/research/unicode_blocks.php?block=0.
In your Pascal code, you can treat a WideString the same way you would an AnsiString.
But you have to convert from/to UTF-X if reading from file or sending to the Widgetset.
This is done with UTF8Encode/Decode for example. There may be a BOM in a file.
While it's a bit wasteful to use WideString memory wise, it's easy to handle.
Right or wrong? ;-)
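
A minimal sketch of the WideString-to-UTF-8 step described above, as a console program so that it does not depend on any widgetset. HexBytes is a hypothetical helper added only for this illustration; UTF8Encode and IntToHex are the standard RTL routines. It assumes a Free Pascal version that provides the UTF8String type.

```pascal
program ShowUtf8Bytes;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: lists the bytes of S as space-separated hex values.
function HexBytes(const S: UTF8String): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[i]), 2);
  end;
end;

var
  W: WideString;
begin
  // One UTF-16 code unit: U+00B6, the pilcrow sign.
  W := WideChar($00B6);
  // UTF8Encode turns it into the two UTF-8 bytes a UTF-8 widgetset expects.
  WriteLn(HexBytes(UTF8Encode(W)));  // prints: C2 B6
end.
```

Because the character is given as a numeric code unit, the result does not depend on the encoding of the source file.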

felipemdc

  • Administrator
  • Hero Member
  • *
  • Posts: 3541
RE: Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
« Reply #12 on: August 25, 2007, 03:33:26 pm »
Right.

Phil

  • Hero Member
  • *****
  • Posts: 2750
Re: RE: Re: RE: ANSI Chars GTK2 SVN
« Reply #13 on: August 26, 2007, 09:53:16 pm »
Quote from: "theo"
I'm confused..... ;-)

So what am I supposed to do if I want to show ANSI Char #182 (¶) ?
I thought the ANSI Table is a language independent standard and going from 0..255, and is ASCII compatible from 0..127.


You're right, this Unicode stuff seems more complicated than it should be.

Traditionally with Delphi and its AnsiString-based VCL, few people worried about Unicode since they could pass upper-ASCII chars and see them properly displayed on their screen. But several of the widgetsets (Carbon, GTK2, Qt) require that all strings passed to the LCL be UTF8 encoded, meaning you can't send an upper-ASCII char to the LCL as a single byte. This has the benefit of full Unicode support, although as you've discovered most of the problems occur with the upper-ASCII chars (what's called the "Latin-1 Supplement").

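The char you're asking about is ANSI $B6. The UTF-16 value for this char is $00B6, and its UTF-8 encoding is the two bytes $C2 $B6. Chars in the Latin-1 Supplement from $80 to $BF have C2 as the first byte of their UTF-8 encoded form and the ANSI value as the second byte. Chars from $C0 to $FF have C3 as the first byte and the ANSI value minus $40 as the second byte (for example ä, ANSI $E4, becomes $C3 $A4).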
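
The byte arithmetic behind these values can be sketched by hand, without relying on AnsiToUTF8 or on the system code page. Latin1ToUTF8 and HexBytes are hypothetical helpers written for this illustration, and the sketch assumes every input byte is an ISO-8859-1 (Latin-1) code point. Note that chars from $C0 upward get C3, not C2, as their first byte.

```pascal
program Latin1Demo;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: treats each byte of S as a Latin-1 code point
// and returns the UTF-8 byte sequence for it.
function Latin1ToUTF8(const S: AnsiString): AnsiString;
var
  i, n: Integer;
  b: Byte;
begin
  SetLength(Result, 2 * Length(S));  // worst case: two bytes per char
  n := 0;
  for i := 1 to Length(S) do
  begin
    b := Ord(S[i]);
    if b < $80 then
    begin
      // Plain ASCII is copied unchanged.
      Inc(n);
      Result[n] := Chr(b);
    end
    else
    begin
      // Lead byte is $C2 for $80..$BF and $C3 for $C0..$FF.
      Inc(n);
      Result[n] := Chr($C0 or (b shr 6));
      // Continuation byte carries the low six bits.
      Inc(n);
      Result[n] := Chr($80 or (b and $3F));
    end;
  end;
  SetLength(Result, n);
end;

// Hypothetical helper: lists the bytes of S as space-separated hex values.
function HexBytes(const S: AnsiString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[i]), 2);
  end;
end;

var
  Src: AnsiString;
begin
  // Build the input byte by byte so the source file encoding plays no role.
  SetLength(Src, 2);
  Src[1] := Chr($B6);  // pilcrow in Latin-1
  Src[2] := Chr($E4);  // a-umlaut in Latin-1
  WriteLn(HexBytes(Latin1ToUTF8(Src)));  // prints: C2 B6 C3 A4
end.
```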

One problem is that if you have existing code, you'll need to find all the places where you're sending strings to the LCL and use something like AnsiToUTF8 on them if the widgetset requires UTF8, but don't encode them for Win32 (although this will probably change someday, even with Delphi) or GTK1. With Win32 and GTK1, just pass the strings on unchanged, with any upper-ASCII chars embedded in them.

Sekel is correct that other complications are possible, including the BOM and strings embedded in the executable. One advantage of this approach, though, is that it allows you to continue using the more efficient AnsiString instead of WideString.

Thanks.

-Phil

Vincent Snijders

  • Administrator
  • Hero Member
  • *
  • Posts: 2661
    • My Lazarus wiki user page
RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
« Reply #14 on: August 26, 2007, 10:31:05 pm »
Phil, thanks for the clear explanation.

I hope we can make the switch to UTF8 on windows too in Lazarus 0.9.26.