Lazarus

Title: ANSI Chars GTK2 SVN
Post by: theo on August 24, 2007, 11:26:50 am
I'm not sure if this is a bug or if I'm missing something.

In 0.9.22 GTK2 this showed for example german umlauts correctly:

Code:
procedure TForm1.FormPaint(Sender: TObject);
var OText:String;
begin
  OText:= 'aou'+#228#246;
  Canvas.TextOut(10,10,UTF8Encode(OText));
end;  


This shows "aouäö" in the stable version.
In SVN Version (like 11856M) it shows "aou??"

Bug or am I missing something?

EDIT:
Just found out that it works in SVN Version, if I declare
var OText:WideString;

But
Canvas.TextOut(10,10,ANSIToUTF8(OText));
with AnsiString type doesn't.

However the behaviour changed from 0.9.22 to SVN.
Title: RE: ANSI Chars GTK2 SVN
Post by: felipemdc on August 24, 2007, 12:32:03 pm
Gtk 2 requires the use of UTF-8

I am sure that #228#246 doesn't represent äö under UTF-8, because each of these letters requires 2 bytes.

So I would say that the gtk2 interface works more correctly on svn.

Use an external editor to input the characters instead of hard-coding their values, and don't forget to save without the BOM.
Title: Re: RE: ANSI Chars GTK2 SVN
Post by: theo on August 24, 2007, 02:07:33 pm
Quote from: "sekel"

So I would say that the gtk2 interface works more correctly on svn.


Yes. As I said (see EDIT:), it works like this (note WideString instead of String).

Code:
procedure TForm1.FormPaint(Sender: TObject);
var OText:WideString;
begin
  OText:= 'aou'+#228#246;
  Canvas.TextOut(10,10,UTF8Encode(OText));
end;  


It's the ANSIToUTF8 which does not afaics.
Unless I don't understand what it is supposed to do.
But the ANSI Table http://www.mmvisual.de/Hilfe/BinTerm/T044.htm
should contain umlauts etc. so I expected the function to convert them to UTF-8.

But I can do now what I wanted. So it's not a problem for me.
Title: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: Vincent Snijders on August 24, 2007, 03:34:32 pm
Some remarks:
* Ansi is a standard for the values 0..127.
* I don't know what the compiler does with #228. It may depend on the current code page, or maybe it just assumes it to be a widechar. I think this behavior has also changed between fpc 2.0.4 and 2.3.1.
Title: Re: RE: ANSI Chars GTK2 SVN
Post by: felipemdc on August 24, 2007, 04:23:45 pm
Quote from: "theo"
It's the ANSIToUTF8 which does not afaics.
Unless I don't understand what it is supposed to do.
But the ANSI Table http://www.mmvisual.de/Hilfe/BinTerm/T044.htm
should contain umlauts etc. so I expected the function to convert them to UTF-8.


It is trickier than that.

Let's see this code here:

var
  OText:String;
begin
  OText:= 'aou'+#228#246;
  Canvas.TextOut(10,10,ANSIToUTF8(OText));

Supposing your source file doesn't have a UTF-8 BOM, the Free Pascal Compiler will interpret your file as being ISO encoded.

It will then convert the string which is being assigned to OText into UTF-16, and store that in your executable.

At run-time your executable will get this UTF-16 string and attempt to convert it to what it thinks the operating system uses.

So you don't have any guarantee that you are receiving an ansi string in OText to start with. What you type in the source code and what is stored in the executable are 2 different things.

There are many combinations of possibilities which would explain what is going on here.

For me this is just another example of how solving bug 9305 on Free Pascal would make things easier when dealing with encodings:

http://www.freepascal.org/mantis/view.php?id=9305
Title: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: theo on August 24, 2007, 04:57:17 pm
I'm confused..... ;-)

So what am I supposed to do if I want to show ANSI Char #182 (¶) ?
I thought the ANSI Table is a language independent standard and going from 0..255, and is ASCII compatible from 0..127.

Afaics, this should be also compatible with Unicode 0080..00ff
http://jrgraphix.net/research/unicode_blocks.php?block=1

Please tell me if I'm wrong.
Title: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: felipemdc on August 24, 2007, 06:58:03 pm
Quote
.....Please tell me if I'm wrong.


You are right, but you are forgetting that what you type in the source code isn't what is stored in the executable.

Free Pascal supports only UTF-16 encoded strings in the executable. And if you write your source code in something else, it automatically converts it to UTF-16. To get anything else at run-time a conversion is necessary.

And there is currently no way to force at compile time what the string will be converted to. Currently it is up to the runtime library to choose. On some operating systems it always converts to ISO, on some it tries to detect the system encoding, etc. And that also varies with the compiler version.

This would be solved by implementing the feature request I mentioned.
Title: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: theo on August 24, 2007, 07:32:51 pm
Humm, not sure if I really understand the problem you mentioned.

But would you agree, that it is safe to do it like this to reliably show ¶ on GTK2

procedure TForm1.FormPaint(Sender: TObject);
var OText:WideString;
begin
  OText:=WideChar($00B6);
  Canvas.TextOut(10,10,UTF8Encode(OText));
end;  

I can hardly imagine how this could be misinterpreted if the widgetset supports unicode (resp. UTF-8)
Right or wrong?
Title: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: felipemdc on August 24, 2007, 09:22:17 pm
The problems I described only apply if OText is an ansistring, like in the first example.

They don't apply to this last code you posted, but here I would see another problem. You are manually setting a value to the widestring which isn't a valid widechar. I don't think this last code you posted will work.

You should put the UTF-16 value for the character you want on this statement:

OText:=WideChar($00B6);

The automatic conversion I was talking about should only apply to constant strings, and possibly (I don't know for sure) to constant strings mixed with pure chars.

There won't be automatic conversion if you set your own numeric value for a WideChar and set it to a WideString.
Title: Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: theo on August 24, 2007, 10:09:33 pm
Quote from: "sekel"
You are manually setting a value to the widestring which isn't a valid widechar. I don't think this last code you posted will work.


Yes, it works. I always test such simple code before posting ;-)
Why shouldn't this be a valid Widechar?

Even more interesting: This same last code shows the ¶ even on GTK1!!

Where this:

var txt:String;
begin
txt:= #182;
Canvas.TextOut(10,10,txt);  

On GTK1 only shows a question mark.

My $LANG: de_DE.UTF-8

I'm more and more confused..... ;-)
Thanks for your time!

P.S. Talking about SVN 11856

P.S. 2 How would a UTF-16 Value look different?
Title: Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: felipemdc on August 25, 2007, 12:14:09 am
Quote from: "theo"
Why shouldn't this be a valid Widechar?


I don't have much experience with UTF-16, so this was just a guess, but I looked at Wikipedia and it seems that I was wrong.

http://en.wikipedia.org/wiki/UTF-16/UCS-2
Title: RE: Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: theo on August 25, 2007, 11:39:38 am
As I understand it, a WideChar is simply a double byte character and there's no magic about it ($0..$FFFF)
It represents one of these glyphs: http://jrgraphix.net/research/unicode_blocks.php?block=0.
In your pascal code, you can treat it the same way you would AnsiString.
But you have to convert from/to UTF-X if reading from file or sending to the Widgetset.
This is done with UTF8Encode/Decode for example. There may be a BOM in a file.
While it's a bit wasteful to use WideString memory wise, it's easy to handle.
Right or wrong? ;-)
Title: RE: Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: felipemdc on August 25, 2007, 03:33:26 pm
Right.
Title: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: Phil on August 26, 2007, 09:53:16 pm
Quote from: "theo"
I'm confused..... ;-)

So what am I supposed to do if I want to show ANSI Char #182 (¶) ?
I thought the ANSI Table is a language independent standard and going from 0..255, and is ASCII compatible from 0..127.


You're right, this Unicode stuff seems more complicated than it should be.

Traditionally with Delphi and its AnsiString-based VCL, few people worried about Unicode since they could pass upper-ASCII chars and see them properly displayed on their screen. But several of the widgetsets (Carbon, GTK2, Qt) require that all strings passed to the LCL be UTF8 encoded, meaning you can't send an upper-ASCII char to the LCL as a single byte. This has the benefit of full Unicode support, although as you've discovered most of the problems occur with the upper-ASCII chars (what's called the "Latin-1 Supplement").

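The char you're asking about is ANSI $B6. The UTF-16 value for this char is $00B6. And the UTF-8 value for this char is the two bytes $C2 $B6. Chars in the $80..$BF half of the Latin-1 Supplement have C2 for the first byte of their UTF8 encoded forms and the ANSI value for the second byte. Chars in the $C0..$FF half (which includes the umlauts) have C3 for the first byte and the ANSI value minus $40 for the second byte, so ä ($E4) becomes $C3 $A4.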
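
To make the byte arithmetic concrete, here is a minimal sketch that encodes the bytes by hand instead of calling AnsiToUTF8. It assumes the input bytes are ISO-8859-1 (Latin-1) values, and the helper name Latin1BytesToUtf8 is made up for this example; it is not an RTL or LCL function. The chars are built with Chr() so that the source file encoding cannot interfere.

```pascal
program Latin1ToUtf8Sketch;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of the RTL or LCL): treats every byte of S
// as an ISO-8859-1 code point and encodes it as UTF-8 by hand.
function Latin1BytesToUtf8(const S: string): string;
var
  i, j: Integer;
  b: Byte;
begin
  Result := '';
  SetLength(Result, 2 * Length(S));  // worst case: two bytes per input char
  j := 0;
  for i := 1 to Length(S) do
  begin
    b := Ord(S[i]);
    if b < $80 then
    begin
      // plain ASCII is copied unchanged
      Inc(j);
      Result[j] := Chr(b);
    end
    else
    begin
      // $80..$FF: lead byte is $C2 or $C3, trail byte is $80..$BF
      Inc(j);
      Result[j] := Chr($C0 or (b shr 6));
      Inc(j);
      Result[j] := Chr($80 or (b and $3F));
    end;
  end;
  SetLength(Result, j);  // trim to the bytes actually written
end;

var
  u: string;
begin
  u := Latin1BytesToUtf8(Chr($B6));
  WriteLn(Ord(u[1]), ' ', Ord(u[2]));  // 194 182 ($C2 $B6)
  u := Latin1BytesToUtf8(Chr($E4));
  WriteLn(Ord(u[1]), ' ', Ord(u[2]));  // 195 164 ($C3 $A4)
end.
```

This only covers Latin-1 input. For any other ANSI code page the bytes above $7F map to different Unicode values, so a real conversion function is needed there.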

One problem is that if you have existing code, you'll need to find all the places where you're sending strings to the LCL and use something like AnsiToUTF8 on them if the widgetset requires UTF8, but don't encode them for Win32 (although this will probably change someday, even with Delphi) or GTK1. With Win32 and GTK1, just pass the strings on unchanged, with any upper-ASCII chars embedded in them.
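
One way to keep that decision in a single place is a small wrapper, roughly like the sketch below. The helper name ToWidgetsetText is made up, and the LCLgtk2, LCLcarbon and LCLqt defines are an assumption about what the Lazarus build passes to the compiler for each widgetset, so check your own build options before relying on them. Compiled as a plain console program (none of the defines set), it just passes the string through.

```pascal
program WidgetsetTextSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: returns S in the form the widgetset is assumed to expect.
// The LCL* defines are an assumption about the Lazarus build flags.
function ToWidgetsetText(const S: string): string;
begin
  {$IF defined(LCLgtk2) or defined(LCLcarbon) or defined(LCLqt)}
  // These widgetsets want UTF-8, so convert from the system ANSI encoding.
  Result := AnsiToUtf8(S);
  {$ELSE}
  // Win32 and GTK1: pass the string on unchanged.
  Result := S;
  {$ENDIF}
end;

begin
  // Pure ASCII text is the same in both branches.
  WriteLn(ToWidgetsetText('aou'));
end.
```

In an LCL app you would then write something like Canvas.TextOut(10, 10, ToWidgetsetText(OText)) everywhere a string goes to the LCL. Note that this still depends on AnsiToUTF8 doing the right thing for chars above 127 on your system.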

Sekel is correct that there are other complications possible, including BOM and strings embedded in the executable. One advantage to this approach, though, is that it allows you to continue using the more efficient AnsiString instead of WideString.

Thanks.

-Phil
Title: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: Vincent Snijders on August 26, 2007, 10:31:05 pm
Phil, thanks for the clear explanation.

I hope we can make the switch to UTF8 on windows too in Lazarus 0.9.26.
Title: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: theo on August 26, 2007, 11:30:34 pm
Thanks Phil for the explanation.

There's still one thing I don't understand:

Three tests:

1. Memo1.text:=UTF8Encode(WideChar($00B6));
2. Memo1.text:=AnsiToUTF8(#$B6);
3. Memo1.text:=#$B6;    

On Lazarus 0.9.22 GTK1:
1. shows the ¶
2. shows the ¶
3. shows nothing

But on SVN 11856 GTK1:
1. shows the ¶
2. shows a ?
3. shows nothing

I'm using the same compiler for both versions. The conversion functions are defined in the FPC RTL.
Isn't this strange?
Title: Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: Phil on August 27, 2007, 01:41:01 am
Quote from: "theo"
I'm using the same compiler for both versions. The conversion functions are defined in the FPC RTL.
Isn't this strange?


Check to make sure the 2 functions are returning the same strings:

var
  s : string;
begin
  s := UTF8Encode(WideChar($00B6));
  writeln(byte(s[1]), ' ', byte(s[2]));
  s := AnsiToUTF8(#$B6);
  writeln(byte(s[1]), ' ', byte(s[2]));
end.

You should get 194 ($C2) and 182 ($B6) for both.

I tested it here on Windows and OS X and everything checks out.
Title: RE: Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: theo on August 27, 2007, 12:09:20 pm
I'm really getting different results from this using the same compiler:

Lazarus 0.9.22 GTK1:
194 182
194 182

SVN 11856 GTK1:
194 182
63 0

I've installed the stable version as root and the SVN version under ~/
Different environment? But I have no start-scripts. My $LANG is "de_CH.UTF-8" for both.
Title: Re: RE: Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: Phil on August 27, 2007, 03:30:17 pm
Quote from: "theo"
I'm really getting different results from this using the same compiler:

Lazarus 0.9.22 GTK1:
194 182
194 182

SVN 11856 GTK1:
194 182
63 0

I've installed the stable version as root and the SVN version under ~/
Different environment? But I have no start-scripts. My $LANG is "de_CH.UTF-8" for both.


That certainly looks like a bug in the SVN AnsiToUTF8 function. One last thing to test:

s := AnsiToUTF8(Chr($B6));

Just to make sure the compiler isn't mangling the character stored in the executable.

Please post the sample code when you make your bug report on Mantis.

Thanks.
Title: RE: Re: RE: Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: theo on August 27, 2007, 08:20:25 pm
I don't think so because this function is defined in the FPC sources and I have only one version of FPC installed.
Both, Stable and SVN Version of Lazarus use the same compiler and RTL/FCL.
Title: Re: RE: Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: Phil on September 01, 2007, 10:41:31 pm
Quote from: "theo"
I'm really getting different results from this using the same compiler:

Lazarus 0.9.22 GTK1:
194 182
194 182

SVN 11856 GTK1:
194 182
63 0


Theo,

Did you ever solve this problem?

It appears as though these conversion functions return 63 ("?") if they encounter a character they can't convert. There shouldn't be a problem with #$B6. However, I've discovered that if you include the LCL Translations unit in your app, this uses the cwstring unit, which uses libc for wide string support (assuming LCL was compiled with DisableCWString not defined). I think there's a bug in cwstring with >127 ANSI chars.

Thanks.

-Phil
Title: RE: Re: RE: Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: theo on September 01, 2007, 11:20:57 pm
> Did you ever solve this problem?
No. Because it's not a real problem for me now, just a mystery, and I have no idea where to "attack" ;-)
I'll check the "cwstring" later.
Title: Re: RE: Re: RE: Re: RE: Re: RE: Re: RE: ANSI Chars GTK2 SVN
Post by: felipemdc on September 02, 2007, 12:27:25 am
Quote from: "theo"
No. Because it's not a real problem for me now, just a mystery, and I have no idea where to "attack" ;-)


I would try asking on the Free Pascal mailing list. Maybe the compiler experts have an idea of what is going wrong.