Lazarus
Miscellaneous => Suggestions => LCL => Topic started by: theo on August 24, 2007, 11:26:50 am
-
I'm not sure if this is a bug or if I'm missing something.
In 0.9.22 GTK2 this displayed German umlauts, for example, correctly:
procedure TForm1.FormPaint(Sender: TObject);
var
  OText: String;
begin
  OText := 'aou' + #228#246;
  Canvas.TextOut(10, 10, UTF8Encode(OText));
end;
This shows "aouäö" in the stable version.
In SVN Version (like 11856M) it shows "aou??"
Bug or am I missing something?
EDIT:
Just found out that it works in SVN Version, if I declare
var OText:WideString;
But
Canvas.TextOut(10,10,ANSIToUTF8(OText));
with AnsiString type doesn't.
However the behaviour changed from 0.9.22 to SVN.
-
GTK 2 requires the use of UTF-8.
I am sure that #228#246 doesn't represent äö under UTF-8, because each of these letters requires 2 bytes.
So I would say that the gtk2 interface works more correctly on SVN.
Use an external editor to input the characters instead of hard-coding their values, and don't forget to save without the BOM.
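To make the two-bytes-per-letter point concrete, here is a minimal sketch that prints the raw bytes of "aouäö" once in a single-byte Latin-1 encoding and once in UTF-8. The program name and the DumpBytes helper are made up for illustration. The bytes are given as numeric arrays on purpose, so the output does not depend on how the compiler treats string constants.

```pascal
program Utf8Bytes;
{$mode objfpc}{$H+}

const
  // "aouäö" in ISO-8859-1: one byte per character (ä = E4, ö = F6)
  Latin1Text: array[0..4] of Byte = ($61, $6F, $75, $E4, $F6);
  // "aouäö" in UTF-8: ä = C3 A4, ö = C3 B6 (two bytes each)
  Utf8Text: array[0..6] of Byte = ($61, $6F, $75, $C3, $A4, $C3, $B6);

// Hypothetical helper: prints a caption followed by each byte as two hex digits.
procedure DumpBytes(const Caption: string; const Bytes: array of Byte);
var
  i: Integer;
begin
  Write(Caption, ':');
  for i := Low(Bytes) to High(Bytes) do
    Write(' ', HexStr(Bytes[i], 2));
  WriteLn;
end;

begin
  DumpBytes('Latin-1', Latin1Text);
  DumpBytes('UTF-8', Utf8Text);
end.
```

The Latin-1 line has five bytes and the UTF-8 line has seven, which is why passing the bytes #228#246 unchanged to a UTF-8 widgetset cannot show äö.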
-
> So I would say that the gtk2 interface works more correctly on svn.
Yes. As I said (see EDIT:), it works like this (note WideString instead of String).
procedure TForm1.FormPaint(Sender: TObject);
var
  OText: WideString;
begin
  OText := 'aou' + #228#246;
  Canvas.TextOut(10, 10, UTF8Encode(OText));
end;
It's ANSIToUTF8 that doesn't work, afaics.
Unless I don't understand what it is supposed to do.
But the ANSI Table http://www.mmvisual.de/Hilfe/BinTerm/T044.htm
should contain umlauts etc. so I expected the function to convert them to UTF-8.
But I can do now what I wanted. So it's not a problem for me.
-
Some remarks:
* "ANSI" is only standardized for the values 0..127 (the ASCII range); the meaning of 128..255 depends on the code page.
* I don't know what the compiler does with #228. It may depend on the current code page, or it may just assume it to be a WideChar. I think this behavior also changed between FPC 2.0.4 and 2.3.1.
-
> It's the ANSIToUTF8 which does not afaics.
> Unless I don't understand what it is supposed to do.
> But the ANSI Table http://www.mmvisual.de/Hilfe/BinTerm/T044.htm
> should contain umlauts etc. so I expected the function to convert them to UTF-8.
It is trickier than that.
Let's see this code here:
var
  OText: String;
begin
  OText := 'aou' + #228#246;
  Canvas.TextOut(10, 10, ANSIToUTF8(OText));
end;
Supposing your source file doesn't have a UTF-8 BOM, the Free Pascal Compiler will interpret your file as being ISO encoded.
It will then convert the string which is being assigned to OText into UTF-16, and store that in your executable.
At run-time your executable will take this UTF-16 string and attempt to convert it to whatever encoding it thinks the operating system uses.
So you don't have any guarantee that you are receiving an ANSI string in OText to start with. What you type in the source code and what is stored in the executable are two different things.
There are many combinations of possibilities which would explain what is going on here.
For me this is just another example of how solving bug 9305 on Free Pascal would make things easier when dealing with encodings:
http://www.freepascal.org/mantis/view.php?id=9305
-
I'm confused..... ;-)
So what am I supposed to do if I want to show ANSI Char #182 (¶) ?
I thought the ANSI Table is a language independent standard and going from 0..255, and is ASCII compatible from 0..127.
Afaics, this should be also compatible with Unicode 0080..00ff
http://jrgraphix.net/research/unicode_blocks.php?block=1
Please tell me if I'm wrong.
-
> .....Please tell me if I'm wrong.
You are right, but you are forgetting that what you type in the source code isn't what is stored in the executable.
Free Pascal supports only UTF-16 encoded strings in the executable. And if you write your source code in something else, it automatically converts it to UTF-16. To get anything else at run-time, a conversion is necessary.
And there is currently no way to force at compile time what the string will be converted to. Currently it is up to the run-time library to choose. On some operating systems it always converts to ISO, on some it tries to detect the system encoding, etc. And that also varies with the compiler version.
This would be solved by implementing the feature request I mentioned.
-
Humm, not sure if I really understand the problem you mentioned.
But would you agree that it is safe to do it like this to reliably show ¶ on GTK2?
procedure TForm1.FormPaint(Sender: TObject);
var
  OText: WideString;
begin
  OText := WideChar($00B6);
  Canvas.TextOut(10, 10, UTF8Encode(OText));
end;
I can hardly imagine how this could be misinterpreted if the widgetset supports Unicode (or rather UTF-8).
Right or wrong?
-
The problems I described only apply if OText is an AnsiString, as in the first example.
They don't apply to this last code you posted, but here I would see another problem. You are manually setting a value to the WideString which isn't a valid WideChar. I don't think this last code you posted will work.
You should put the UTF-16 value for the character you want on this statement:
OText:=WideChar($00B6);
The automatic conversion I was talking about should only apply to constant strings, and possibly (I don't know for sure) to constant strings mixed with pure chars.
There won't be automatic conversion if you set your own numeric value for a WideChar and set it to a WideString.
-
> You are manually setting a value to the widestring which isn't a valid widechar. I don't think this last code you posted will work.
Yes, it works. I always test such simple code before posting ;-)
Why shouldn't this be a valid Widechar?
Even more interesting: This same last code shows the ¶ even on GTK1!!
Whereas this:
var
  txt: String;
begin
  txt := #182;
  Canvas.TextOut(10, 10, txt);
end;
only shows a question mark on GTK1.
My $LANG: de_DE.UTF-8
I'm more and more confused..... ;-)
Thanks for your time!
P.S. Talking about SVN 11856
P.S. 2 How would a UTF-16 Value look different?
-
> Why shouldn't this be a valid Widechar?
I don't have much experience with UTF-16, so this was just a guess, but I looked at Wikipedia and it seems that I was wrong.
http://en.wikipedia.org/wiki/UTF-16/UCS-2
-
As I understand it, a WideChar is simply a double byte character and there's no magic about it ($0..$FFFF)
It represents one of these glyphs: http://jrgraphix.net/research/unicode_blocks.php?block=0.
In your Pascal code, you can treat a WideString the same way you would an AnsiString.
But you have to convert from/to UTF-X when reading from a file or sending to the widgetset.
This is done with UTF8Encode/UTF8Decode, for example. There may be a BOM in a file.
While it's a bit wasteful memory-wise to use WideString, it's easy to handle.
Right or wrong? ;-)
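The workflow described here (keep text in a WideString inside the program, convert only at the boundary) can be sketched as a small console round trip. This is an illustration under the assumption that UTF8Encode and UTF8Decode behave as reported in this thread; the program and variable names are made up.

```pascal
program WideRoundTrip;
{$mode objfpc}{$H+}

var
  W, Back: WideString;
  U: UTF8String;
  i: Integer;
begin
  // Inside the program: one WideChar per character, here U+00B6 (pilcrow).
  W := WideChar($00B6);

  // At the boundary (widgetset, file): convert to UTF-8.
  U := UTF8Encode(W);

  // Show the encoded bytes in hex; the thread reports 194 182 (C2 B6).
  for i := 1 to Length(U) do
    Write(HexStr(Ord(U[i]), 2), ' ');
  WriteLn;

  // Coming back in: decode UTF-8 into a WideString again and compare.
  Back := UTF8Decode(U);
  WriteLn('Round trip equal: ', Back = W);
end.
```

In an LCL application the UTF8Encode result would go to something like Canvas.TextOut instead of being dumped to the console.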
-
Right.
-
> I'm confused..... ;-)
> So what am I supposed to do if I want to show ANSI Char #182 (¶) ?
> I thought the ANSI Table is a language independent standard and going from 0..255, and is ASCII compatible from 0..127.
You're right, this Unicode stuff seems more complicated than it should be.
Traditionally with Delphi and its AnsiString-based VCL, few people worried about Unicode since they could pass upper-ASCII chars and see them properly displayed on their screen. But several of the widgetsets (Carbon, GTK2, Qt) require that all strings passed to the LCL be UTF8 encoded, meaning you can't send an upper-ASCII char to the LCL as a single byte. This has the benefit of full Unicode support, although as you've discovered most of the problems occur with the upper-ASCII chars (what's called the "Latin-1 Supplement").
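The char you're asking about is ANSI $B6. The UTF-16 value for this char is $00B6. And the UTF-8 encoding of this char is the two bytes $C2 $B6. Chars in the first half of the Latin-1 Supplement ($80..$BF) have C2 for the first byte of their UTF8 encoded forms and the ANSI value for the second byte; chars in the second half ($C0..$FF) have C3 for the first byte and the ANSI value minus $40 for the second (ä, $E4, becomes $C3 $A4).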
One problem is that if you have existing code, you'll need to find all the places where you're sending strings to the LCL and use something like AnsiToUTF8 on them if the widgetset requires UTF8, but don't encode them for Win32 (although this will probably change someday, even with Delphi) or GTK1. With Win32 and GTK1, just pass the strings on unchanged, with any upper-ASCII chars embedded in them.
Sekel is correct that there are other complications possible, including BOM and strings embedded in the executable. One advantage to this approach, though, is that it allows you to continue using the more efficient AnsiString instead of WideString.
Thanks.
-Phil
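The mapping from a Latin-1 byte to its UTF-8 form is simple enough to do by hand, which can be handy as a cross-check when an RTL conversion returns something unexpected. The following is only a sketch: Latin1ToUTF8 is a hypothetical helper, not an RTL or LCL function, and it assumes the input really is ISO-8859-1 (it does not handle the Windows-1252 extras in $80..$9F).

```pascal
program Latin1Demo;
{$mode objfpc}{$H+}

// Hypothetical helper: treats every byte of S as an ISO-8859-1 code point
// and returns the corresponding UTF-8 bytes.
function Latin1ToUTF8(const S: AnsiString): AnsiString;
var
  i, n: Integer;
  b: Byte;
begin
  Result := '';
  SetLength(Result, 2 * Length(S));   // worst case: every char needs 2 bytes
  n := 0;
  for i := 1 to Length(S) do
  begin
    b := Ord(S[i]);
    if b < $80 then
    begin
      // ASCII range: copied unchanged as a single byte
      Inc(n); Result[n] := Chr(b);
    end
    else
    begin
      // U+0080..U+00FF: lead byte C2 or C3, then a continuation byte
      Inc(n); Result[n] := Chr($C0 or (b shr 6));
      Inc(n); Result[n] := Chr($80 or (b and $3F));
    end;
  end;
  SetLength(Result, n);               // trim to the bytes actually written
end;

var
  Src, Dst: AnsiString;
  i: Integer;
begin
  // Build the input at run time so no compile-time string conversion is involved.
  Src := '';
  SetLength(Src, 2);
  Src[1] := Chr($B6);   // pilcrow
  Src[2] := Chr($E4);   // a-umlaut
  Dst := Latin1ToUTF8(Src);
  for i := 1 to Length(Dst) do
    Write(HexStr(Ord(Dst[i]), 2), ' ');
  WriteLn;              // the four bytes are C2 B6 C3 A4
end.
```

Because this only does bit arithmetic on bytes, it does not depend on the system locale or on which widestring manager is installed.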
-
Phil, thanks for the clear explanation.
I hope we can make the switch to UTF8 on windows too in Lazarus 0.9.26.
-
Thanks Phil for the explanation.
There's still one thing I don't understand:
Three tests:
1. Memo1.text:=UTF8Encode(WideChar($00B6));
2. Memo1.text:=AnsiToUTF8(#$B6);
3. Memo1.text:=#$B6;
On Lazarus 0.9.22 GTK1:
1. shows the ¶
2. shows the ¶
3. shows nothing
But on SVN 11856 GTK1:
1. shows the ¶
2. shows a ?
3. shows nothing
I'm using the same compiler for both versions. The conversion functions are defined in the FPC RTL.
Isn't this strange?
-
> I'm using the same compiler for both versions. The conversion functions are defined in the FPC RTL.
> Isn't this strange?
Check to make sure the 2 functions are returning the same strings:
var
  s : string;
begin
  s := UTF8Encode(WideChar($00B6));
  writeln(byte(s[1]), ' ', byte(s[2]));
  s := AnsiToUTF8(#$B6);
  writeln(byte(s[1]), ' ', byte(s[2]));
end.
You should get 194 ($C2) and 182 ($B6) for both.
I tested it here on Windows and OS X and everything checks out.
-
I'm really getting different results from this using the same compiler:
Lazarus 0.9.22 GTK1:
194 182
194 182
SVN 11856 GTK1:
194 182
63 0
I've installed the stable version as root and the SVN version under ~/
Different environment? But I have no start-scripts. My $LANG is "de_CH.UTF-8" for both.
-
> I'm really getting different results from this using the same compiler:
> Lazarus 0.9.22 GTK1:
> 194 182
> 194 182
> SVN 11856 GTK1:
> 194 182
> 63 0
> I've installed the stable version as root and the SVN version under ~/
> Different environment? But I have no start-scripts. My $LANG is "de_CH.UTF-8" for both.
That certainly looks like a bug in the SVN AnsiToUTF8 function. One last thing to test:
s := AnsiToUTF8(Chr($B6));
Just to make sure the compiler isn't mangling the character stored in the executable.
Please post the sample code when you make your bug report on Mantis.
Thanks.
-
I don't think so because this function is defined in the FPC sources and I have only one version of FPC installed.
Both the stable and the SVN version of Lazarus use the same compiler and RTL/FCL.
-
> I'm really getting different results from this using the same compiler:
> Lazarus 0.9.22 GTK1:
> 194 182
> 194 182
> SVN 11856 GTK1:
> 194 182
> 63 0
Theo,
Did you ever solve this problem?
It appears as though these conversion functions return 63 ("?") if they encounter a character they can't convert. There shouldn't be a problem with #$B6. However, I've discovered that if you include the LCL Translations unit in your app, this uses the cwstring unit, which uses libc for wide string support (assuming LCL was compiled with DisableCWString not defined). I think there's a bug in cwstring with >127 ANSI chars.
Thanks.
-Phil
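One way to check the cwstring suspicion outside of Lazarus might be a small console program that can be built with and without that unit. This is only a sketch: USE_CWSTRING is a made-up define (pass -dUSE_CWSTRING to fpc to enable it), and the result with cwstring is expected to depend on the system locale, so no output is claimed here.

```pascal
program CwStringTest;
{$mode objfpc}{$H+}

// cwstring exists only on Unix-like targets; USE_CWSTRING is a hypothetical
// define used here to switch the unit in and out between two builds.
{$IFDEF UNIX}
{$IFDEF USE_CWSTRING}
uses
  cwstring;
{$ENDIF}
{$ENDIF}

var
  Src: AnsiString;
  Dst: UTF8String;
  i: Integer;
begin
  // Build the one-byte input at run time so the compiler cannot convert it.
  Src := '';
  SetLength(Src, 1);
  Src[1] := Chr($B6);

  Dst := AnsiToUtf8(Src);

  // Print every byte of the result in hex. The thread reports C2 B6 in the
  // good case and 3F ("?") in the bad case.
  for i := 1 to Length(Dst) do
    Write(HexStr(Ord(Dst[i]), 2), ' ');
  WriteLn;
end.
```

If the two builds print different bytes, that would point at the widestring manager rather than at the LCL, and the two outputs would make a compact test case for a bug report.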
-
> Did you ever solve this problem.
No. Because it's not a real problem for me now, just a mystery, and I have no idea where to "attack" ;-)
I'll check the "cwstring" later.
-
> No. Because it's not a real problem for me now, just a mystery, and I have no idea where to "attack" ;-)
I would try asking on the Free Pascal mailing list. Maybe the compiler experts have an idea of what is going wrong.