Author Topic: Issues with new strings of FPC trunk  (Read 27040 times)

wp

  • Hero Member
  • *****
  • Posts: 11858
Re: Issues with new strings of FPC trunk
« Reply #30 on: July 02, 2015, 10:57:42 pm »
Quote
In this case it can be anything and everything under the sun. We are talking about a library that can be used all over the world to read and write spreadsheets; they may contain anything from ANSI text to UTF-8/UTF-16 text. In most cases he has no control over the input charset.
No, it's not that bad: xlsx and ods files are utf8 anyway, xls of Excel97 contains texts as widestrings, the older Excel5 and 2 files do have ansistrings, but they contain a record specifying the encoding. The only issue is with csv files but here I provide a parameter record where users can specify the codepage.

My general observations converting fpspreadsheet from my development environment (Laz trunk + fpc 2.6.4) to fpc-trunk have been very positive: practically everything was working correctly without any change. There's only that formula issue which I mentioned above, and even this one goes away if I compile with UTF8RTL enabled.
« Last Edit: July 02, 2015, 11:43:07 pm by wp »

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Issues with new strings of FPC trunk
« Reply #31 on: July 02, 2015, 11:44:49 pm »
Do more tests. Since the ansistring type is converted to UTF-8 in the LCL, you probably have problems you have not seen yet.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #32 on: July 03, 2015, 06:30:44 pm »
I agree with taazz. If you are using Ansi text data somewhere, you probably have some issues with them.

First of all, when 'UTF-8 in RTL' is activated, the usual FPC/LCL conversion functions most probably no longer work: AnsiToUTF8 / UTF8ToAnsi and SysToUTF8 / UTF8ToSys.

According to the documentation (see http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#RTL_with_default_codepage_UTF-8), these functions must be replaced by:  WinCPToUTF8 / UTF8ToWinCP.

So, add LazUTF8 to the 'uses' clause (if not already present), and replace all these old conversion functions. That doesn't seem like a big deal, after all.

But it's not as simple ...

Hereafter, an illustration of what I'm beginning to call the Ansi nightmare ...


1/ The test program

Preliminary: this sample is intended to work only with FPC 3.0+.

Start a new Lazarus project with an edit box, a check box and a push button (I'm attaching such a sample project).

Here is the interesting part of the code:
Code: [Select]
uses
  LazUTF8, Windows;

procedure TForm1.Button1Click(Sender: TObject);
var s: string;
var ws: widestring;
var UnicodeEnabledOS: boolean;
begin
  UnicodeEnabledOS := CheckBox1.Checked;
  //
  s := UTF8ToWinCP(Edit1.Text);
  if UnicodeEnabledOS then
    begin
      ws := widestring(s);
      MessageBoxW(0,PWideChar('Hello '+ws+' !'),'Greetings W',0);
    end
  else
    MessageBoxA(0,PChar('Hello '+s+' !'),'Greetings A',0);
end;

The code might seem a bit 'strange'; in real code, it would be written differently.

But for my demonstration purposes, I was of course forced to choose my instructions and their order carefully. Anyway, even if it's 'strange', it's certainly not incorrect.


2/ Tests with No 'UTF-8 in RTL'

Everything is OK.

You can verify it by running the project, entering some non-ASCII characters in the edit box (like "Ändern", for instance) and pressing the push button. For both the Ansi and the Unicode cases, the message box text is OK.


3/ Tests with 'UTF-8 in RTL' activated

This time, it's OK with the Ansi API (by chance, I'm tempted to say), but not with the Unicode (i.e. wide) API, even though the recommended "UTF8ToWinCP" function has been used and returns the correct data.

Ahhh, yes ! Currently, the LCL is lying for (almost ?) all of its conversion functions. Concretely, it means that the code page returned for the function result is incorrect (as for UTF8ToConsole earlier in this topic).

It's not a problem. It might be fixed in the LCL in the future, but for now we can easily fix it ourselves. Just add the following instruction:
Code: [Select]
...
  s := UTF8ToWinCP(Edit1.Text);
  SetCodepage(rawbytestring(s), GetACP, false);     //  <--- Add me
  if UnicodeEnabledOS then
...

The Unicode case is now working properly: great, we've finally made it !

Ehhh, wait a minute... The Ansi case is no longer working !

We are supposed to have just 'fixed' a potential issue. The data returned by UTF8ToWinCP function are OK; we have now set the correct code page for these data, and it's not working. Uhhh ?

Unfortunately, from the compiler's point of view I'm afraid it's quite "logical and correct", if you look closer at the source code. Of course, as I've written before, I've carefully chosen my sample ...

Just a clue for the explanation of this last problem. Make a test by just modifying the Ansi API call:
Code: [Select]
...
//    MessageBoxA(0,PChar('Hello '+s+' !'),'Greetings A',0);  //  Modify
    MessageBoxA(0,PChar(s),'Greetings A',0);                  //    me
...

And yes, it's working correctly again (the Ansi case, I mean), and it's also "logical and correct".
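
One possible reading of this clue, sketched as a console program rather than taken from the attached project: once s is labelled with the ANSI code page, the concatenation 'Hello '+s+' !' produces a new string in the default code page, which is UTF-8 under 'UTF-8 in RTL', whereas PChar(s) hands the original bytes over untouched. The sketch assumes FPC 3.0+ on Windows with an active code page of 1252, and it imitates the option with SetMultiByteConversionCodePage(CP_UTF8); that this is what the LCL does at startup is an assumption.

```pascal
program ConcatCodePageSketch;
{$mode objfpc}{$H+}

var
  s, t: AnsiString;
begin
  // Assumption: roughly the effect of 'UTF-8 in RTL' on the RTL default code page.
  SetMultiByteConversionCodePage(CP_UTF8);

  s := #$C4 + 'ndern';                         // "Ändern" as CP1252 bytes
  SetCodePage(RawByteString(s), 1252, False);  // relabel only, bytes are unchanged

  // The result variable has static code page CP_ACP, which now means UTF-8,
  // so s may be re-encoded while the result is built.
  t := 'Hello ' + s;

  WriteLn('s: CP=', StringCodePage(s), ' bytes=', Length(s));
  WriteLn('t: CP=', StringCodePage(t), ' bytes=', Length(t));
end.
```

Comparing the two lines shows whether the concatenation re-encoded the text. If it did, PChar(t) would hand UTF-8 bytes to MessageBoxA, which expects the ANSI code page, while PChar(s) would still pass the ANSI bytes.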


4/ My conclusions

Use only UTF8 text data when activating the 'UTF-8 in RTL' LCL option. By definition this concerns only the 1-byte string variables ("string", "ansistring", "utf8string", ...), of course.
or
Don't activate the 'UTF-8 in RTL' LCL option if you plan to use ANSI text data, etc.

Unless you're perfectly familiar (which is not my case) with the Free Pascal code page support, how it's used in the LCL, and all the consequences of this usage for the 1-byte strings.

.
« Last Edit: July 03, 2015, 09:47:46 pm by ChrisF »

wp

  • Hero Member
  • *****
  • Posts: 11858
Re: Issues with new strings of FPC trunk
« Reply #33 on: July 03, 2015, 10:05:04 pm »
Chris, thank you for these detailed analyses and recommendations. It may take some time until I understand everything...

In the meantime I could pin-point my fpspreadsheet issue by querying the code pages along the lifetime of the strings under investigation ("StringCodepage()"): The problem with non-UTF8RTL mode occurs when a non-ansi string is read from an xls8 file where it is stored as UTF-16. Immediately after reading I do a conversion to UTF8 by calling UTF8Encode:

Code: [Select]
function TsSpreadBIFF8Reader.ReadString_8bitLen(AStream: TStream): String;
const
  HAS_8BITLEN = true;
var
  wideStr: widestring;
begin
  wideStr := ReadWideString(AStream, HAS_8BITLEN);  // is CP1200 (UTF-16)
  Result := UTF8Encode(wideStr);      // Result gets CP 65001 (UTF-8)
end;

Since this function result is used as an argument of a spreadsheet function it later has to be enclosed by quotation marks (that's an Excel requirement), i.e. I am calling
Code: [Select]
var
  s: String;
  arg: String;
...
  s := ReadString_8bitLen(AStream);  // this is CP65001 - see above
  arg := '"' + s + '"';              // '"' is CP_ACP, arg gets CP1252
While I still believed that arg is a utf8string, this observation means that the utf8string has been destroyed: it has been converted to a CP-1252 string. Why the hell is this happening?

In total: Very mysterious! Chris, I think I am experiencing the same ansi nightmare by myself.

Let me repeat my configuration: Laz 1.5, fpc 3.1.1, -dEnableUTF8RTL and -FcUTF8 are NOT set, Windows 7 64-bit, but Laz runs as 32-bit.

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #34 on: July 03, 2015, 10:41:14 pm »
I'm not quite sure, but looking at the last part of your source code, it seems to be "normal" to me (unless you've added {$codepage utf8}, but I guess you haven't).

Even if [s := ReadString_8bitLen(AStream)] is UTF8 encoded, [arg :='"' + s + '"'] is IMHO always ANSI encoded (which seems to be what you've found).

As I've already written (sorry, if you already know that):

non-UTF8RTL
2 types of encoding by default
. string = ansistring : ANSI encoding (i.e. CP_ACP=0 -> GetACP = 1252 in Western Europe),
. utf8string : UTF8 encoding (i.e. CP_UTF8 = 65001).

UTF8RTL
only 1 type of encoding by default
. string = ansistring = utf8string : UTF8 encoding (because CP_ACP=0  now -> CP_UTF8).


So, back to your sample:

- s is a string (ANSI encoded by default), which you use to store UTF8 data, but in fact you are lying, because s is a string = ansistring, and not an utf8string,

- arg is also a string (also ANSI encoded by default), so the result will always be ANSI encoded anyway.


I can see 2 problems in your sample (but I'm not an expert):

- you should declare s as an utf8string, if it's supposed to hold UTF8 text data,

- if you also want an UTF8 encoding for arg, also declare it as an utf8string. If not, whatever the code page of s, your expression (i.e. '"' + s + '"') will always be converted to the code page of arg (which is ANSI).
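
The two suggestions above can be sketched as a small console program. This is only a sketch, assuming FPC 3.0+ and the non-UTF8RTL setup; ReadAsUTF8 is a hypothetical stand-in for ReadString_8bitLen and is not part of fpspreadsheet.

```pascal
program Utf8ArgSketch;
{$mode objfpc}{$H+}

// Hypothetical stand-in for ReadString_8bitLen: it returns UTF8String, not String.
function ReadAsUTF8(const W: WideString): UTF8String;
begin
  Result := UTF8Encode(W);
end;

var
  s, arg: UTF8String;   // static code page of both variables is CP_UTF8
begin
  s := ReadAsUTF8('test');
  arg := '"' + s + '"';  // the concatenation targets the code page of arg (UTF-8)

  WriteLn('CP of s:   ', StringCodePage(s));
  WriteLn('CP of arg: ', StringCodePage(arg));
end.
```

Changing the declaration of arg back to String and running it again is a quick way to see the difference on a given system.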


Finally, here is what I mean by lying and then getting trouble with the compiler.

Sample inspired from your own code (still for non-UTF8RTL):

Code: [Select]
procedure TForm1.Button1Click(Sender: TObject);
var ws: widestring;
var s: string;
begin
  ws := 'test';
  s := UTF8Encode(ws);
  ShowMessage('CP=' + IntToStr(StringCodePage(s)));
  s := s + ' finished';
  ShowMessage('CP=' + IntToStr(StringCodePage(s)));
end;

The first value is CP_UTF8=65001, but the second one is back to 1252 again (if 1252 is your Windows active code page).


** EDIT **

As it's probably not very clear, I've forgotten to say that there are 2 code page values in fact, for a string variable: a static and a dynamic one (see http://wiki.freepascal.org/FPC_Unicode_support#Static_code_page).

Without knowing that, my former remarks don't make any sense ...

.
« Last Edit: July 03, 2015, 11:16:25 pm by ChrisF »

rtusrghsdfhsfdhsdfhsfdhs

  • Full Member
  • ***
  • Posts: 162
Re: Issues with new strings of FPC trunk
« Reply #35 on: July 03, 2015, 11:22:08 pm »
Does this work with UnicodeString??

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #36 on: July 03, 2015, 11:31:43 pm »
Quote from: wp
Chris, thank you for these detailed analyses and recommendations.[...]

I don't claim that they are general recommendations; they are just some personal conclusions I've reached during my tests.
« Last Edit: July 03, 2015, 11:53:16 pm by ChrisF »

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #37 on: July 03, 2015, 11:35:55 pm »
Quote
Does this work with UnicodeString??

Sorry, I'm not sure what you mean by "this": all our problems, you mean ?

Anyway, the unicodestring type is not a 1-byte string type but a 2-byte one: see http://www.freepascal.org/docs-html/ref/refsu13.html.

So unicodestrings are not affected by any code page troubles (because they don't have any code page).

Therefore, one could say that "this" is working indeed, but I'm afraid that it's not applicable in our cases.
« Last Edit: July 03, 2015, 11:42:46 pm by ChrisF »

wp

  • Hero Member
  • *****
  • Posts: 11858
Re: Issues with new strings of FPC trunk
« Reply #38 on: July 03, 2015, 11:45:41 pm »
Thank you, good to have your experience along my way of trying to understand these things!

fpspreadsheet consistently uses "string" which is meant to be understood as "utf8string". Suppose I replace every "string" by "utf8string". What happens if a user of the library writes a "string" into a spreadsheet cell? Or reads a cell string and assigns it to a "string"? I guess he may see the same mess, but I guess that he will see it also if I don't do the replacement.

What if some time in the future fpc will switch to utf-16 strings? Then I'll have to do the replacement again - using "string" promises better continuity.

My feeling is it would be best to avoid the non-UTF8RTL case. If I activate this mode for the fpspreadsheet packages, what happens if they are used in a non-UTF8RTL project? And another question comes up: fpc 2.6.4 does not compile with EnableUTF8RTL. So, enabling it would force users to the new fpc. Or is there a way to IFDEF it somewhere?

wp

  • Hero Member
  • *****
  • Posts: 11858
Re: Issues with new strings of FPC trunk
« Reply #39 on: July 03, 2015, 11:58:58 pm »
OMG - my head is spinning...

I modified your example from above and, using Lazarus' character map, I added some Japanese characters which definitely are not on my 1252 code page. But these characters are displayed correctly. How is that?

Code: [Select]
procedure TForm1.Button1Click(Sender: TObject);
var
  ws: widestring;
  s: string;
begin
  ws := 'test';
  s := UTF8Encode(ws);
  ShowMessage(s + #13 + 'CP=' + IntToStr(StringCodePage(s)));
  s := s + ' finished ァィイゥウエ';
  ShowMessage(s + #13 + 'CP=' + IntToStr(StringCodePage(s)));
end;

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #40 on: July 04, 2015, 12:09:20 am »
Quote from: wp
fpspreadsheet consistently uses "string" [...]

That's quite a bunch of good questions, especially those concerning the link between packages and projects. Sorry, I'm afraid I don't have any pertinent answer ...

Concerning your last question about an IFDEF, I'm not sure I fully understand it.

Currently, when you use the 'set UTF-8 in RTL', as indicated it just internally adds 2 options to the project:
-FcUTF8 (in Other) for the compiler (i.e. source is UTF8 encoded)
and
-dEnableUTF8RTL (in  Additions and Overrides): a define used inside the LCL.

So you can use in your conditional instructions:

- the Free Pascal version (i.e. FPC_FULLVERSION) like
Code: [Select]
{$if FPC_FULLVERSION >= 20701}
  // code page support possible here...
{$else}
  // no code page support
{$endif}

- the 'EnableUTF8RTL' define (which only makes sense for FPC_FULLVERSION >= 020701)
Code: [Select]
{$ifdef EnableUTF8RTL}
  // UTF-8 in RTL "activated" for the LCL
{$else}
  // UTF-8 in RTL not "activated" for the LCL
{$endif}

Is that what you mean ?
« Last Edit: July 04, 2015, 12:21:27 am by ChrisF »

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #41 on: July 04, 2015, 12:28:15 am »
Quote from: wp
[...]I modified your example from above and, using Lazarus' character map, I added some Japanese characters which definitely are not on my 1252 code page. But these characters are displayed correctly. How is that?[...]

Sorry, currently I can't see any good reason (if string means ansistring, of course). Maybe someone else has an answer ...

Just for being sure, you've not activated the 'UTF-8 in RTL' option during your test (and there is no '-dEnableUTF8RTL' in the Additions and Overrides section of your project) ?

*** Edit **

Same thing here with Cyrillic characters.

My naive interpretation is that the Dialogs unit, and especially the ShowMessage function, still always expects UTF8 data as input (which is indeed the case here, even if the code page is falsely reported as 1252).
« Last Edit: July 04, 2015, 12:48:18 am by ChrisF »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: Issues with new strings of FPC trunk
« Reply #42 on: July 04, 2015, 12:55:40 am »
Quote from: wp
Chris, thank you for these detailed analyses and recommendations. It may take some time until I understand everything...

In the meantime I could pin-point my fpspreadsheet issue by querying the code pages along the lifetime of the strings under investigation ("StringCodepage()"): The problem with non-UTF8RTL mode occurs when a non-ansi string is read from an xls8 file where it is stored as UTF-16. Immediately after reading I do a conversion to UTF8 by calling UTF8Encode:

Code: [Select]
function TsSpreadBIFF8Reader.ReadString_8bitLen(AStream: TStream): String;
const
  HAS_8BITLEN = true;
var
  wideStr: widestring;
begin
  wideStr := ReadWideString(AStream, HAS_8BITLEN);  // is CP1200 (UTF-16)
  Result := UTF8Encode(wideStr);      // Result gets CP 65001 (UTF-8)
end;

Only put strings with a given encoding in the appropriate type. Keep in mind that while the implementation works at runtime, the type system semantics are as if there were many different ansistring types, each with a different encoding.

In this case, since utf8 rtl hack is off, (string=ansistring=ansistring(0)) is not UTF8, so you need to return utf8string.

If the utf8 rtl hack is on, string has the same encoding as utf8string (but is still not the same type, but that is detected and conversions will be ok).
 
So you need to start flagging parts that are always utf8 as utf8string if utf8 rtl hack is off.

Note that since they are different types, passing them to a routine that accepts a string will trigger an utf8->ACS conversion.
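
A minimal console sketch of that last note, assuming FPC 3.0+ with the utf8 rtl hack off; ShowAnsi and ShowUTF8 are hypothetical names used only for illustration.

```pascal
program ParamConversionSketch;
{$mode objfpc}{$H+}

// Takes the default string type (static code page CP_ACP).
procedure ShowAnsi(const S: AnsiString);
begin
  WriteLn('inside ShowAnsi: CP=', StringCodePage(S));
end;

// Takes UTF8String, so a UTF8String argument needs no type conversion.
procedure ShowUTF8(const S: UTF8String);
begin
  WriteLn('inside ShowUTF8: CP=', StringCodePage(S));
end;

var
  ws: WideString;
  u: UTF8String;
begin
  ws := 'test';
  u := UTF8Encode(ws);
  ShowUTF8(u);   // same type on both sides
  ShowAnsi(u);   // different types: the argument may be converted to the default code page
end.
```

The code page printed inside ShowAnsi depends on the default system code page of the machine, so the two lines only differ where that default is not UTF-8.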

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #43 on: July 04, 2015, 01:06:57 am »
Quote from: marcov
[...]  an utf8->ACS conversion. [...]

Sorry, I didn't catch what you mean by "utf8->ACS conversion". Could you explain it, please ?
« Last Edit: July 04, 2015, 01:30:38 am by ChrisF »

wp

  • Hero Member
  • *****
  • Posts: 11858
Re: Issues with new strings of FPC trunk
« Reply #44 on: July 04, 2015, 01:09:17 am »
Quote from: ChrisF
Is that what you mean ?
The problem is (I should have been more specific) that there is this condition in LazUTF8 which aborts compilation with fpc 2.6.4:

Code: [Select]
{$IF defined(EnableUTF8RTL) and (FPC_FULLVERSION<20701)}
  {$error UTF8 in RTL requires fpc 2.7.1+}
{$ENDIF}

To avoid this I had the idea to define "EnableUTF8RTL" inside an IFDEF which checks the compiler version.

That UTF8RTL leads to recompilation of all packages needed by the project probably answers the other question about package-wise UTF8RTL: no - it must be consistent throughout the project; it is not possible to "seal" UTF8RTL within one  package while having it off in other packages or the project.

 
