
Author Topic: Issues with new strings of FPC trunk  (Read 27134 times)

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #15 on: July 02, 2015, 02:32:04 am »
Quote:
This is the same case as 2.6.4: all code in the Lazarus IDE is considered UTF-8. Why does it work with 2.6.4 on Windows, but not with 3.1.1 on the same Windows installation?


Free Pascal 2.6.x doesn't really care about 1-byte strings:  string, ansistring, utf8string ... When I say it doesn't care, I mean it doesn't care about the data inside these types of string variables. You (the LCL) have to do the job of "interpreting", converting, etc. these data.


But with Free Pascal 3.0+, it DOES care about the data. I mean the text data inside a utf8string is really supposed to be UTF-8 encoded: like "C3 84 6E 64 65 72 6E" for "Ändern", for instance. The same holds for any other encoding.

Because this time, Free Pascal 3.0+ will also do the job of "interpreting", converting, etc. these data.
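A minimal sketch of that automatic conversion, assuming FPC 3.0+ and a source file saved as UTF-8. The names TOemString and HexBytes are made up for the illustration, and on Unix-like systems a widestring manager (the cwstring unit) is assumed to be needed for the conversion to actually happen:

```pascal
program assignconv;

{$mode objfpc}{$H+}
{$codepage utf8}   // assumption: this source file is saved as UTF-8

{$ifdef unix}
uses
  cwstring;        // widestring manager used for code page conversions on Unix
{$endif}

type
  // Hypothetical name: a string type whose static code page is the OEM one
  TOemString = type AnsiString(CP_OEMCP);

// Returns the bytes of s as hex. RawByteString parameters accept any
// 1-byte string without triggering a conversion.
function HexBytes(const s: RawByteString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
    Result := Result + HexStr(Ord(s[i]), 2) + ' ';
end;

var
  u: UTF8String;
  o: TOemString;
begin
  u := 'Ändern';
  o := u;   // FPC 3.0+ inserts a UTF-8 -> OEM conversion at this assignment
  WriteLn('utf8string bytes: ', HexBytes(u));
  WriteLn('oem string bytes: ', HexBytes(o));
end.
```

With 2.6.x the same assignment would just copy the bytes; with 3.0+ the two lines are expected to show different byte sequences for the "Ä".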

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Issues with new strings of FPC trunk
« Reply #16 on: July 02, 2015, 02:39:31 am »
Quote from: ChrisF
Because this time, Free Pascal 3.0+ will also do the job of "interpreting", converting, etc. these data.

Not writeln. Writeln is supposed to take what it was given and output it as is. Keep in mind that, although this is a compile-time compiler-magic function, it has to be usable with files, consoles and all other modes.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #17 on: July 02, 2015, 03:14:39 am »
Quote from: taazz
Not writeln. Writeln is supposed to take what it was given and output it as is.

Well, honestly I just don't know; and I'm sure you know it better than I do.

My interpretation is that it's not really related to writeln: it's just that writeln inputs are supposed to be of OEM type, so the conversion is done anyway, before the writeln call. But it's only my interpretation...

(*edit* Or else afterwards, when the Windows API is called)


As a complement for the non-utf8rtl case, the wrong result for UTF8ToConsole is once again due to an incorrect code page for the result of this Lazarus function.

Still my same test code, with an additional UTF8ToConsoleExt function, which does almost nothing other than calling UTF8ToConsole:
Code:
program project1;

{$mode objfpc}{$H+}
{$CODEPAGE UTF8}

uses
  {$IFDEF UNIX}{$IFDEF UseCThreads}
  cthreads,
  {$ENDIF}{$ENDIF}
  Classes, SysUtils, LazUTF8
  { you can add units after this };

type
  stroem = type ansistring(CP_OEMCP);

function dispcode(const s: rawbytestring): string;
var i: integer;
begin
  result:='  [ Code: ';
  for i:=1 to length(s) do
    result:=result+inttostr(ord(s[i]))+' ';
  result:=result+' ]';
end;

function UTF8ToConsoleExt(Const S: utf8string): rawbytestring;
begin
  result := UTF8ToConsole(S);
  setCodePage(result, CP_OEMCP, false);
end;

var
  s: string; //utf8String;
  soe: stroem;
begin
  s := 'Ändern';
  WriteLn(s+dispcode(s));
  WriteLn('UTF8ToConsole = ', UTF8ToConsole(s)+dispcode(UTF8ToConsole(s)));
  WriteLn('UTF8ToConsoleE= ', UTF8ToConsoleExt(s)+dispcode(UTF8ToConsoleExt(s)));
  WriteLn('UTF8ToAnsi    = ', UTF8ToAnsi(s)+dispcode(UTF8ToAnsi(s)));
  WriteLn('UTF8ToSys     = ', UTF8ToSys(s));
  WriteLn('UTF8ToWinCP   = ', UTF8ToWinCP(s));
  soe := s;
  WriteLn('oem           = ', soe+dispcode(soe));
  ReadLn;
end.


And the results:
Code:
Ändern  [ Code: 195 132 110 100 101 114 110  ]
UTF8ToConsole = Zndern  [ Code: 142 110 100 101 114 110  ]
UTF8ToConsoleE= Ändern  [ Code: 142 110 100 101 114 110  ]
UTF8ToAnsi    = Ändern  [ Code: 196 110 100 101 114 110  ]
UTF8ToSys     = Ändern
UTF8ToWinCP   = Ändern
oem           = Ändern  [ Code: 142 110 100 101 114 110  ]

As one can see, the data returned by UTF8ToConsole and by UTF8ToConsoleExt are identical (and also identical to the data of the ansistring(CP_OEMCP) variable, BTW), but the display is correct only for UTF8ToConsoleExt.

By changing the code page of the result to the "correct" one, this time Free Pascal is not fooled anymore, and so the display is OK.
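To make the "same data, different label" point explicit, here is a small sketch, assuming FPC 3.0+. It uses the numeric code page 850 instead of CP_OEMCP purely as an assumption for the demo (byte 142 is "Ä" there), and HexBytes is a made-up helper:

```pascal
program relabel;

{$mode objfpc}{$H+}

const
  // OEM bytes for "Ändern", taken from the output above
  OemBytes: array[1..6] of Byte = (142, 110, 100, 101, 114, 110);

// Returns the bytes of s as hex, without any conversion
function HexBytes(const s: RawByteString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
    Result := Result + HexStr(Ord(s[i]), 2) + ' ';
end;

var
  raw: RawByteString;
  i: Integer;
begin
  // Build the string byte by byte so that no literal conversion is involved
  SetLength(raw, Length(OemBytes));
  for i := 1 to Length(OemBytes) do
    raw[i] := Chr(OemBytes[i]);

  WriteLn('before: code page = ', StringCodePage(raw), '  bytes = ', HexBytes(raw));

  // Convert = False: only the code page label changes, the bytes stay as they are
  SetCodePage(raw, 850, False);

  WriteLn('after : code page = ', StringCodePage(raw), '  bytes = ', HexBytes(raw));
end.
```

Both lines print the same bytes; only the reported code page differs. That label is what Free Pascal looks at when it decides whether a conversion is needed.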
« Last Edit: July 02, 2015, 03:42:50 am by ChrisF »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: Issues with new strings of FPC trunk
« Reply #18 on: July 02, 2015, 08:54:52 am »
Quote from: ChrisF
Free Pascal 2.6.x doesn't really care about 1-byte strings: string, ansistring, utf8string ... When I say it doesn't care, I mean it doesn't care about the data inside these types of string variables.

It actually does, but that only shows in the conversion to two-byte string types, because there are no language-supported 1-byte encoding conversions in 2.6.x.

 

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #19 on: July 02, 2015, 01:40:29 pm »
And finally, in case of utf8rtl, the error is coming directly from the UTF8ToConsole function in winlazutf8.inc.

UTF8ToSys is called before the Windows API CharToOEM, in order to convert the UTF8 string into an ANSI one. This call fails: see http://forum.lazarus.freepascal.org/index.php/topic,28891.0.html.

So:
- replace the UTF8ToSys(s) call with a UTF8ToWinCP(s) one,
- don't forget to change the code page of the result.

Here is the whole modified UTF8ToConsole function, which seems to work in both cases (no utf8rtl, and utf8rtl):
Code:
function UTF8ToConsole(const s: string): string;
{$ifNdef WinCE}
var
  Dst: PChar;
{$endif}
begin
  {$ifdef WinCE}
  Result := UTF8ToSys(s);
  {$else}
  {$if FPC_FULLVERSION >= 20701}
  Result := UTF8ToWinCP(s);     // Modif1: Rather than UTF8ToSys(s);
  {$else}
  Result := UTF8ToSys(s);       // Kept for compatibility
  {$endif}
  Dst := AllocMem((Length(Result) + 1) * SizeOf(Char));
  if CharToOEM(PChar(Result), Dst) then
    Result := StrPas(Dst);
  FreeMem(Dst);
  {$endif}
  {$if FPC_FULLVERSION >= 20701}
  SetCodePage(rawbytestring(result), CP_OEMCP, false); // Modif2: Added
  {$endif}
end;
« Last Edit: July 02, 2015, 02:01:03 pm by ChrisF »

wp

  • Hero Member
  • *****
  • Posts: 11916
Re: Issues with new strings of FPC trunk
« Reply #20 on: July 02, 2015, 01:45:54 pm »
Wow! Perfect!

If you don't mind I'll add a link to your post as a solution to my bug report.

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #21 on: July 02, 2015, 02:03:33 pm »
Quote from: wp
If you don't mind I'll add a link to your post as a solution to my bug report.

Sorry, the first version was incomplete: I've added some conditional stuff for old versions of FPC  (i.e. up to 2.6.4).

wp

  • Hero Member
  • *****
  • Posts: 11916
Re: Issues with new strings of FPC trunk
« Reply #22 on: July 02, 2015, 03:04:16 pm »
This solves the UTF8ToConsole issue and brings me back to the actual problem that I see in fpspreadsheet. The issue occurs in the unit test "formulatests" of the Excel function "LEFT("Ändern",3)" which should return the first 3 characters of the string "Ändern"; a string mismatch is reported for the argument passed to the LEFT function if the test suite is compiled with fpc 3.1.1 (EnableUTF8RTL off). Again no problem with fpc 2.6.4 and with fpc 3.1.1 (having EnableUTF8RTL active).

Maybe I can find a simple demonstration which does not need fpspreadsheet. In the meantime, I am asking for your help with the next issue:

In ChrisF's experiments, adding {$CODEPAGE UTF8} yields a considerable improvement (unless UTF8ToConsole is involved, which is not the case here). Therefore, I began adding this declaration to the units. But if I do this with the unit xlsbiff8.pas, the test program does not compile any more, complaining about

Code:
xlsbiff8.pas(2521,54) Error: Call by var for arg no. 3 has to match exactly: Got "XLSBIFF8.AnsiString" expected "SYSTEM.AnsiString"
The offending line is
Code:
var
  target, bookmark: String; 
...
  SplitHyperlink(AHyperlink^.Target, target, bookmark);

"SplitHyperlink" is declared as
Code:
procedure SplitHyperlink(AValue: String; out ATarget, ABookmark: String);

Argument 3 is a string in both cases, and unit XLSBiff8 does not declare its own "ansistring". What is going on here?

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: Issues with new strings of FPC trunk
« Reply #23 on: July 02, 2015, 03:15:55 pm »
Perhaps you'll have to change each string declaration explicitly to "xxx: ansistring" (or even "system.ansistring")?

wp

  • Hero Member
  • *****
  • Posts: 11916
Re: Issues with new strings of FPC trunk
« Reply #24 on: July 02, 2015, 03:23:34 pm »
I fear so, but that's a no-go: these changes would probably propagate out of the package into user code.

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #25 on: July 02, 2015, 04:37:11 pm »
Quote from: wp
[...] In ChrisF's experiments, adding {$CODEPAGE UTF8} yields a considerable improvement (unless UTF8ToConsole is involved which is not the case here). Therefore, I began adding this declaration to the units. [...]

This was only for my demonstration purposes. It's only a workaround, and I wouldn't recommend using it in "real" code (I'm sure there are probably a lot of side effects with this).

Anyway, it's not working in your code because in this case "string" is no longer an "ansistring" but a "utf8string" (i.e. with code page CP_UTF8 = 65001). This is limited to the unit concerned, of course.
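The exact-match rule for var/out parameters can be reproduced without fpspreadsheet. This is only a sketch: SplitAtHash is a hypothetical stand-in for SplitHyperlink, and UTF8String is used explicitly to play the role of a "string" whose code page differs from the one the procedure was compiled with:

```pascal
program outparamdemo;

{$mode objfpc}{$H+}

// Hypothetical stand-in for SplitHyperlink: splits AValue at the first '#'
procedure SplitAtHash(const AValue: AnsiString; out ATarget, ABookmark: AnsiString);
var
  p: Integer;
begin
  p := Pos('#', AValue);
  if p = 0 then
  begin
    ATarget := AValue;
    ABookmark := '';
  end
  else
  begin
    ATarget := Copy(AValue, 1, p - 1);
    ABookmark := Copy(AValue, p + 1, Length(AValue) - p);
  end;
end;

var
  target, bookmark: UTF8String;        // static code page differs from the parameters
  tmpTarget, tmpBookmark: AnsiString;  // exact match for the out parameters
begin
  // The direct call is expected to be rejected with "Call by var ... has to
  // match exactly", because out/var arguments allow no implicit conversion:
  // SplitAtHash('file.xls#Sheet1', target, bookmark);

  // Workaround: call with exactly matching temporaries, then assign.
  SplitAtHash('file.xls#Sheet1', tmpTarget, tmpBookmark);
  target := tmpTarget;        // any code page conversion happens here
  bookmark := tmpBookmark;
  WriteLn(target, ' | ', bookmark);
end.
```

Value and const parameters don't have this problem, since the compiler can convert the argument on the way in; only var and out parameters need the identical type.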
« Last Edit: July 02, 2015, 04:38:59 pm by ChrisF »

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #26 on: July 02, 2015, 05:07:36 pm »
Quote from: wp
This solves the UTF8ToConsole issue and brings me back to the actual problem that I see in fpspreadsheet. [...] a string mismatch is reported for the argument passed to the LEFT function if the test suite is compiled with fpc 3.1.1 (EnableUTF8RTL off). Again no problem with fpc 2.6.4 and with fpc 3.1.1 (having EnableUTF8RTL active).


I don't know what your code looks like, but as far as I've tested (only some basic tests), my recommendations would be (if the current situation doesn't evolve):

1/ Questions to keep in mind when you are coding with FPC 3.0+:

 . what is the (static) code page of my string variable (any 1-byte kind of string, like "string", "ansistring", "utf8string", ...) ? Which includes: what does CP_ACP(=0) really mean in my case (mainly in the no-utf8rtl and utf8rtl cases) ?
 . what is the type of the text data encoding inside my string variable ?

As soon as you're lying (i.e. code page <> real text data encoding), you have a good chance of getting into big trouble with Free Pascal, sooner or later (especially at run-time).
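A quick way to answer the first question at run-time, as a sketch assuming FPC 3.0+ (the printed values depend on the platform, the system settings and whether the utf8rtl mode is active, so no particular output is implied):

```pascal
program inspectcp;

{$mode objfpc}{$H+}

var
  s: string;       // plain "string": static code page CP_ACP
  u: UTF8String;   // static code page CP_UTF8
begin
  s := 'abc';
  u := 'abc';

  // What CP_ACP currently stands for in this process
  WriteLn('DefaultSystemCodePage   = ', DefaultSystemCodePage);

  // The code page label carried by each string's data
  WriteLn('code page of string     = ', StringCodePage(s));
  WriteLn('code page of utf8string = ', StringCodePage(u));
end.
```

This only shows the label, of course. Whether the bytes really are in that encoding is the second question, and nothing in the RTL checks it for you.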


2/ Clearly, the new version of Lazarus (Lazarus 1.5+ with FPC 3.0+) is definitely Unicode-oriented: UTF-8, to be precise. So, don't use anything other than utf8 strings in your code.

If you have to deal with "external" text data encoded differently, convert them to UTF-8 immediately after getting them from the "external" source, and convert them back just before putting them back to the "external" source. And don't use anything other than these utf8 strings in your code.
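A sketch of such a conversion at the boundary, assuming FPC 3.0+. ExternalToUTF8 and HexBytes are hypothetical helpers, code page 1252 is just an example for the "external" encoding, and on Unix-like systems the cwstring unit is assumed to be needed for the conversion:

```pascal
program boundary;

{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cwstring;   // widestring manager used for code page conversions on Unix
{$endif}

const
  // "Ändern" in Windows-1252, as in the UTF8ToAnsi output earlier in the thread
  Cp1252Bytes: array[1..6] of Byte = (196, 110, 100, 101, 114, 110);

// Hypothetical helper: Raw holds bytes from an external source,
// SourceCP is the code page those bytes are really encoded in.
function ExternalToUTF8(const Raw: RawByteString; SourceCP: TSystemCodePage): UTF8String;
var
  Tmp: RawByteString;
begin
  Tmp := Raw;
  SetCodePage(Tmp, SourceCP, False);  // 1. label only: state the real encoding
  SetCodePage(Tmp, CP_UTF8, True);    // 2. convert the bytes themselves to UTF-8
  Result := Tmp;
end;

// Returns the bytes of s as hex, without any conversion
function HexBytes(const s: RawByteString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
    Result := Result + HexStr(Ord(s[i]), 2) + ' ';
end;

var
  raw: RawByteString;
  i: Integer;
begin
  SetLength(raw, Length(Cp1252Bytes));
  for i := 1 to Length(Cp1252Bytes) do
    raw[i] := Chr(Cp1252Bytes[i]);

  WriteLn('external bytes: ', HexBytes(raw));
  WriteLn('utf-8 bytes   : ', HexBytes(ExternalToUTF8(raw, 1252)));
end.
```

The opposite direction (just before writing back) would be the same two-step idea with the code pages swapped. For very large files this is an extra pass over the data, so the cost is a fair concern.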


It may sound "hard", but I'm afraid that expecting no issues when porting existing code from Lazarus 1.4/Free Pascal 2.6.4 to Lazarus 1.5+/Free Pascal 3.0+ is, IMHO, unrealistic. Not unless you never, ever used anything other than utf8 strings and text data in your code.

However, please consider that they are only recommendations from a "Unicode newbie" ...
« Last Edit: July 02, 2015, 07:22:24 pm by ChrisF »

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Issues with new strings of FPC trunk
« Reply #27 on: July 02, 2015, 07:56:15 pm »
Quote from: ChrisF
1/ Questions to keep in mind when you are coding with FPC 3.0+:

 . what is the (static) code page of my string variable [...] ?
 . what is the type of the text data encoding inside my string variable ?

As soon as you're lying (i.e. code page <> real text data encoding), [...]

In this case it can be anything and everything under the sun. We are talking about a library that can be used all over the world to read and write spreadsheets, which may contain anything from ANSI text to UTF-8/UTF-16 text. In most cases he has no control over the input charset.

Quote from: ChrisF
2/ [...] So, don't use anything other than utf8 strings in your code.

If you have to deal with "external" text data encoded differently, convert them immediately to UTF-8 [...]

That is a no-go. Sorry, my import routines open, read and insert ANSI text files; the smallest one is 700 MB, the biggest one 1.5 GB, and the average size is around 835 MB. I'm not going to add a conversion from ANSI to UTF-8 and from UTF-8 to ANSI just to make sure that Lazarus can handle my data files correctly. I'll not upgrade to Lazarus 1.5; I'll stay with 1.4 and upgrade FPC to 3.0. I'll have some problems in the beginning, but my clients will never know about it.


Quote from: ChrisF
It may sound "hard", but I'm afraid that expecting no issues when porting existing code [...] is, IMHO, unlikely. [...]

I'm afraid that expecting us to convert everything non-LCL to UTF-8 is a no-go.

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Issues with new strings of FPC trunk
« Reply #28 on: July 02, 2015, 08:12:07 pm »
@taazz:

Once again, it's only my opinion and, as you already know, my opinion is a bit different from that of the Free Pascal and Lazarus teams.

I guess we should wait for their own recommendations before taking any decisions and starting to make big modifications.

I'm also just currently trying to understand what will be the implications of switching my own code to Free Pascal 3.0 and Lazarus 1.5+.

And especially to understand what kind of modifications would be necessary, in order to respect the "spirit" of these new versions; because I don't want to do the porting job more than once.

But each case is different, especially if you are not using the LCL...


**EDIT** Maybe in your case, switching to the new versions without activating the EnableUTF8RTL define is a good solution.

Because I'm not sure that there is a big difference between:
- FPC 3.0+ LCL 1.4, or
- FPC 3.0+ LCL 1.5+ without activating EnableUTF8RTL.

In both cases, the new code page feature has to be taken into account anyway.
 
« Last Edit: July 02, 2015, 09:32:22 pm by ChrisF »

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Issues with new strings of FPC trunk
« Reply #29 on: July 02, 2015, 09:46:52 pm »
Quote from: ChrisF
As you already know, it's only my opinion, and this opinion is a bit different from that of the Free Pascal and Lazarus teams.

Stating my intent does not equate to a final decision. I'm not going to make a decision until I see the final product/release.

So far the LCL seems to impose a specific encoding on my code, emphasis on "my". To top it all, it imposes its own choices on the FPC RTL as well; that is not acceptable behavior from any library. It is bad enough that they see UTF-8 as the holy grail of the Unicode world.

Quote from: ChrisF
**EDIT** Maybe in your case, switching to the new versions without activating the EnableUTF8RTL define is a good solution. [...] In both cases, the new code page feature has to be taken into account anyway.

I do expect to have some problems with a Unicode RTL, but I do not expect any problems with code that does not use any RTL code. A type change here and there is expected, along with more elaborate changes in some cases (e.g. a text parser), but the LCL so far is a non-starter.

 
