Author Topic: ANSI text conversion interrogation with forthcoming LCL (-dEnableUTF8RTL)  (Read 12848 times)

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Preliminaries:
- by conversion, I mean conversion to the LCL's default text encoding, i.e. UTF-8,
- only Windows is concerned.


I'm a bit puzzled about how to keep backward compatibility with existing source code that deals with ANSI text data.

Currently, when an application simply needs to convert external ANSI text (text file, database, ...), the RTL and the LCL offer some tools: AnsiToUTF8/UTF8ToAnsi, SysToUTF8/UTF8ToSys, ...

Of course, by default they use only the Windows code page of the computer running the application (or possibly the one the programmer sets in the source code by modifying the DefaultSystemCodePage value).


Now, when DefaultSystemCodePage is changed to UTF8 (i.e. SetMultiByteConversionCodePage(CP_UTF8), with -dEnableUTF8RTL), existing source code that converts external ANSI text to UTF8 no longer works (unless I've missed something, which is not impossible):

.  AnsiToUTF8/UTF8ToAnsi, SysToUTF8/UTF8ToSys, ... no longer work (i.e. no conversion is done),
.  neither does the "internal" Free Pascal conversion mechanism during assignment,
.  nor does forcing variables to the ansistring type, ...

The only solution I currently see is to add some extra instructions that manually set the code page of the string variables containing ANSI text data to the Windows code page value (or any other one) before doing the conversion (tools or "internal" Free Pascal conversion).

Furthermore, the Windows code page is no longer available using only Free Pascal instructions, because the DefaultSystemCodePage value has been internally overridden by the LCL. So the only way is to call the GetACP Windows API directly, or to use fixed code page values in the source code.
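
A minimal sketch of the point above, assuming FPC 2.7.1 or later (where DefaultSystemCodePage exists): it prints the value the RTL currently reports next to what Windows itself reports through GetACP. The program name is made up for illustration. In a plain console build without the LCL the two values would normally agree; the difference described here is only expected once the LCL has switched the RTL to UTF-8.

```pascal
program ShowCodePages;
{$mode objfpc}{$H+}

{$ifdef windows}
uses
  Windows;   // only needed for GetACP
{$endif}

begin
  // What the RTL believes the "system" code page is. With -dEnableUTF8RTL
  // the LCL is reported to set this to CP_UTF8 (65001).
  WriteLn('DefaultSystemCodePage = ', DefaultSystemCodePage);

  {$ifdef windows}
  // What Windows reports as the active ANSI code page, independent of the RTL.
  WriteLn('Windows.GetACP        = ', GetACP);
  {$endif}
end.
```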

String variables with a given static code page can also be used (i.e. type MyCodePageString = type AnsiString(MyCodePageValue)), but that still requires modifications to the existing source code.


As an illustration, attached is a simple program that loads and displays an external ANSI text file (Windows code page 1252 is required for this sample text file; if necessary, create your own 'AnsiFile.txt' file for your own Windows code page and modify the sample source code accordingly).

Please note the additional instruction for the ANSI->UTF8 conversion, required with EnableUTF8RTL:
Code: [Select]
  // The conversion
  {$if (FPC_FULLVERSION>=20701) and defined(EnableUTF8RTL)}
  SetCodePage(rawbytestring(S), 1252, false);  // 1252 always !! (or Windows.GetACP())
  {$ifend}
  SConv := SysToUTF8(S);
  // Conversion done


Have I missed something obvious, or is this an "accepted" backward-compatibility issue?
« Last Edit: June 26, 2015, 08:06:06 pm by ChrisF »

taazz

  • Hero Member
  • *****
  • Posts: 5368
Quote
Furthermore, the Windows code page is no longer available using only Free Pascal instructions, because the DefaultSystemCodePage value has been internally overridden by the LCL.

I have no idea; as far as I know, the latest Lazarus release uses FPC 2.6.4, which does not support what you describe here. But the above sentence is a non-starter for me. If the LCL goes that way, I'll have to stop upgrading and fork it.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

ChrisF

Quote
I have no idea; as far as I know, the latest Lazarus release uses FPC 2.6.4, which does not support what you describe here   [...]

Yes, I'm talking only about the trunk version of Lazarus (what I've called the "forthcoming LCL"), together with Free Pascal 2.7.1+ (with the rawbytestring code page support, ...).

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4673
  • I like bugs.
Quote
Have I missed something obvious, or is this an "accepted" backward-compatibility issue?

Indeed a text file with Windows (or some other) codepage requires extra code. I was planning to test it on Windows and document the alternatives in the feature's wiki page. Still haven't done that.
I see 2 ways to handle it.
1. Your way of setting the right codepage for a string in advance.
2. Use RawByteString and do an explicit conversion.
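
One possible reading of option 2, as a rough sketch rather than the LCL's own method: the file bytes are read into a RawByteString, labelled with the code page they are assumed to be in, and then converted to UTF-8 with plain RTL calls. LoadAsUTF8 and the temporary file name are hypothetical names introduced for this illustration, and code page 1252 is an assumption about the sample data.

```pascal
program ReadAnsiFile;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}   // conversion manager on Unix; not needed on Windows
  Classes, SysUtils;

// Hypothetical helper: loads a file whose bytes are assumed to be in SourceCP
// and returns the text converted to UTF-8.
function LoadAsUTF8(const FileName: string; SourceCP: TSystemCodePage): UTF8String;
var
  Stream: TFileStream;
  Raw: RawByteString;
begin
  Raw := '';
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Raw, Stream.Size);
    if Length(Raw) > 0 then
      Stream.ReadBuffer(Raw[1], Length(Raw));
  finally
    Stream.Free;
  end;
  SetCodePage(Raw, SourceCP, False);  // label only: the bytes are left untouched
  SetCodePage(Raw, CP_UTF8, True);    // explicit conversion of the data to UTF-8
  Result := Raw;
end;

var
  // "cafe" with an e-acute ($E9), as it would be stored in Windows-1252
  Sample: array[0..3] of Byte = ($63, $61, $66, $E9);
  F: TFileStream;
  U: UTF8String;
begin
  // Write a small sample file so the sketch is self-contained.
  F := TFileStream.Create('ansi_sample.tmp', fmCreate);
  try
    F.WriteBuffer(Sample, SizeOf(Sample));
  finally
    F.Free;
  end;

  U := LoadAsUTF8('ansi_sample.tmp', 1252);
  WriteLn('Input bytes: ', SizeOf(Sample), ', UTF-8 bytes: ', Length(U));
  DeleteFile('ansi_sample.tmp');
end.
```

A UTF-8 length larger than the input length indicates that non-ASCII bytes were re-encoded. With the LCL available, a LConvEncoding function such as CP1252ToUTF8 could replace the two SetCodePage calls.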

I made a skeleton document here :
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#Reading_text_file_with_Windows_codepage
Can you, ChrisF, please test more and update the page with tested examples?

It is important to remember that the "better" Unicode support in LCL makes everything easier when the input data is already UTF-8.
Things are easier also when data with different encodings is already in string variables. The conversions now happen dynamically and automatically.
Your case of a text file with Windows encoding is an exception and requires a little bit of extra code.
We also expect to find more problems in libraries because of old WinAPI calls. They must be replaced with the new W-function versions.
So far there are only 2 issues found :
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#Open_issues
Please everybody test more on Windows and report problems. I hope FPC 3.0 gets out ASAP, it would make testing easier.
« Last Edit: June 27, 2015, 02:06:05 am by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

taazz

OK, this is going to be my last post on the subject, since I'm involved in neither the development nor the testing.

Quote
Have I missed something obvious, or is this an "accepted" backward-compatibility issue?

Quote
Indeed a text file with Windows (or some other) codepage requires extra code. I was planning to test it on Windows and document the alternatives in the feature's wiki page. Still haven't done that.
I see 2 ways to handle it.
1. Your way of setting the right codepage for a string in advance.
2. Use RawByteString and do an explicit conversion.

The fact that the RTL variable DefaultSystemCodePage is changed by the LCL is a major problem. This variable should be used to talk to the system itself, and lying about it makes things harder, not easier.

Quote
I made a skeleton document here :
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#Reading_text_file_with_Windows_codepage
Can you, ChrisF, please test more and update the page with tested examples?

It is important to remember that the "better" Unicode support in LCL makes everything easier when the input data is already UTF-8.

Which, outside the browser, is never the case in my experience. Up until 5 years ago the most common file encoding was some form of DOS code page, because the systems creating those files were built that far back, and on the data exchange front it was the default ANSI Windows code page or UTF-16. UTF-8 I have only seen in HTML, to be frank.

Quote
Things are easier also when data with different encodings is already in string variables. The conversions now happen dynamically and automatically.
Your case of a text file with Windows encoding is an exception and requires a little bit of extra code.

Not really; the exception is to find anything UTF-8 in the Windows world.

Quote
We also expect to find more problems in libraries because of old WinAPI calls. They must be replaced with the new W-API versions.
So far there are only 2 issues found :
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#Open_issues
Please everybody test more on Windows and report problems. I hope FPC 3.0 gets out ASAP, it would make testing easier.

Sorry, but as it stands now the "Unicode version" will only be installed on a VM for running tests. I'm not going to change a few thousand lines of code because you think that the default Windows code page is UTF-8.

ChrisF

Quote
[...] Can you ChrisF please test more and update the page with tested examples.  [...]

As far as I've tested, all 3 of these kinds of solution work (BTW, they are general ANSI <-> UTF8 conversions, not limited to text file data):

Code: [Select]
{$if (FPC_FULLVERSION>=20701) and defined(EnableUTF8RTL)}
  SetCodePage(rawbytestring(StrIn), 1252, false);  // 1252 always, or Windows.GetACP()
{$ifend}
  StrOut := SysToUTF8(StrIn);

Code: [Select]
{$if (FPC_FULLVERSION>=20701) and defined(EnableUTF8RTL)}
type
  string1252 = type ansistring(1252);
{$ifend}
...
{$if (FPC_FULLVERSION>=20701) and defined(EnableUTF8RTL)}
var StrIn: string1252;
{$else}
var StrIn: string;
{$ifend}
...
  StrOut := SysToUTF8(StrIn);

Code: [Select]
var StrIn: string;    // or rawbytestring (OK in both cases apparently, but rawbytestring type needs conditional code)
...
  StrOut := CP1252ToUTF8(StrIn);    // 1252 always, of course


Personally, I prefer the first one, because generic source code can be used for any Windows code page (provided you know this Windows code page value). Some new generic conversion functions could even be created, like (it's only a proposal):
Code: [Select]
{$if (FPC_FULLVERSION>=20701) and defined(EnableUTF8RTL)}
function AnsiCPChange(const s: string): string;
begin
  result := s;
  // Only if code page for string variable has not been already set (by a former AnsiCPChange call, or by user)
  if StringCodePage(s) = CP_UTF8 then
    if DefaultSystemCodePage = CP_UTF8 then
      SetCodePage(rawbytestring(result), Windows.GetACP, false)
    else
      // In case DefaultSystemCodePage has been changed by user
      SetCodePage(rawbytestring(result), DefaultSystemCodePage, false);
end;

function SysToUTF8Ext(const s: string): string;
begin
  result := SysToUTF8(AnsiCPChange(s));
end;

function UTF8ToSysExt(const s: string): string;
begin
  result := AnsiCPChange(UTF8ToSys(s));
end;
{$ifend}

Or possibly modify the existing SysToUTF8/UTF8ToSys LazUTF8 functions? It would be great if possible, as it would mean that no change would be required in current user source code (i.e. current function calls would still be OK).

Concerning AnsiToUTF8 and UTF8ToAnsi, I'm afraid they no longer work anyway; not without modifying the code page of ANSI string variables directly in the user's source code.


Quote
[...] Your case of a text file with Windows encoding is an exception and requires a little bit of extra code.  [...]

Well, if I'm correct, in fact EVERY existing (i.e. up to Lazarus 1.4/Free Pascal 2.6.4) call to the following conversion functions needs to be carefully examined in current source code, and most probably modified (whatever the origin of the ANSI data; not only text files, I mean): AnsiToUTF8 and UTF8ToAnsi, SysToUTF8 and UTF8ToSys.

Because I can't imagine any case in which these current function calls would also work properly with Lazarus 1.5+/Free Pascal 2.7.1+ (as string variables containing ANSI data are now always identified as strings with a UTF8 code page by default). Except for ASCII-only text, of course.


As an addition,

1/ If DefaultSystemCodePage is now always set to CP_UTF8, it would be nice to have at least a new variable containing really the Windows code page; like OSDefaultSystemCodePage or any other name (for the other OS, OSDefaultSystemCodePage = DefaultSystemCodePage).

It would break Delphi compatibility (Delphi has only one DefaultSystemCodePage value), and it would require source code changes, but at least this new variable could be used in these changes (with conditional instructions however, depending on the LCL version).


2/ In new programs (i.e. not talking of the ANSI conversion problem for existing source code), how is it possible to declare a "generic" ANSI type string with LCL 1.5+ ? I mean, a string type with the Windows code page as a static code page value: neither "ansistring" nor "type ansistring(CP_ACP)" are working, as CP_ACP means now UTF8.

So, how can we declare a "realansistring" type, usable for any Windows code page (corresponding of course to the targeted computer Windows OS) ?


NB.
[...] It is important to remember that the "better" Unicode support in LCL makes everything easier when the input data is already UTF-8.  [...]

I'm certainly not arguing against a "better" Unicode support in the LCL/RTL. I just wanted to identify potential compatibility issues concerning the new LCL/FPC versions, at least as currently implemented in the corresponding trunk versions.
« Last Edit: June 27, 2015, 04:08:50 pm by ChrisF »

BeniBela

  • Hero Member
  • *****
  • Posts: 955
    • homepage
It seems it would be much better if there were an option to set `string = UTF8String = ansistring(CP_UTF8)` and keep ansistring as it used to be (ansistring(CP_ACP)).

And replace every occurrence of ansistring in the LCL with string/utf8string, if it does not want the real ansistring there.

« Last Edit: June 27, 2015, 05:54:19 pm by BeniBela »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12641
  • FPC developer.
Or simply go for string=unicodestring.

BeniBela

But the LCL uses UTF-8 everywhere


taazz

Quote
But the LCL uses UTF-8 everywhere

Define "everywhere", or better yet, define where this makes a difference. I'm not against the LCL using UTF-8; I'm against disguising it as the OS default code page. If the LCL needs UTF-8, which I doubt, then use utf8string instead of string and be done with it.

ChrisF

Quote
[...]  Can you ChrisF please test more and update the page with tested examples.  [...]

A clarification concerning the "explicit conversion" solution...

CPxxxxToUTF8 functions are OK, but it's more complicated for their opposites, UTF8ToCPxxxx.

For the latter, the data in the result are OK, but the code page of the result returned by these functions is always the default one, i.e. CP_UTF8. This will most probably cause a lot of problems for anybody using them.

These functions could possibly be modified to force the corresponding code page on the result.

For instance, something like:
Code: [Select]
function UTF8ToCP1252(const s: string): string;
begin
  Result:=UTF8ToSingleByte(s,@UnicodeToCP1252,1252);    // Code page parameter added to all UTF8ToSingleByte call
end;

function UTF8ToSingleByte(const s: string;
  const UTF8CharConvFunc: TUnicodeToCharID; ResultCP: TSystemCodePage): string;
[...]
{$if (FPC_FULLVERSION>=20701) and defined(EnableUTF8RTL)}
  SetCodePage(rawbytestring(result), ResultCP, false);
{$ifend}
[...]

JuhaManninen

Sorry for the delay in my answer.

Quote
Well, if I'm correct, in fact EVERY existing (i.e. up to Lazarus 1.4/Free Pascal 2.6.4) call to the following conversion functions needs to be carefully examined in current source code, and most probably modified (whatever the origin of the ANSI data; not only text files, I mean): AnsiToUTF8 and UTF8ToAnsi, SysToUTF8 and UTF8ToSys.

Because I can't imagine any case in which these current function calls would also work properly with Lazarus 1.5+/Free Pascal 2.7.1+ (as string variables containing ANSI data are now always identified as strings with a UTF8 code page by default). Except for ASCII-only text, of course.

Not really. Almost all existing conversion calls can be removed. Conversions happen automatically now. Only when reading from a file or raw stream must you take care of the encoding explicitly.
String data returned from library calls is not a problem, as the strings have dynamic encoding and are converted as needed. For example, functions calling WinAPI can use UnicodeString explicitly. Many FPC libs do that and more will do so in future. This does not even pose any conflict between the encodings. Using the 'W'-WinAPI calls and UnicodeString makes LCL work right regardless of what encoding it uses.

Quote
1/ If DefaultSystemCodePage is now always set to CP_UTF8, it would be nice to have at least a new variable containing really the Windows code page; like OSDefaultSystemCodePage or any other name (for the other OS, OSDefaultSystemCodePage = DefaultSystemCodePage).
[...]
2/ In new programs (i.e. not talking of the ANSI conversion problem for existing source code), how is it possible to declare a "generic" ANSI type string with LCL 1.5+ ? I mean, a string type with the Windows code page as a static code page value: neither "ansistring" nor "type ansistring(CP_ACP)" are working, as CP_ACP means now UTF8.

Good questions, but I don't have answers now. I even asked Mattias, who has the best knowledge of the topic, but he is on holiday, I think.
As some people may have noticed, I have been learning Unicode-related stuff while testing/improving/documenting our new UTF-8 support.
Autumn may be a better time to solve this; by then FPC 3.0RC1 will hopefully be released, making testing easier for many people.

How about UTF8ToWinCP and WinCPToUTF8? You already solved the problem somehow using Windows.GetACP etc.

Quote
CPxxxxToUTF8 functions are OK, but it's more complicated for their opposite UTF8ToCPxxxx.
For these last ones, the data (in the result) are OK but the code page for the result returned by these functions is always the default one, i.e. CP_UTF8. Which will most probably causes a lot of problems to anybody using them.
Eventually, these functions could be modified, in order to force the corresponding code page for the result.

Yes, that is a bug. Thanks for finding it. I will add it to the wiki page.

Please, everybody, try to find more real, detailed problems in the new system. I personally was surprised how well it works. It provides Unicode support which is easier than the current hackish system. It is almost source compatible with Delphi although the encoding is different. I mean all Ansi...() string functions work etc.
Its usage is not forced on anybody; the old system can still be used.
It is functional now (or very soon anyway) because the new FPC features allowed implementing it with only small changes. The future UTF-16 support will take more time and effort.
So, I don't see any reason for negative feelings about this new UTF-8 support. It does not take anything away from anybody, but it offers improvements for people who want them.

ChrisF

Quote
[...]  Almost all existing conversion calls can be removed. Conversions happen automatically now.  [...]

I do agree with you. But there is a big if: it only holds if the code page of each string variable corresponds to the text data actually stored in it. As there is no longer an ansistring type, this mechanism doesn't work for ANSI data; at least not without a lot of precautions and some additional code each time.
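
To illustrate that "big if", here is a small sketch using only RTL calls (FPC 2.7.1+ assumed). The same single ANSI byte is converted to UTF-8 twice: once while labelled as UTF-8 already, once while labelled as code page 1252. HexDump and ToUTF8Assuming are hypothetical helper names introduced for this example.

```pascal
program LabelMatters;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}   // conversion manager on Unix; not needed on Windows
  SysUtils;

// Returns the raw bytes of S as space-separated hex, whatever its code page.
function HexDump(const S: RawByteString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
    Result := Result + IntToHex(Ord(S[I]), 2) + ' ';
end;

// Labels Bytes as AssumedCP without touching the data,
// then asks the RTL to convert the data to UTF-8.
function ToUTF8Assuming(const Bytes: RawByteString;
  AssumedCP: TSystemCodePage): RawByteString;
begin
  Result := Bytes;
  SetCodePage(Result, AssumedCP, False);
  SetCodePage(Result, CP_UTF8, True);
end;

var
  Ansi: RawByteString;
begin
  Ansi := '';
  SetLength(Ansi, 1);
  Ansi[1] := #$E9;   // e-acute as stored in Windows-1252

  // Wrong label: the RTL sees "UTF-8 to UTF-8", so there is nothing to convert.
  WriteLn('labelled as UTF-8: ', HexDump(ToUTF8Assuming(Ansi, CP_UTF8)));

  // Right label: the RTL can re-encode the byte as a UTF-8 sequence.
  WriteLn('labelled as 1252 : ', HexDump(ToUTF8Assuming(Ansi, 1252)));
end.
```

With the wrong label the byte should come out unchanged, which is the "no conversion is done" symptom from the first post; with the right label the output should be the multi-byte UTF-8 form, provided the RTL's conversion manager knows code page 1252.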


Quote
[...]  How about UTF8ToWinCP and WinCPToUTF8?  [...]

The conversions are OK, indeed: the conversions, but not the code page value of the UTF8ToWinCP result. And it doesn't work with a "typed" ansistring (i.e. with a constant static code page, as for "type ansistring(1252)", for instance).


Below is a summary I've made of the results of an ANSI -> UTF8 text data conversion, with various versions and configurations. Except where indicated, the results are identical for [SIn = ansistring, SOut = utf8string] and for [SIn, SOut = string].

Code: [Select]
!-------------------------------------------!-------!-------!-------!-------!
!     Lazarus version and additional        !   =   !  Sys  !  Ans  !  Win  !
!            configuration                  ! OK KO ! OK KO ! OK KO ! OK KO !
!-------------------------------------------!-------!-------!-------!-------!
!                                           !       !       !       !       !
!  LCL 1.4                                  !    KO ! OK    ! OK    ! OK    !
!                                           !       !       !       !       !
!-------------------------------------------!-------!-------!-------!-------!
!                                           !       !       !       !       !
!  LCL 1.5 without UTF8RTL                  !    KO ! OK    !    KO ! OK    !
!                                           !       !       !       !       !
!-------------------------------------------!-------!-------!-------!-------!
!                                           !       !       !       !       !
!  LCL 1.5 with UTF8RTL                     !    KO !    KO !    KO ! OK    !
!                                           !       !       !       !       !
!-------------------------------------------!-------!-------!-------!-------!
!                                           !       !       !       !       !
!  LCL 1.5 with UTF8RTL and SIn=string1252  ! OK    ! OK    ! OK    !    KO !
!                                           !       !       !       !       !
!-------------------------------------------!-------!-------!-------!-------!

SIn: string variable with ANSI data
SOut: string variable equal to: 'Start!' + XXX(SIn) + '!End'

Legend for XXX:
"=" :   SIn
"Sys":  SysToUTF8(SIn)
"Ans":  AnsiToUTF8(SIn)
"Win":  WinCPToUTF8(SIn)

type
  string1252 = type ansistring(1252);


According to all my (very short) tests, I've found it's true that there is a "Better Unicode Support in Lazarus/FPC", with more simplifications, especially for new programs.

But the lack of a real ansistring type seems to be quite a problem to me; not to say an issue. But that's only my opinion ...
« Last Edit: June 30, 2015, 05:54:30 pm by ChrisF »

ChrisF

Here is a very basic summary of what I've understood concerning the implementation of the upcoming code page support for string variables in Lazarus/Free Pascal.

I'm intentionally omitting all the technical stuff, in order to make things as simple as possible and to focus mainly on the concepts.


FREE PASCAL by default (i.e. no EnableUTF8RTL define in Lazarus):
 ansistring   -->   ANSI
 utf8string   -->   UTF8
 string = ansistring   -->   ANSI


LAZARUS (with EnableUTF8RTL)
 ansistring   -->   UTF8
 utf8string   -->   UTF8
 string = ansistring = utf8string   -->   UTF8


In the latter case the ANSI string type has now disappeared. Apparently (to be confirmed or refuted), there is no way to have only "string = utf8string" without modifying the "ansistring" type at the same time.

Something like:
 ansistring   -->   ANSI
 utf8string   -->   UTF8
 string = utf8string   -->   UTF8

Though this last configuration looks more logical to me, and furthermore better suited to a transition step.

JuhaManninen

ChrisF, the word ANSI is apparently causing confusion. Please try to forget it and think of "encoding" instead.
The "Ansi"-prefix in the string type and in string functions is confusing. Originally the string functions (no Ansi...) worked with plain ASCII (7-bit) = only English. Then Borland added support for new ANSI code pages (8-bit) and new functions with the "Ansi"-prefix. At some point they named the dynamic string type AnsiString, maybe to differentiate it from ShortString.

Now the Ansi...() functions work with UTF-16 text (UnicodeString) in Delphi and UTF-8 text in Lazarus (with -dEnableUTF8RTL). The "Ansi"-prefix does not make much sense; it is only a historical remnant.

ANSI also means "American National Standards Institute", but I don't know if it has anything to do with our Ansi-stuff.

Now the main rule must be to always have the right encoding in a string variable. When reading a system file/stream, an extra function call may be needed, but so be it.
Mattias may have better ideas. The discussion must later be moved to the Lazarus mailing list, which is followed by more developers. I am still on the edge of understanding/not-understanding all the details.

Patches are welcome to fix the wrong types in conversion functions.