Topic: ANSI text conversion interrogation with forthcoming LCL (-dEnableUTF8RTL)

marcov

Quote
Something like:
 ansistring   -->   ANSI
 utf8string   -->   UTF8
 string = utf8string   -->   UTF8

The problem is that there can only be one variable encoding (as in encoding set on startup) at the moment.

You have two, one for "string", which you pin to utf8, and one for ansi.

"utf8string" as a type has no special handling in the compiler. If you want this, you'll have to fix your code for unicodestring. The enableutf8 is already the transition hack.




taazz

Quote
ChrisF, the word ANSI apparently confuses your head. Please try to forget it and think of "encoding" instead.
The "Ansi"-prefix in string type and in string functions is confusing. Originally the string functions (no Ansi...) worked with plain ASCII (7-bit) = only English. Then Borland added support for new ANSI code pages (8-bit) and new functions with "Ansi"-prefix. At some point they named the dynamic string type as AnsiString maybe to differentiate it from ShortString.

Hell no. If you want to go back as far as the DOS era, then you have to know that at that time the only existing standard was ASCII, and some companies were trying to maintain a list of code pages that used ASCII for the first 7 bits and another language for the remaining values. In those days IBM was recognized as the middle ground, although most big companies did their own thing. Ansi strings were introduced only with Windows development, and they had two things that short strings did not: 1) huge strings instead of up to 255 chars, and 2) support for the Windows multibyte encodings. How the name ANSI came to be and what its connection with the ANSI institute is, I have no idea, but I assume that the ANSI institute has some sort of multibyte char encoding defined somewhere.

Quote
Now the Ansi...() functions work with UTF-16 text (UnicodeString) in Delphi and UTF-8 text in Lazarus (with -dEnableUTF8RTL). The "Ansi"-prefix does not make much sense, it is only a historical remain.

That was a questionable decision on Borland's part, made to help move existing code to Unicode with a minimum of changes. But they did not steamroll the Ansi strings into something they are not just to support Unicode, and they are still paying for it today. I expected a lot better from Lazarus; I was wrong.

ChrisF

Quote
[...]  Now the Ansi...() functions work with UTF-16 text (UnicodeString) in Delphi  [...]

True, but I was talking of Ansi string variables, not of Ansi...() functions.

And when Delphi migrated to its Unicode version, the "ansistring" type was kept untouched, so that people could quickly and easily port their old programs just by changing their "string" variables to "ansistring" ones. OK, it was probably a bit more complicated than that, but at least it helped during the transition.

Currently, Lazarus doesn't offer this possibility and that's why I'm a bit worried.


Quote
[...]  ChrisF, the word ANSI apparently confuses your head. Please try to forget it and think of "encoding" instead.  [...]

I agree that the fact I'm using the same term for various things might be a bit confusing. So, here is technically exactly what I mean by an Ansi string variable type:

Code:
type
  realansistring = type ansistring( Windows.GetACP );

which of course can't be declared this way.
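
For comparison, here is a minimal sketch of what can be declared, assuming FPC 3.0 or later (the code-page-aware string types are not available in 2.6.4). A string type can be pinned to a code page that is a compile-time constant, while a code page known only at run time, such as the value of Windows.GetACP, has to be applied to a RawByteString with SetCodePage. The names TCP1252String and ToRuntimeCodePage are made up for this illustration, and 1252 stands in for whatever GetACP would return.

```pascal
program CodePageTypes;
{$mode objfpc}{$H+}
{$ifdef unix}
uses
  cwstring; // widestring manager, needed for code page conversions on Unix
{$endif}

type
  // Legal: the code page is a compile-time constant.
  TCP1252String = type AnsiString(1252);

// Hypothetical helper: returns S converted to code page CP, where CP is
// only known at run time (for example the value of Windows.GetACP).
function ToRuntimeCodePage(const S: UTF8String; CP: TSystemCodePage): RawByteString;
begin
  Result := S;                    // RawByteString: bytes and tag are copied unchanged
  SetCodePage(Result, CP, True);  // convert the bytes and retag the string
end;

var
  U: UTF8String;
  Fixed: TCP1252String;
  Dyn: RawByteString;
begin
  // Build "é" from its two UTF-8 bytes so that the example does not
  // depend on the encoding of the source file.
  SetLength(U, 2);
  U[1] := #$C3;
  U[2] := #$A9;

  Fixed := U;                         // the compiler inserts the UTF-8 -> CP1252 conversion
  Dyn := ToRuntimeCodePage(U, 1252);  // same target, but chosen at run time

  WriteLn('Fixed: code page ', StringCodePage(Fixed), ', length ', Length(Fixed));
  WriteLn('Dyn:   code page ', StringCodePage(Dyn), ', length ', Length(Dyn));
end.
```

The run-time variant loses the static typing: the compiler knows nothing about the code page of Dyn, so any automatic conversion relies on the tag stored in the string itself.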
« Last Edit: June 30, 2015, 08:20:21 pm by ChrisF »

ChrisF

Quote
[...]  The problem is that there can only be one variable encoding (as in encoding set on startup) at the moment.   [...]

That's also how I've understood the situation.


Quote
"utf8string" as a type has no special handling in the compiler.

I'm not exactly requesting an additional "special handling". What I've got in mind is rather a way to define a string type with ANSI (= GetACP of the running computer) as a static code page value.

As I see it, the problem is that there is no difference between the current encoding and the Windows encoding: CP_ACP in fact means DefaultSystemCodePage, and not necessarily Windows.GetACP, while CP_UTF8 always means UTF-8.

So having an additional TSystemCodePage value would be great IMHO:
Code:
{ some values which are used in RTL for TSystemCodePage type }
const
  CP_ACP     = 0;     // default to ANSI code page
  CP_OEMCP   = 1;     // default to OEM (console) code page
  CP_UTF16   = 1200;  // utf-16
  CP_UTF16BE = 1201;  // unicodeFFFE
  CP_UTF7    = 65000; // utf-7
  CP_UTF8    = 65001; // utf-8
  CP_ASCII   = 20127; // us-ascii
  CP_NONE    = $FFFF; // rawbytestring encoding

plus

Code:
  CP_REALACP = xxxx;     // ALWAYS to ANSI code page = Windows.GetACP

And this way, the "ansistring" type could be redefined as:
Code:
type
  ansistring = type ansistring( CP_REALACP );
and not as "ansistring( CP_ACP )", as is currently the case.


But I admit I don't know this part of Free Pascal well enough to know whether it's possible (or even whether it really makes sense), though I understand it's currently not possible without some additions.
« Last Edit: June 30, 2015, 07:51:53 pm by ChrisF »

marcov

Quote
I'm not exactly requesting an additional "special handling". What I've got in mind is rather a way to define a string type with ANSI (= GetACP of the running computer) as a static code page value.

The only type with a variable encoding is ansistring(ACP) = ansistring(0); all conversions are done through this type.
The utf8enable hack abuses a layer to fix this on non-Windows platforms. That is already a hack that FPC does not support, and you are asking to build on top of it. I don't expect that to happen.

Changing any other type to something that is not a hardcoded value (in other words, one determined at startup) means defining another runtime type and, for compatibility, doing that on more than one system.

Quote
As I see it, the problem is that there is no difference between the current encoding and the Windows encoding: CP_ACP in fact means DefaultSystemCodePage, and not necessarily Windows.GetACP, while CP_UTF8 always means UTF-8.

Yes. But the enableutf8 hack sets the RTL equivalent of ACP to utf8.
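
A minimal sketch of that mechanism, assuming FPC 3.0 or later: DefaultSystemCodePage is the RTL variable that CP_ACP resolves to, and SetMultiByteConversionCodePage overrides it. That the LazUTF8 unit makes a call like this when built with -dEnableUTF8RTL is an assumption here; the unit's source has the actual initialization.

```pascal
program DefaultCodePageDemo;
{$mode objfpc}{$H+}
begin
  // Value chosen by the RTL at startup (on Windows, the ANSI code page).
  WriteLn('At startup:         ', DefaultSystemCodePage);

  // Override the code page that CP_ACP strings are assumed to use.
  // Assumption: this mirrors what the -dEnableUTF8RTL setup does.
  SetMultiByteConversionCodePage(CP_UTF8);

  // From here on, ansistring(CP_ACP) is treated as UTF-8 by the RTL.
  WriteLn('After the override: ', DefaultSystemCodePage);
end.
```

Because there is only this one variable, pinning it to UTF-8 leaves no string type that still follows the Windows ANSI code page, which is the conflict discussed above.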

Quote
Code:
  CP_REALACP = xxxx;     // ALWAYS to ANSI code page = Windows.GetACP

And since Windows.GetACP is not known at compile time, that would be another type whose code page is only determined at run time.

Quote
But I admit I don't know this part of Free Pascal well enough to know whether it's possible (or even whether it really makes sense), though I understand it's currently not possible without some additions.

Simple. Long-term work should go into a Delphi-compatible solution, not into prolonging hacks.

If you need ACP, don't enable the utf8 hacks.

JuhaManninen

Anybody who wants to improve the new UTF-8 support, please find open issues and write them up on the wiki page. I am sure we can solve them.
There is no need for the endless debate about how things should be in a perfect world. That debate already went on for many years on the FPC mailing lists, wasting many people's time and energy.

My impression is that all remaining issues can be solved with WinCPToUTF8(). In any case, please note on the wiki the cases where old code needs changes.

Now we have the improved UTF-8 support (which is very good BTW).
Later there will be UTF-16 support, but it will take a few years. Nothing fancy about it; it just needs some work to implement.

mattias

« Reply #21 on: September 22, 2015, 06:07:22 pm »
Quote
CPxxxxToUTF8 functions are OK, but it's more complicated for their opposite UTF8ToCPxxxx.

For these last ones, the data (in the result) are OK but the code page for the result returned by these functions is always the default one, i.e. CP_UTF8. Which will most probably cause a lot of problems for anybody using them.

Eventually, these functions could be modified, in order to force the corresponding code page for the result.

The UTF8ToCPxxxx functions under FPC 2.7.1+ now have a parameter "SetTargetCodePage: boolean = false".
You can choose whether the resulting string has the new codepage or the default codepage. The default is false for compatibility.
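
A minimal sketch of the two call forms, assuming a Lazarus installation that provides the LConvEncoding unit (LazUtils package) built with FPC 2.7.1 or later. UTF8ToCP1252 is used as one instance of the UTF8ToCPxxxx family, and the variable names are made up for the example.

```pascal
program TargetCodePageDemo;
{$mode objfpc}{$H+}
uses
  LConvEncoding; // part of the LazUtils package

var
  Utf8Text: string;
  Untagged, Tagged: RawByteString;
begin
  // "café" written with explicit UTF-8 bytes for the accented letter, so
  // that the example does not depend on the encoding of the source file.
  Utf8Text := 'caf'#$C3#$A9;

  // Default (SetTargetCodePage = False): the bytes are converted, but the
  // result keeps the default code page tag.
  Untagged := UTF8ToCP1252(Utf8Text);

  // SetTargetCodePage = True: the bytes are converted and the result is
  // tagged with the target code page.
  Tagged := UTF8ToCP1252(Utf8Text, True);

  WriteLn('Untagged: ', StringCodePage(Untagged));
  WriteLn('Tagged:   ', StringCodePage(Tagged));
end.
```

The results are stored in RawByteString variables on purpose: assigning them to a plain string would let the compiler convert a tagged result back to the default code page.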

ChrisF

« Reply #22 on: September 23, 2015, 07:23:53 pm »
Sorry, I'm not sure I understand...

Do you mean that:

- UTF8ToCPxxxx(String) = UTF8ToCPxxxx(String, false) gives CP_UTF8 (with -dEnableUTF8RTL, of course) as the code page for the result of the function,

- while UTF8ToCPxxxx(String, true) gives CP_xxxx?

Or the opposite?


** Edit **

Looking at the current source code of the trunk version, the answer is apparently yes (i.e. true sets the code page of the result to CP_xxxx).

BTW, I can see at least 2 new conditional directives in this part of the current source code: UseSystemCPCon and UseLCPConv. What do they do?
« Last Edit: September 23, 2015, 07:50:15 pm by ChrisF »

 
