Author Topic: WideString to AnsiString bug  (Read 3578 times)

Thaddy

  • Hero Member
  • *****
  • Posts: 8508
Re: WideString to AnsiString bug
« Reply #15 on: August 14, 2018, 02:21:40 pm »
Which is quite a difference too in the context. What do the unicode-to-ansi routines do? (In the case of Delphi, unicode means UTF-16, not UTF-8, and the strings are refcounted.) These routines are not governed by Lazarus code but by FPC core code, where unicode also means UTF-16.
« Last Edit: August 14, 2018, 02:23:33 pm by Thaddy »
Read the manuals and if you are a professional get a proper education in computer science. Makes the forum a lot cleaner.

engkin

  • Hero Member
  • *****
  • Posts: 2513
Re: WideString to AnsiString bug
« Reply #16 on: August 14, 2018, 03:09:17 pm »
But the question is why are you still using ANSI in 2018?

It is still the default 1-byte windows encoding, and many DLLs are specified as using ansi. This is one of the problems of the UTF8 hack.

The 1-byte windows encoding is a leftover from older versions before introducing UCS-2, expanding it to UTF16, and, finally, adding UTF8.


Thaddy

  • Hero Member
  • *****
  • Posts: 8508
Re: WideString to AnsiString bug
« Reply #17 on: August 14, 2018, 03:25:45 pm »
It hasn't been the default encoding since NT4. There the default was UCS2, which later became the first, strictly 2-byte, incarnation of UTF16.
Since Windows XP the Ansi APIs are (most of them) stubs for the UCS2/UTF16 ones, on ALL Windows platforms. The A's go through the W's, except for some low-level C stuff like strlen and family.
You know that, Marco.
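
To illustrate the "A's go through W's" point, here is a rough sketch of what such a stub conceptually does: widen the text using the active code page, then call the W function. MessageBoxViaW and Widen are hypothetical names made up for this example, it is Windows-only, and it is not a claim about how Windows implements its stubs internally.

```pascal
program astub;
{$mode objfpc}{$H+}
uses
  Windows;

// Hypothetical stand-in for an "A" entry point: widen the ANSI text with the
// active code page, then hand it to the "W" function.
function MessageBoxViaW(const Text, Caption: AnsiString): Integer;

  function Widen(const s: AnsiString): UnicodeString;
  var
    n: Integer;
  begin
    Result := '';
    if s = '' then
      Exit;
    // The first call only asks how many UTF-16 code units are needed.
    n := MultiByteToWideChar(CP_ACP, 0, PAnsiChar(s), Length(s), nil, 0);
    SetLength(Result, n);
    // The second call performs the actual conversion into the buffer.
    MultiByteToWideChar(CP_ACP, 0, PAnsiChar(s), Length(s), PWideChar(Result), n);
  end;

begin
  Result := MessageBoxW(0, PWideChar(Widen(Text)), PWideChar(Widen(Caption)), MB_OK);
end;

begin
  MessageBoxViaW('Hello', 'A via W');
end.
```

Note that CP_ACP here is interpreted by Windows itself, so it refers to the system's active code page, whatever the RTL's own default code page is set to.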

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7311
Re: WideString to AnsiString bug
« Reply #18 on: August 14, 2018, 03:36:21 pm »
Which is quite a difference too in the context. What do the unicode (which in the case of delphi is unicode 16, not utf8, and refcounted) to ansi routines do? These are not governed by Lazarus code, but by fpc core code..

Unicodestring routines that are functions generally return ansistring(0). Codepage 0 is special because it is a dynamic codepage (so that it can be detected on startup and held in a variable). Lazarus manipulates this mechanism by changing this codepage to mean utf8 on startup. That still works.

The problems are that code which assumes this is Lazarus-specific (since quite often it assumes unicodestring<->ansistring(0) conversions are lossless), and, as already mentioned, that the original, Delphi-compatible reason for having a dynamic codepage 0 (setting it on startup to the system encoding) has no place in this mechanism.

As soon as class types start changing to unicodestring, utf8hack code will break, but it might be easier to fix. At least easier than code from FPC 2.x based Lazarus versions, so that is a win.
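
A minimal sketch of the dynamic codepage 0 mechanism, assuming FPC 3.x. The SetMultiByteConversionCodePage call only approximates what the Lazarus utf8 hack does on startup (the real code touches more settings). On Unix a widestring manager such as cwstring is needed for the non-ASCII conversion to be meaningful; on Windows it is built in.

```pascal
program dyncp;
{$mode objfpc}{$H+}
var
  u: UnicodeString;
  a: AnsiString;   // declared code page is CP_ACP (0), resolved at run time
begin
  WriteLn('CP_ACP constant:               ', CP_ACP);
  WriteLn('Code page detected at startup: ', DefaultSystemCodePage);

  // Approximation of the utf8 hack: make code page 0 mean UTF-8 from now on.
  SetMultiByteConversionCodePage(CP_UTF8);
  WriteLn('After the switch:              ', DefaultSystemCodePage);

  // "cafe" with an accented e, built without relying on source file encoding.
  u := UnicodeString('caf') + WideChar($00E9);
  a := u;  // implicit UnicodeString -> AnsiString(0) conversion
  WriteLn('Bytes in a: ', Length(a), ', code page of a: ', StringCodePage(a));
end.
```

Comparing the byte count with and without the SetMultiByteConversionCodePage line shows whether the conversion went to the system encoding or to utf8.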

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7311
Re: WideString to AnsiString bug
« Reply #19 on: August 14, 2018, 03:42:03 pm »
The 1-byte windows encoding is a leftover from older versions before introducing UCS-2, expanding it to UTF16, and, finally, adding UTF8.

"Leftover" is not an official Microsoft position; like the wikipedia url, it is a lot of IMNSHO.

Worse, the wikipedia url is pure speculation. No Windows is currently configured as such out of the box, and doing otherwise is a hack with bad backward compatibility (since it applies to all apps).

The formal, documented Microsoft viewpoint is still that unicode on API level is utf16, even if it supports utf8 for documents, though not as an active 1-byte encoding. And the formal, documented active encodings are the ones you dub "leftover".




Bart

  • Hero Member
  • *****
  • Posts: 3465
    • Bart en Mariska's Webstek
Re: WideString to AnsiString bug
« Reply #20 on: August 14, 2018, 03:43:56 pm »
Code: Pascal
function UTF8ToWinCP(const s: string): string; inline;

returns ansistring(0) which is utf8 as per utf8hack. 

Utf8ToWinCP sets the codepage to CP_ACP in its conversion (not sure if that matters w.r.t. your argumentation).
However, the result can be used as a parameter for external AnsiString APIs, usually cast to a PChar.
In fact, that's the main reason they exist IIRC.

Yes, it's a hack.

Bart

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7311
Re: WideString to AnsiString bug
« Reply #21 on: August 14, 2018, 04:06:47 pm »
Code: Pascal
function UTF8ToWinCP(const s: string): string; inline;

returns ansistring(0) which is utf8 as per utf8hack. 

Utf8ToWinCP sets the codepage to CP_ACP in its conversion (not sure if that matters w.r.t. your argumentation).

From systemh.inc

Code: Pascal
CP_ACP     = 0;     // default to ANSI code page

IOW, no, it does not matter; it is only a constant for the 0. With the hack enabled, that is utf8.

Quote
However, the result can be used as a parameter for external AnsiString APIs, usually cast to a PChar.
In fact, that's the main reason they exist IIRC.

Yes, it's a hack.

Yes. Though more elegantly it should be done with a rawbytestring (see e.g. the FPC Windows RTL implementation).

But that is more a workaround for fairly local translation; otherwise you are back to the old pre-FPC 3 misery of never knowing what encoding is in a string, with hard-to-find bugs if you miss a spot where a manual conversion should be inserted.

The FPC 3 (cq Delphi) system is meant to end that kind of trouble.
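
A small sketch of why rawbytestring fits here, assuming FPC 3.x: a RawByteString parameter takes any ansistring flavour as-is, so the bytes and the code page tag arrive unconverted and the callee can inspect them. Describe and TCP1252String are hypothetical names for this example. On Unix the assignment to the 1252 string needs a widestring manager such as cwstring.

```pascal
program rawparam;
{$mode objfpc}{$H+}
type
  // Hypothetical string type with a fixed, declared code page.
  TCP1252String = type AnsiString(1252);

// Hypothetical helper: no conversion happens when a typed ansistring is
// passed to a RawByteString parameter, so the code page tag is preserved.
procedure Describe(const Lbl: string; const s: RawByteString);
begin
  WriteLn(Lbl, ': code page ', StringCodePage(s), ', ', Length(s), ' byte(s)');
end;

var
  u: UnicodeString;
  u8: UTF8String;
  w: TCP1252String;
begin
  u := UnicodeString('caf') + WideChar($00E9);
  u8 := UTF8Encode(u);  // explicit conversion to UTF-8
  w := u;               // implicit conversion to code page 1252
  Describe('utf8', u8);
  Describe('cp1252', w);
end.
```

The same text shows up with a different tag and byte count per string type, which is the information that got lost in the pre-FPC 3 situation.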

engkin

  • Hero Member
  • *****
  • Posts: 2513
Re: WideString to AnsiString bug
« Reply #22 on: August 14, 2018, 07:47:12 pm »
The 1-byte windows encoding is a leftover from older versions before introducing UCS-2, expanding it to UTF16, and, finally, adding UTF8.

"Leftover" is not an official Microsoft position, and like the wikipedia url a lot of IMNSHO.
Leftover is meant in the sense of being there for backward compatibility. As for the Wikipedia page, it is to give some dates showing when Microsoft added Unicode. But you seem to have a problem believing that UTF8 is possible as the codepage of the locale in recent Windows versions. I am sure you can try it for yourself.

Worse, the wikipedia url is pure speculation. No windows currently is configured as such out of the box, and doing otherwise, is a hack with bad backward compatibility (since it applies to all apps)
Why would it need to be configured out of the box?

The formal, documented, microsoft viewpoint is still that unicode on api level is utf16, even if it supports utf8 for documents and not an active 1-byte encoding. And the formal, documented active encodings are the ones you dub "leftover".

It is about Unicode, not about UTF8 vs UTF16. I am totally aware of your position on which encoding you want. I personally would prefer to see support for UTF32, though I know it will not happen.

Edit:
"Beta: Use Unicode UTF-8 for worldwide language support" feature automatically activate[d] on her computer.

Attached image linked on https://news.ycombinator.com/item?id=15710685 by "rossy"
« Last Edit: August 14, 2018, 08:31:01 pm by engkin »

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7311
Re: WideString to AnsiString bug
« Reply #23 on: August 14, 2018, 08:29:58 pm »
Leftover is meant in the sense of being there for backward compatibility. As for the Wikipedia page, it is to give some dates showing when Microsoft added Unicode. But you seem to have a problem believing that UTF8 is possible as the codepage of the locale in recent Windows versions. I am sure you can try it for yourself.

Well, I only see a lot of east asian there. I do see a beta utf8 option though, if I search for it.

Why would it need to be configured out of the box?

As hard as it is to explain hacks like the Lazarus utf8 hack to developers, God forbid I have to explain encoding settings to end-users.

So it is fine if you can do this on startup of your app, in your app's bubble. If you have to do it globally, it is IMHO useless, or at least it is until all of Windows supports this (still beta) option.

Quote
It is about Unicode, not about UTF8 vs UTF16. I am totally aware of your position on which encoding you want. I personally would prefer to see support for UTF32, though I know it will not happen.

You must write a lot of text renderers (afaik the only thing UTF32 is actively used for).

It's about the normal situation on Windows. And Windows is roughly equal to the supported versions (+ say a year of grace period, as we have observed in the past for w2000, XP and Vista). So since w8 is good till 2023, you can't make it a requirement till 2024.



engkin

  • Hero Member
  • *****
  • Posts: 2513
Re: WideString to AnsiString bug
« Reply #24 on: August 14, 2018, 08:33:04 pm »
Please check the edited part and the image in my previous post. Notice the call to MessageBoxA

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7311
Re: WideString to AnsiString bug
« Reply #25 on: August 14, 2018, 08:46:28 pm »
Please check the edited part and the image in my previous post. Notice the call to MessageBoxA

My previous message already reacts to that, please read it.

nanobit

  • New Member
  • *
  • Posts: 39
Re: WideString to AnsiString bug
« Reply #26 on: December 08, 2018, 12:31:11 pm »
Btw, Delphi does the conversion correctly. Who knows how to do the conversion correctly?

I've just hit the same issue.
In summary, for Windows targets, Lazarus has no compile-time codepage to represent windows.getAcp; cp_acp always means cp_utf8. Thus all calls to utf8ToAnsi() and utf8ToWinCp() produce utf8.
If you need to pass params to a Windows ansi dll, you need something like this:

Code: Pascal
function getRawSysString( const s: rawByteString): rawByteString;
begin
  result := s;
  setCodePage( result, windows.getAcp(), true);
end;

var
  utf8: string; // acp string in lazarus
  winParam: rawByteString; // encoded with windows.getAcp()
begin
  utf8 := utf8Restricted; // only a subset which is presentable in windows.getAcp()
  winParam := getRawSysString( utf8);
  dllFunc( pchar(winParam));