Author Topic: WideString to AnsiString bug  (Read 9693 times)

Enigma

  • Newbie
  • Posts: 4
WideString to AnsiString bug
« on: August 13, 2018, 04:10:23 pm »
Hi, I found a bug in FPC, or at least a difference between how FPC and Delphi convert a WideString to an AnsiString.

Take the character §, whose byte value is 0xA7.

- In a WideString this character is stored as two bytes, 0x00 0xA7, which is OK.
- In an AnsiString this character is stored as one byte, 0xA7, which is OK.
- But if you convert this WideString to an AnsiString, the AnsiString holds two bytes, 0xC2 0xA7, which is wrong!

Example:
var
  w : WideString;
  s : AnsiString;
begin
  w := '§';
  s := AnsiString(w);
  // "s" here contains two bytes 0xC2 0xA7, instead of just one 0xA7
end;

Btw, Delphi does the conversion correctly. Does anyone know how to do the conversion correctly in FPC?

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11458
  • FPC developer.
Re: WideString to AnsiString bug
« Reply #1 on: August 13, 2018, 04:16:23 pm »
FPC or Lazarus? Lazarus initializes the default codepage to utf8, FPC should default to the OS codepage.

And I assume both Delphi and FPC/Lazarus are on the same Windows system? (Default encodings might differ between systems and OS versions.)


Enigma

  • Newbie
  • Posts: 4
Re: WideString to AnsiString bug
« Reply #2 on: August 13, 2018, 04:27:58 pm »
This is really FPC; Lazarus is just the IDE...
This happens on the same Windows system, not a different one.

I doubt the problem is the codepage, because § is an ANSI symbol encoded as the single byte 0xA7. For such one-byte symbols the codepage should not matter. Its conversion to WideString and back to AnsiString should round-trip: 0x00 0xA7 in the WideString and just 0xA7 in the AnsiString. Delphi does the conversion correctly.

However, FPC adds a 0xC2 byte during the ANSI conversion... I do not understand why...

Blaazen

  • Hero Member
  • *****
  • Posts: 3241
  • POKE 54296,15
    • Eye-Candy Controls
Re: WideString to AnsiString bug
« Reply #3 on: August 13, 2018, 04:37:07 pm »
It must be UTF8; UTF8 does not use single bytes between $80 and $FF for characters.
Lazarus 2.3.0 (rev main-2_3-2863...) FPC 3.3.1 x86_64-linux-qt Chakra, Qt 4.8.7/5.13.2, Plasma 5.17.3
Lazarus 1.8.2 r57369 FPC 3.0.4 i386-win32-win32/win64 Wine 3.21

Try Eye-Candy Controls: https://sourceforge.net/projects/eccontrols/files/

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11458
  • FPC developer.
Re: WideString to AnsiString bug
« Reply #4 on: August 13, 2018, 05:02:25 pm »
Quote
This is really FPC; Lazarus is just the IDE...
This happens on the same Windows system, not a different one.

An IDE such as Lazarus can set some parameters to non-standard values by default, which is why I asked.

Do you really use FPC on the cmdline, OR do you use lazarus?

Quote
However, FPC adds a 0xC2 byte during the ANSI conversion... I do not understand why...

It is the utf8 equivalent of A7, see the utf8 (hex) field of this page 
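
As a rough illustration of where the extra byte comes from, here is a minimal sketch that builds the two UTF8 bytes for U+00A7 by hand. It assumes the standard two-byte UTF8 layout (110xxxxx 10xxxxxx) for code points U+0080..U+07FF; the program and variable names are made up for this example.

```pascal
program Utf8Bytes;
{$mode objfpc}{$h+}

uses
  SysUtils;

var
  cp, lead, trail: Integer;
begin
  cp := $A7;                       // U+00A7 SECTION SIGN
  // Code points U+0080..U+07FF are encoded as 110xxxxx 10xxxxxx.
  lead  := $C0 or (cp shr 6);      // upper bits go into the lead byte
  trail := $80 or (cp and $3F);    // lower 6 bits go into the continuation byte
  WriteLn('$', IntToHex(lead, 2), ' $', IntToHex(trail, 2));  // prints $C2 $A7
end.
```

So the 0xC2 is not garbage: it is the lead byte that UTF8 needs for any character above $7F, and the 0xA7 that follows only happens to match the ANSI byte.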

Enigma

  • Newbie
  • Posts: 4
Re: WideString to AnsiString bug
« Reply #5 on: August 13, 2018, 05:26:21 pm »
Quote
An IDE such as Lazarus can set some parameters to non-standard values by default, which is why I asked.

Do you really use FPC on the cmdline, OR do you use lazarus?

Sorry if it was unclear, I'm using Lazarus.

Based on the article you sent, it looks like the UTF8 encoding of this char really is two bytes.

But the confusing part is that if I force the string to be converted to ANSI, I still get the same result, i.e. the function UTF8toAnsi still returns this two-byte character.

Do you think that AnsiString in my case defaults to UTF8 instead of real ANSI? If so, how can I change it to ANSI?

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: WideString to AnsiString bug
« Reply #6 on: August 13, 2018, 05:26:51 pm »
Code: Pascal
{$apptype console}
{$mode objfpc}
{$h+}

uses
  sysutils;

function StrToHex(S: rawbytestring): shortstring;  // hex dump, one value per byte
var
  sd: ShortString;
  i: Integer;
begin
  sd := '';
  for i := 1 to Length(s) do
    sd := sd + '$' + IntToHex(Byte(s[i]), 2) + ' ';
  sd := trim(sd);
  result := sd;
end;

function StrToHex(S: WideString): shortstring;  // hex dump, one value per 16-bit unit
var
  sd: ShortString;
  i: Integer;
begin
  sd := '';
  for i := 1 to Length(s) do
    sd := sd + '$' + IntToHex(Word(s[i]), 4) + ' ';
  sd := trim(sd);
  result := sd;
end;

var
  w : WideString;
  s : AnsiString;
begin
  if ParamStr(1) = 'UTF8' then
    DefaultSystemCodePage := CP_UTF8;  // mimic what Lazarus does by default
  w := #$00A7; //'§'
  s := AnsiString(w);  // converts to DefaultSystemCodePage
  writeln('W: ',StrToHex(w));
  writeln('S: ',StrToHex(S));
end.

Code:
C:\Users\Bart\LazarusProjecten\ConsoleProjecten>fpc test.pas
Free Pascal Compiler version 3.0.4rc1 [2017/07/03] for i386
Copyright (c) 1993-2017 by Florian Klaempfl and others
Target OS: Win32 for i386
Compiling test.pas
Linking test.exe
42 lines compiled, 0.2 sec, 66688 bytes code, 4116 bytes data

C:\Users\Bart\LazarusProjecten\ConsoleProjecten>test
W: $00A7
S: $A7

C:\Users\Bart\LazarusProjecten\ConsoleProjecten>test UTF8
W: $00A7
S: $C2 $A7

(Same result for fpc trunk).

Bart

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: WideString to AnsiString bug
« Reply #7 on: August 13, 2018, 05:32:50 pm »
Note that the Lazarus IDE saves your code as UTF8 by default.
So the literal '§' will already be encoded as UTF8.

If you compile from commandline and use w := '§'; (instead of w := #$00A7) then your output will be:

Code:
W: $00C2 $00A7
S: $C2 $A7
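
A possible way around the literal problem, sketched here under the assumption that the source file really is saved as UTF8, is to tell the compiler so with the {$codepage utf8} directive. The program name is hypothetical.

```pascal
program LiteralCp;
{$mode objfpc}{$h+}
{$codepage utf8}   // declares that this source file is encoded as UTF8

var
  w: WideString;
begin
  w := '§';                    // decoded by the compiler as the code point U+00A7
  WriteLn(Length(w));          // 1: a single 16-bit unit instead of two
  WriteLn(Ord(w[1]) = $A7);    // TRUE
end.
```

Without the directive the compiler takes the two UTF8 bytes of the literal as two separate characters, which is where W: $00C2 $00A7 comes from.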

Bart

Enigma

  • Newbie
  • Posts: 4
Re: WideString to AnsiString bug
« Reply #8 on: August 13, 2018, 05:38:21 pm »
Quote
Note that the Lazarus IDE saves your code as UTF8 by default.
So the literal '§' will already be encoded as UTF8.

If you compile from commandline and use w := '§'; (instead of w := #$00A7) then your output will be:

Code:
W: $00C2 $00A7
S: $C2 $A7

Bart

Thanks Bart & marcov, it's now clear what is happening.

Btw, can you please tell me how to make Lazarus not use UTF8 for AnsiStrings, but handle them the way Delphi does (ANSI is ANSI)?

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: WideString to AnsiString bug
« Reply #9 on: August 13, 2018, 10:24:59 pm »
Quote
can you please tell me how to make Lazarus not use UTF8 for AnsiStrings, but handle them the way Delphi does (ANSI is ANSI)?

If your application is a GUI application using the LCL, you cannot. It breaks in areas where the code expects UTF8. Search for UTF8To in the LCL directory, for instance. On the other hand, if your app is a console program that does not use the LCL, it is already ANSI.

But the question is why are you still using ANSI in 2018?

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: WideString to AnsiString bug
« Reply #10 on: August 13, 2018, 10:26:31 pm »
Project Options -> Compiler Options -> Additions and overrides -> Use system encoding (top right corner)
or
Project Options -> Compiler Options -> Custom Options -> enter -dDisableUTF8RTL in the memo box on the right.

You're working against the system though.
Lazarus/LCL is basically UTF8 based.
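
To check which setting is actually in effect, a small sketch like the following can help. It only reads the DefaultSystemCodePage variable and compares it with CP_UTF8, both of which also appear in the test program earlier in this thread; the program name and the messages are made up.

```pascal
program ShowCodePage;
{$mode objfpc}{$h+}

begin
  // DefaultSystemCodePage is the code page that a plain AnsiString conversion targets.
  WriteLn('DefaultSystemCodePage = ', DefaultSystemCodePage);
  if DefaultSystemCodePage = CP_UTF8 then
    WriteLn('AnsiString(w) will produce UTF8 bytes')
  else
    WriteLn('AnsiString(w) will produce bytes in the system code page');
end.
```

Run as a plain console program, this reports whatever the RTL picked up from the OS. The same check placed inside a Lazarus project would show whether the options above took effect.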

Bart

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11458
  • FPC developer.
Re: WideString to AnsiString bug
« Reply #11 on: August 14, 2018, 10:35:57 am »
Quote
But the question is why are you still using ANSI in 2018?

It is still the default 1-byte windows encoding, and many DLLs are specified as using ansi. This is one of the problems of the UTF8 hack.

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: WideString to AnsiString bug
« Reply #12 on: August 14, 2018, 12:14:10 pm »
Quote
It is still the default 1-byte windows encoding, and many DLLs are specified as using ansi. This is one of the problems of the UTF8 hack.

Which is why Lazarus also has the Utf8ToWinCP and WinCPToUtf8 conversion routines.

And yes, I know, this is not Delphi compatible (where string = widestring throughout RTL and VCL).
(And we should not start a discussion about that in this thread.)

Bart

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11458
  • FPC developer.
Re: WideString to AnsiString bug
« Reply #13 on: August 14, 2018, 01:30:43 pm »
Quote
It is still the default 1-byte windows encoding, and many DLLs are specified as using ansi. This is one of the problems of the UTF8 hack.

Which is why Lazarus also has the Utf8ToWinCP and WinCPToUtf8 conversion routines.

Yes, but those are leftovers of <= FPC 3.0 versions, and show the crucial problem with the utf8 hack: there is no string type anymore for the default system ansi encoding, but two for utf8 (utf8string and ansistring(0) aka CP_ACP)

In other words:

Code: Pascal
function UTF8ToWinCP(const s: string): string; inline;

returns ansistring(0) which is utf8 as per utf8hack. 
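
For comparison, here is a minimal sketch of a string type with an explicitly declared code page, which keeps its encoding no matter what DefaultSystemCodePage is set to. It assumes FPC 3.0 or later; the type name TCP1252String and the helper BytesToHex are invented for this example, and Windows-1252 is only picked as a typical Western ANSI code page.

```pascal
program ExplicitCp;
{$mode objfpc}{$h+}

uses
  SysUtils;

type
  // AnsiString with a declared code page; the type name is hypothetical.
  TCP1252String = type AnsiString(1252);

// A RawByteString parameter receives the bytes without code page conversion.
function BytesToHex(const S: RawByteString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
    Result := Result + '$' + IntToHex(Ord(S[i]), 2) + ' ';
  Result := Trim(Result);
end;

var
  w: WideString;
  s: TCP1252String;
begin
  DefaultSystemCodePage := CP_UTF8;  // mimic the Lazarus default
  w := #$00A7;
  s := w;                            // conversion targets code page 1252, not UTF8
  WriteLn(BytesToHex(s));
end.
```

On Windows I would expect this to print $A7 even though the default code page is UTF8. On Unix-like systems the conversion needs a widestring manager (for example the cwstring unit), so the result may differ there.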

The whole thing is neither logical nor consistent, so we don't even have to drag in Delphi compatibility (with BOTH old and >=D2009 versions, btw), but that is also a factor.


That said, Mattias has repeatedly confirmed that this situation is a temporary, transitional hack till we get >D2009 compatible.
« Last Edit: August 14, 2018, 01:32:58 pm by marcov »

PascalDragon

  • Hero Member
  • *****
  • Posts: 5486
  • Compiler Developer
Re: WideString to AnsiString bug
« Reply #14 on: August 14, 2018, 01:55:38 pm »
Quote
And yes, I know, this is not Delphi compatible (where string = widestring throughout RTL and VCL).
(it's String = UnicodeString, not WideString  :-[ )

 
