Author Topic: Issues with CP_UTF7  (Read 985 times)

Bart

  • Hero Member
  • *****
  • Posts: 3547
    • Bart en Mariska's Webstek
Issues with CP_UTF7
« on: May 12, 2019, 10:15:14 am »
Hi,

I stumbled upon some issues with CP_UTF7 encoded strings:
  • You cannot assign a literal string to a CP_UTF7 encoded string type
  • Conversion from CP_UTF7 to any other type of AnsiString results in an empty string
  • Conversion from CP_UTF7 to ShortString treats encoded bytes as ASCII, but does not decode the string

See the following example:
Code: Pascal  [Select]
  1. program cps;
  2.  
  3. {$mode objfpc}
  4. {$h+}
  5.  
  6. uses
  7.   sysutils;
  8.  
  9. type
  10.   AsciiString = type AnsiString(CP_ASCII);
  11.   Utf7String  = type AnsiString(CP_UTF7);
  12.  
  13. var
  14.   U7: Utf7String;
  15.   U8: Utf8String;
  16.   A: AsciiString;
  17.   S,S2: String;
  18.   SS: ShortString;
  19.  
  20. function StrToHex(const S: rawbytestring): string;
  21. var
  22.   sd: ShortString;
  23.   i: Integer;
  24. begin
  25.   sd := format('[%5d] ',[StringCodePage(S)]);
  26.   for i := 1 to Length(s) do
  27.     sd := sd + '$' + IntToHex(Byte(s[i]), 2) + ' ';
  28.   sd := trim(sd);
  29.   result := sd;
  30. end;
  31.  
  32. function StrToHex(const S: UnicodeString): shortstring;
  33. var
  34.   sd: ShortString;
  35.   i: Integer;
  36. begin
  37.   sd := '';
  38.   for i := 1 to Length(s) do
  39.     sd := sd + '$' + IntToHex(Word(s[i]), 4) + ' ';
  40.   sd := trim(sd);
  41.   result := sd;
  42. end;
  43.  
  44. begin
  45.   //U7 := 'U7';     //cps.lpr(45,9) Error: Unknown codepage "65000"
  46.   repeat
  47.     write('S: ');
  48.     readln(S);
  49.     U7 := S;
  50.     U8 := S;
  51.     A := S;
  52.     writeln('S : ',StrToHex(S),' [',S,']');
  53.     writeln('U8: ',StrToHex(U8),' [',U8,']');
  54.     writeln('U7: ',StrToHex(U7),' [',U7,']');
  55.     writeln('A : ',StrToHex(A),' [',A,']');
  56.     SS := U7;
  57.     S2 := U7;
  58.     U8 := U7;
  59.     writeln('Utf7 -> ShortString: ',StrToHex(SS),' [',SS,']');
  60.     writeln('Utf7 -> CP_ACP     : ',StrToHex(S2),' [',S2,']');
  61.     writeln('Utf7 -> CP_UTF8    : ',StrToHex(U8),' [',U8,']');
  62.   until S='';
  63. end.

And this sample run (input and output):

Code: [Select]
C:\Users\Bart\LazarusProjecten\bugs\Console\cpstring>cps
S: 1 + 1 = 2
S : [ 1252] $31 $20 $2B $20 $31 $20 $3D $20 $32 [1 + 1 = 2]
U8: [65001] $31 $20 $2B $20 $31 $20 $3D $20 $32 [1 + 1 = 2]
U7: [65000] $31 $20 $2B $2D $20 $31 $20 $2B $41 $44 $30 $2D $20 $32 []
A : [20127] $31 $20 $2B $20 $31 $20 $3D $20 $32 [1 + 1 = 2]
Utf7 -> ShortString: [ 1252] $31 $20 $2B $2D $20 $31 $20 $2B $41 $44 $30 $2D $20 $32 [1 +- 1 +AD0- 2]
Utf7 -> CP_ACP     : [ 1252] []
Utf7 -> CP_UTF8    : [ 1252] []

According to https://en.wikipedia.org/wiki/UTF-7 the string '1 + 1 = 2' should indeed be encoded as '1 +- 1 +AD0- 2' in CP_UTF7 (so $31 $20 $2B $2D $20 $31 $20 $2B $41 $44 $30 $2D $20 $32 is correct).
The decoding however should IMO result in the original string (given that the input is pure ASCII, encode/decode should be lossless).
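
For reference, a minimal sketch of where the '+AD0-' comes from. This is not RTL code: Utf7Shift is a hypothetical helper written only for illustration. It base64-encodes the big-endian UTF-16 code units of a run (without '=' padding) and wraps the result in '+' and '-'. The '+-' in the dump is the separate special case for a literal '+', which the sketch does not handle.

```pascal
program utf7demo;

{$mode objfpc}{$h+}

const
  // Standard base64 alphabet; UTF-7 uses it without '=' padding.
  B64: string = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Hypothetical helper: encode a run of UTF-16 code units as one
// UTF-7 shifted sequence: '+' + modified base64 + '-'.
function Utf7Shift(const W: UnicodeString): string;
var
  i: Integer;
  bits: LongWord;   // bit accumulator
  nbits: Integer;   // number of valid bits in the accumulator
begin
  Result := '+';
  bits := 0;
  nbits := 0;
  for i := 1 to Length(W) do
  begin
    // Append the 16 bits of the next code unit (big-endian order).
    bits := (bits shl 16) or Word(W[i]);
    Inc(nbits, 16);
    // Emit one base64 character per complete group of 6 bits.
    while nbits >= 6 do
    begin
      Dec(nbits, 6);
      Result := Result + B64[((bits shr nbits) and $3F) + 1];
    end;
    // Keep only the leftover bits for the next round.
    bits := bits and ((LongWord(1) shl nbits) - 1);
  end;
  // Pad the last partial group with zero bits on the right.
  if nbits > 0 then
    Result := Result + B64[((bits shl (6 - nbits)) and $3F) + 1];
  Result := Result + '-';
end;

begin
  writeln(Utf7Shift('='));
end.
```

For '=' (U+003D) this prints +AD0-, which matches the bytes $2B $41 $44 $30 $2D in the dump above.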

Since I have no Delphi to test this I do not know if this behaviour is Delphi compatible.
IMO it's a bit inconsistent regardless of whether the Greek did it the same way.

Bart

ASerge

  • Hero Member
  • *****
  • Posts: 1422
Re: Issues with CP_UTF7
« Reply #1 on: May 12, 2019, 02:48:15 pm »
On Windows, this function is called when converting from one code page to another (syswin.inc):
Code: Pascal  [Select]
  1. procedure Win32Ansi2UnicodeMove(source:pchar;cp : TSystemCodePage;var dest:UnicodeString;len:SizeInt);
  2. var
  3.   destlen: SizeInt;
  4.   dwflags: DWORD;
  5. begin
  6.   // retrieve length including trailing #0
  7.   // not anymore, because this must also be usable for single characters
  8.   if cp=CP_UTF8 then
  9.     dwFlags:=0
  10.   else
  11.     dwFlags:=MB_PRECOMPOSED;
  12.   destlen:=MultiByteToWideChar(cp, dwFlags, source, len, nil, 0);
  13.   // this will null-terminate
  14.   setlength(dest, destlen);
  15.   if destlen>0 then
  16.     begin
  17.       MultiByteToWideChar(cp, dwFlags, source, len, @dest[1], destlen);
  18.       PUnicodeRec(pointer(dest)-UnicodeFirstOff)^.CodePage:=CP_UTF16;
  19.     end;
  20. end;
See line 8: the flag is zero for CP_UTF8 only. But according to the Microsoft documentation "...dwFlags must be set to 0...  65000 (UTF-7)...   Otherwise, the function fails with ERROR_INVALID_FLAGS".
I stepped through it in the debugger up to the point of return. Indeed, MultiByteToWideChar returns 0 and GetLastError=1004 (ERROR_INVALID_FLAGS).
The fix in the Win32Ansi2UnicodeMove function should be as follows:
Code: Pascal  [Select]
procedure Win32Ansi2UnicodeMove(source:pchar;cp : TSystemCodePage;var dest:UnicodeString;len:SizeInt);
var
  destlen: SizeInt;
  dwflags: DWORD;
begin
  // retrieve length including trailing #0
  // not anymore, because this must also be usable for single characters
  case cp of
    // Under https://docs.microsoft.com/en-us/windows/desktop/api/stringapiset/nf-stringapiset-multibytetowidechar
    CP_UTF8, CP_UTF7, 50220, 50221, 50222, 50225, 50227, 50229, 57002..57011, 42:
      dwFlags:=0
  else
    dwFlags:=MB_PRECOMPOSED;
  end;
  destlen:=MultiByteToWideChar(cp, dwFlags, source, len, nil, 0);
  // this will null-terminate
  setlength(dest, destlen);
  if destlen>0 then
    begin
      MultiByteToWideChar(cp, dwFlags, source, len, @dest[1], destlen);
      PUnicodeRec(pointer(dest)-UnicodeFirstOff)^.CodePage:=CP_UTF16;
    end;
end;
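
The ERROR_INVALID_FLAGS behaviour can also be checked outside the RTL. Below is a minimal, Windows-only sketch that assumes the Windows unit's declarations of MultiByteToWideChar, GetLastError and MB_PRECOMPOSED. The names UTF7_CODEPAGE, Sample and TryFlags are made up for this example and do not appear in syswin.inc.

```pascal
program mbflags;

{$mode objfpc}{$h+}

uses
  Windows;

const
  UTF7_CODEPAGE = 65000;                  // same value as CP_UTF7
  Sample: AnsiString = '1 +- 1 +AD0- 2';  // UTF-7 form of '1 + 1 = 2'

// Hypothetical helper: ask Windows how many UTF-16 code units the
// sample needs when decoded with the given dwFlags value.
procedure TryFlags(Flags: DWORD);
var
  n: LongInt;
  err: DWORD;
begin
  // Passing nil/0 as destination only queries the required length.
  n := MultiByteToWideChar(UTF7_CODEPAGE, Flags, PAnsiChar(Sample),
                           Length(Sample), nil, 0);
  // Read the error code right away, before any other API call can change it.
  err := GetLastError;
  if n = 0 then
    writeln('dwFlags=', Flags, ': failed, GetLastError=', err)
  else
    writeln('dwFlags=', Flags, ': ', n, ' UTF-16 code units needed');
end;

begin
  TryFlags(MB_PRECOMPOSED);  // what the RTL passed for CP_UTF7 before the fix
  TryFlags(0);               // what the documentation requires for CP_UTF7
end.
```

Going by the debugger observation above, the MB_PRECOMPOSED call is expected to fail with error 1004, while the call with dwFlags 0 should report a non-zero length.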

ASerge

Re: Issues with CP_UTF7
« Reply #2 on: May 12, 2019, 03:05:32 pm »
Checked with Delphi. Everything works correctly there. By the way, Delphi always calls MultiByteToWideChar with the flag set to 0, regardless of the code page.

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7605
Re: Issues with CP_UTF7
« Reply #3 on: May 12, 2019, 04:19:07 pm »
Quote from: ASerge
  On Windows, this function is called when converting from one code page to another (syswin.inc):
  The fix in the Win32Ansi2UnicodeMove function should be as follows:

r42043
 

Bart

Re: Issues with CP_UTF7
« Reply #4 on: May 12, 2019, 10:09:48 pm »
Not fixed
Code: Pascal  [Select]
U7 := 'U7';     //cps.lpr(45,9) Error: Unknown codepage "65000"

Also, the conversion to ShortString is still wrong IMO:
Code: [Select]
S: 1 + 1 = 2
...
U7: [65000] $31 $20 $2B $2D $20 $31 $20 $2B $41 $44 $30 $2D $20 $32 [1 + 1 = 2]
...
Utf7 -> ShortString: [ 1252] $31 $20 $2B $2D $20 $31 $20 $2B $41 $44 $30 $2D $20 $32 [1 +- 1 +AD0- 2]
Utf7 -> CP_ACP     : [ 1252] $31 $20 $2B $20 $31 $20 $3D $20 $32 [1 + 1 = 2]
Utf7 -> CP_UTF8    : [65001] $31 $20 $2B $20 $31 $20 $3D $20 $32 [1 + 1 = 2]

Bart
« Last Edit: May 12, 2019, 10:21:06 pm by Bart »

marcov

Re: Issues with CP_UTF7
« Reply #5 on: May 12, 2019, 10:24:15 pm »
Maybe you need to load a codepage with -Fc ?

Bart

Re: Issues with CP_UTF7
« Reply #6 on: May 12, 2019, 10:40:54 pm »
Why would that be necessary?
The entire source code is plain ASCII.

Code: [Select]
C:\Users\Bart\LazarusProjecten\bugs\Console\cpstring>fpc -FcUTF7 cps.lpr
Error: Unknown codepage "UTF7"
Error: C:\pp\bin\i386-win32\ppc386.exe returned an error exitcode

C:\Users\Bart\LazarusProjecten\bugs\Console\cpstring>fpc -Fc65000 cps.lpr
Error: Unknown codepage "65000"
Error: C:\pp\bin\i386-win32\ppc386.exe returned an error exitcode

C:\Users\Bart\LazarusProjecten\bugs\Console\cpstring>fpc -FcUTF-7 cps.lpr
Error: Unknown codepage "UTF-7"
Error: C:\pp\bin\i386-win32\ppc386.exe returned an error exitcode

And it gets even better:
Code: [Select]
C:\Users\Bart\LazarusProjecten\bugs\Console\cpstring>fpc -FcUTF8 cps.lpr
Free Pascal Compiler version 3.3.1 [2019/05/12] for i386
Copyright (c) 1993-2018 by Florian Klaempfl and others
Target OS: Win32 for i386
Compiling cps.lpr
cps.lpr(45,9) Error: Unknown codepage "65000"
cps.lpr(45,9) Error: Compilation raised exception internally
Fatal: Compilation aborted
An unhandled exception occurred at $00477B14:
EAccessViolation: Access violation
  $00477B14  GETASCII,  line 697 of C:/devel/fpc/trunk/rtl/inc/charset.pp
  $004C3954  TTYPECONVNODE__SIMPLIFY,  line 2926 of ncnv.pas
  $004C27F9  TTYPECONVNODE__PASS_TYPECHECK,  line 2426 of ncnv.pas
  $004CC597  TYPECHECKPASS_INTERNAL,  line 81 of pass_1.pas
  $004BD2E7  INSERTTYPECONV,  line 380 of ncnv.pas
  $004CC597  TYPECHECKPASS_INTERNAL,  line 81 of pass_1.pas
  $00552EFB  STATEMENT_BLOCK,  line 1367 of pstatmnt.pas
  $00538079  BLOCK,  line 381 of psub.pas
  $00439919  COMPILE,  line 395 of parser.pas
  $00416674  COMPILE,  line 278 of compiler.pas

Bart

marcov

Re: Issues with CP_UTF7
« Reply #7 on: May 13, 2019, 10:26:39 am »
Quote from: Bart
  Why would that be necessary?

To encode literals in a certain encoding, the compiler needs tables. A few common ones are built in; the weird ones are not. I believe the tables can be made using utils/creumap or something.

Bart

Re: Issues with CP_UTF7
« Reply #8 on: May 13, 2019, 11:28:03 am »
Quote from: marcov
  Quote from: Bart
    Why would that be necessary?

  To encode literals in a certain encoding, the compiler needs tables. A few common ones are built in; the weird ones are not. I believe the tables can be made using utils/creumap or something.

If that were the case, wouldn't assigning a CP_ACP string to a CP_UTF7 string also have to fail?
But the example shows that the contents of the CP_UTF7 string are correct after assigning.

I agree that CP_UTF7 is a weird one  :)

Bart

marcov

Re: Issues with CP_UTF7
« Reply #9 on: May 13, 2019, 11:35:12 am »
Quote from: Bart
  If that were the case, wouldn't assigning a CP_ACP string to a CP_UTF7 string also have to fail?
  But the example shows that the contents of the CP_UTF7 string are correct after assigning.

But that might not be handled at compile time; the run-time conversions go through the OS/iconv conversion routines. Check the generated assembler to see whether the compiler handles a given conversion or not.
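
A minimal sketch of the run-time path, assuming SetCodePage and CP_UTF7 from the System unit (the program and variable names are made up for illustration). The literal is plain ASCII and is stored as-is. The conversion to code page 65000 only happens when SetCodePage runs, so the compiler itself should not need any UTF-7 table for this.

```pascal
program u7runtime;

{$mode objfpc}{$h+}

uses
  SysUtils;

var
  R: RawByteString;
  i: Integer;
begin
  R := '1 + 1 = 2';               // plain ASCII literal, no UTF-7 at compile time
  SetCodePage(R, CP_UTF7, True);  // Convert=True: re-encode the bytes at run time
  // Dump code page and bytes, in the same spirit as StrToHex in the first post.
  write('[', StringCodePage(R), ']');
  for i := 1 to Length(R) do
    write(' $', IntToHex(Byte(R[i]), 2));
  writeln;
end.
```

On Windows this goes through the OS conversion routines, like 'U7 := S' in the first post. On Unix-like systems a widestring manager such as cwstring would be needed. The exact bytes printed depend on what the underlying converter produces.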

Note that this is not my strong point; my comments in this thread are intuitions about things/directions to check, not absolute truths.

As far as I can see utf7 is next to useless, but testing the codepage system is not useless, so I go along with it :-)

Bart

Re: Issues with CP_UTF7
« Reply #10 on: May 13, 2019, 01:26:39 pm »
Should I ask on fpc-devel ML?

Bart

marcov

Re: Issues with CP_UTF7
« Reply #11 on: May 13, 2019, 02:07:57 pm »
If you can't check the asm yourself, maybe that is better. You can also make two examples (acs:=utf7 and vice versa), and I'll check.

But if utf7 can't be implemented using lookup tables, it will be hard. The mechanism is simpler in the compiler.

Bart

Re: Issues with CP_UTF7
« Reply #12 on: May 13, 2019, 02:30:12 pm »
Quote from: marcov
  If you can't check the asm yourself, maybe that is better. You can also make two examples (acs:=utf7 and vice versa), and I'll check.

Well, it does not compile, so there is no assembler output I would think.

And my assembler knowledge is zero (at best).

Bart

marcov

Re: Issues with CP_UTF7
« Reply #13 on: May 13, 2019, 02:44:54 pm »
Quote from: Bart
  Quote from: marcov
    If you can't check the asm yourself, maybe that is better. You can also make two examples (acs:=utf7 and vice versa), and I'll check.

  Well, it does not compile, so there is no assembler output I would think.

  And my assembler knowledge is zero (at best).

It might not be just the literal conversion tables; it could also be that the compiler has a hard limit on the allowed mappings somewhere (e.g. the ones that come from the Unicode Consortium tables).

Better ask on fpc-devel.

Bart

Re: Issues with CP_UTF7
« Reply #14 on: May 13, 2019, 10:37:56 pm »
Quote from: marcov
  It might not be just the literal conversion tables; it could also be that the compiler has a hard limit on the allowed mappings somewhere (e.g. the ones that come from the Unicode Consortium tables).

  Better ask on fpc-devel.

I did.

I also reported the compiler crash in the bugtracker.

Bart