Author Topic: Encoding problems  (Read 4767 times)

firosiro

  • Newbie
  • Posts: 1
Encoding problems
« on: September 23, 2017, 05:26:07 pm »
..
« Last Edit: January 15, 2018, 07:56:39 am by firosiro »

fred

  • Full Member
  • ***
  • Posts: 205
Re: Encoding problems
« Reply #1 on: September 23, 2017, 05:45:19 pm »
You could try the codepage conversion in unit LConvEncoding with a function like CP1252ToUTF8() as in
Code: Pascal
ShowMessage(CP1252ToUTF8(s));

Edit: See also http://wiki.freepascal.org/Unicode_Support_in_Lazarus#Reading_.2F_writing_text_file_with_Windows_codepage
« Last Edit: September 23, 2017, 06:07:06 pm by fred »

Thaddy

  • Hero Member
  • *****
  • Posts: 18704
  • To Europe: simply sell USA bonds: dollar collapses
Re: Encoding problems
« Reply #2 on: September 23, 2017, 05:57:06 pm »
Note that Lazarus 1.6.2 isn't exactly up to date (it is very old), so this might have been fixed already.
Note 2: we can't properly help when we do not know the FPC version.
Always give us the FPC version AND the Lazarus version.

Note: versions that are out of maintenance will not receive fixes (except in some particular security-related circumstances, I guess).

You may want to declare your string type as AnsiString. Lazarus has a rather opinionated implementation of what string should mean (UTF8).
In your case, I suggest replacing string with AnsiString (or UnicodeString).
« Last Edit: September 23, 2017, 06:10:11 pm by Thaddy »
If Europe sells their USA bonds the USD will collapse. Europe can afford that given average state debts. The USA can't afford that. Just some advice...

wp

  • Hero Member
  • *****
  • Posts: 13334
Re: Encoding problems
« Reply #3 on: September 23, 2017, 06:35:07 pm »
Your old version, Laz 1.4.4, was packaged with fpc 2.6.4, and your "new" (outdated now...) version 1.6.2 came with fpc 3.0.x (x=0 or 2). Between these two versions, fpc changed string handling by introducing code-page-aware strings (http://wiki.freepascal.org/User_Changes_3.0#AnsiStrings_are_now_codepage-aware). As a consequence, some old functions no longer work as expected: UTF8Encode, UTF8Decode, AnsiToUtf8, SysToUtf8, etc. Read the wiki article already cited by fred. Search the forum; there are several posts in which Juha explained the new strings.

For conversion from Ansi to UTF8 you now have these options:
- Declare a string type for your particular code page and assign the byte array to it.
- Use explicit conversion functions; a lot of them can be found in LConvEncoding and LazUtf8.

See this (working) code snippet:
Code: Pascal
uses
  LazUtf8, LConvEncoding;

{ TForm1 }

procedure TForm1.Button1Click(Sender: TObject);
type
  string1252 = type ansistring(1252);
var
  a: array of byte;
  s: String;
  s1252: string1252;
  sa: ansistring;
begin
  SetLength(a, 1);
  a[0] := $e4;
  SetString(s, PAnsiChar(@a[0]), Length(a));

  Memo1.Lines.Add('OLD CODE');
  // old code -- no longer working
  Memo1.Lines.Add(s);
  Memo1.Lines.Add('UTF8Encode --> ' + UTF8Encode(s));

  // new code
  Memo1.Lines.Add('');
  Memo1.Lines.Add('NEW CODE');
  SetString(s1252, PAnsiChar(@a[0]), Length(a));
  Memo1.Lines.Add('type ansistring(1252) --> ' + s1252);

  SetString(sa, PAnsiChar(@a[0]), Length(a));
  Memo1.Lines.Add('Cp1252ToUTF8 --> ' + Cp1252ToUTF8(sa)); // Needs LConvEncoding
  Memo1.Lines.Add('WinCPToUTF8 --> ' + WinCPToUTF8(sa));   // Needs LazUtf8

  Memo1.Lines.Add('Cp1252ToUTF8 --> ' + Cp1252ToUTF8(s));  // Needs LConvEncoding
  Memo1.Lines.Add('WinCPToUTF8 --> ' + WinCPToUTF8(s));    // Needs LazUtf8
end;

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4660
  • I like bugs.
Re: Encoding problems
« Reply #4 on: September 23, 2017, 07:05:09 pm »
@firosiro, just use Unicode everywhere and you are good.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

wp

  • Hero Member
  • *****
  • Posts: 13334
Re: Encoding problems
« Reply #5 on: September 23, 2017, 07:35:19 pm »
Juha, this is not always possible. firosiro's code, which converts a byte array to a string, suggests that this is one of those hardware-driven/file-format-driven conversion cases in which the automatic handling of fpc 3 fails.

Thaddy

  • Hero Member
  • *****
  • Posts: 18704
  • To Europe: simply sell USA bonds: dollar collapses
Re: Encoding problems
« Reply #6 on: September 23, 2017, 08:18:02 pm »
Yup.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Encoding problems
« Reply #7 on: September 24, 2017, 03:02:11 pm »
Quote from: wp on September 23, 2017, 07:35:19 pm
Juha, this is not always possible. firosiro's code, which converts a byte array to a string, suggests that this is one of those hardware-driven/file-format-driven conversion cases in which the automatic handling of fpc 3 fails.

I disagree.

In your example referring to "old code":
Code: Pascal
var
  a: array of byte;
  s: String;
begin
  SetLength(a, 1);
  a[0] := $e4;
  SetString(s, PAnsiChar(@a[0]), Length(a));

  Memo1.Lines.Add('OLD CODE');
  // old code -- no longer working
  Memo1.Lines.Add(s);
  Memo1.Lines.Add('UTF8Encode --> ' + UTF8Encode(s));

Variable s needs its correct code page. Simply set the correct code page:
Code: Pascal
var
  a: array of byte;
  s: String;
begin
  SetLength(a, 1);
  a[0] := $e4;
  SetString(s, PAnsiChar(@a[0]), Length(a));
  SetCodePage(RawByteString(s), 1252, False);

Then you'll notice that both Adds will work:
Code: Pascal
Memo1.Lines.Add(s);
Memo1.Lines.Add('UTF8Encode --> ' + UTF8Encode(s));

The first Add worked by accident: Lazarus expects UTF8 strings. Under Windows, at least, it uses UTF8ToUTF16 without checking the code page of the string. This is possibly a bug; a typecast to UnicodeString might be better.

Only the second line would work when using ShowMessage, as in the first post of this thread and as firosiro expects based on the comments:
Code: Pascal
ShowMessage(s); //ofcourse not work, OK!
ShowMessage(UTF8Encode(s)); //not work, why???
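
As a rough illustration of the UnicodeString typecast mentioned above, here is a minimal console sketch (not LCL code, and not taken from any post in this thread). It assumes FPC 3.0 or later. The program name is made up, and the cwstring unit is pulled in on Unix on the assumption that a widestring manager is needed there for the CP1252 conversion. The idea is that the conversion to UnicodeString goes by the string's dynamic code page, so the byte $E4 tagged as 1252 should come out as the code point of "ä" (U+00E4) rather than being treated as UTF8.

```pascal
program CastByCodePage;  // hypothetical name, for illustration only
{$mode objfpc}{$H+}

uses
  {$ifdef unix}
  cwstring,  // assumption: installs a widestring manager on Unix
  {$endif}
  SysUtils;

var
  s: RawByteString;
  u: UnicodeString;
begin
  // One raw byte, as it might arrive from a device or a file
  SetLength(s, 1);
  s[1] := #$E4;

  // Tag the bytes as CP1252 without converting them
  SetCodePage(s, 1252, False);

  // The conversion looks at the dynamic code page of s
  u := UnicodeString(s);

  WriteLn('Code page of s: ', StringCodePage(s));
  WriteLn('First UTF-16 unit of u: $', HexStr(Ord(u[1]), 4));
end.
```

If the SetCodePage call is left out, the result depends on whatever code page the string happens to carry, which is the situation discussed in this thread.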

wp

  • Hero Member
  • *****
  • Posts: 13334
Re: Encoding problems
« Reply #8 on: September 24, 2017, 03:52:54 pm »
I disagree, too.

I was saying, in other words, that this might be a case in which FPC will not get the right encoding automatically. By "automatically" I mean: just declare a variable as "string", nothing else. You are just confirming my statement, because you add extra code specifying the codepage. Your code does not even compile with fpc 2.6.4 (Laz 1.4.4); therefore it does not belong in the section "Old code". It belongs in "New code" and is a third option for handling the encoding correctly today. Thanks for bringing this up.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Encoding problems
« Reply #9 on: September 24, 2017, 04:41:14 pm »
Quote from: wp on September 24, 2017, 03:52:54 pm
I disagree, too.

I was saying, in other words, that this might be a case in which FPC will not get the right encoding automatically. By "automatically" I mean: just declare a variable as "string", nothing else. You are just confirming my statement, because you add extra code specifying the codepage. Your code does not even compile with fpc 2.6.4 (Laz 1.4.4); therefore it does not belong in the section "Old code". It belongs in "New code" and is a third option for handling the encoding correctly today. Thanks for bringing this up.

OK, I will not disagree, but I am not sure why you think it is an FPC problem. First, FPC trusts the DefaultSystemCodePage value to be correct, while LCL changes it to UTF8. Second, when you declare a variable as "string", that means, for LCL, UTF8 encoding. Assigning a non-UTF8 value, like #$e4, is an error.
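
To see which code page a plain "string" actually carries, something like the following console sketch could be used. It is only an illustration under assumptions: a plain FPC 3.x program without the LCL, with a made-up program name. In such a program DefaultSystemCodePage is whatever the RTL picked up from the system, whereas in an LCL application it would normally be UTF8, as described above.

```pascal
program ShowCodePages;  // hypothetical name, for illustration only
{$mode objfpc}{$H+}

var
  a: array of Byte;
  s: String;
begin
  // Same byte array as in the snippets earlier in this thread
  SetLength(a, 1);
  a[0] := $E4;

  // SetString has no code page parameter, so s is tagged according to
  // its declared type (plain "string"), not according to the real
  // encoding of the bytes
  SetString(s, PAnsiChar(@a[0]), Length(a));

  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);
  WriteLn('StringCodePage(s):     ', StringCodePage(s));
end.
```

If both numbers match, the byte $E4 has simply been labelled with the default code page, whether or not that is its real encoding.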

Actually, having a procedure like SetString that does not take the code page as one of its parameters does not make sense either. In fact, FPC has a compiler procedure named fpc_setstring_ansistr_pansichar that could be exposed, as in:
Code: Pascal
procedure SetStringCP(out S: RawByteString; Buf: PAnsiChar; Len: SizeInt;
  cp: TSystemCodePage); external name 'fpc_setstring_ansistr_pansichar';

Then simply:
Code: Pascal
SetStringCP(RawByteString(s), PAnsiChar(@a[0]), Length(a), 1252);

And to satisfy the LCL's UTF8 requirement:
Code: Pascal
ShowMessage(UTF8String(s));
or any other method.

By the way, IMHO, WinCPToUTF8 is the worst to use.

 
