Author Topic: Help reading an ANSI text file correctly...?  (Read 22492 times)

Espectr0

  • Full Member
  • ***
  • Posts: 218
Re: Help reading an ANSI text file correctly...?
« Reply #15 on: February 05, 2016, 11:59:32 am »
Why does Lazarus' "TEncoding.GetBufferEncoding" detect ANSI as Unicode? In Delphi it detects it correctly.
Also, in Delphi, if I pass "TEncoding.ANSI" as the default encoding it reads the buffer as ANSI, but not in Lazarus.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Help reading an ANSI text file correctly...?
« Reply #16 on: February 05, 2016, 12:04:05 pm »

Quote
...And I also wonder why AnsiToUtf8 or SysToUtf8 are not working (neither on FPC 2.6.4 nor on FPC 3.0).

Quote
AnsiToUtf8 does not work properly under the UTF-8 enabled LCL.
Try the "disableutf8rtl" switch.

According to wp it does not work with FPC 2.6.4 either. Thus the reason is not UTF-8 enabled RTL because it cannot be enabled for FPC 2.6.4.
Also the function WinCPToUTF8 uses AnsiToUTF8. Not on Windows.
I would also like to understand why it does not work with FPC 2.6.4. Was ISO_8859_15 really the Windows system codepage in this case? Maybe not.
But yes, AnsiToUTF8 has no effect with FPC 3.x using the new UTF-8 system. See
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus

[Edit]
The wiki page has a section for Windows codepage:
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#Reading_text_file_with_Windows_codepage
but it has only a little information.
Maybe it should give an example of how to get or guess the system encoding.
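
As a rough illustration of querying the system encoding (not taken from the wiki page), the sketch below prints the RTL variable DefaultSystemCodePage and, on Windows only, the value of Windows.GetACP. It assumes FPC 3.0 or later. In a Lazarus application that uses the new UTF-8 mode, DefaultSystemCodePage is expected to report UTF-8 rather than the Windows codepage, so the two values can differ.

```pascal
program ShowCodePages;
{$mode objfpc}{$H+}

{$ifdef windows}
uses
  Windows;  // GetACP is only available on Windows
{$endif}

begin
  // Code page the RTL currently uses for String/AnsiString(CP_ACP)
  WriteLn('DefaultSystemCodePage: ', DefaultSystemCodePage);
  {$ifdef windows}
  // ANSI code page configured in the Windows locale settings
  WriteLn('Windows.GetACP:        ', GetACP);
  {$endif}
end.
```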

Adding this here ...
Quote
Why does Lazarus' "TEncoding.GetBufferEncoding" detect ANSI as Unicode? In Delphi it detects it correctly.
Also, in Delphi, if I pass "TEncoding.ANSI" as the default encoding it reads the buffer as ANSI, but not in Lazarus.

The new UTF-8 support is not Delphi compatible. It is kind of a semi-hack (*) and sets the default encoding of AnsiString to UTF-8.
It means the Windows system codepage is gone.
If your code depends a lot on system codepage then you may have to disable the new UTF-8 system. See:
  http://wiki.freepascal.org/Lazarus_with_FPC3.0_without_UTF-8_mode
However I recommend encapsulating the conversions for I/O and then using UTF-8 everywhere.

(*) Semi-hack means it is less of a hack than the old UTF-8 support was. Earlier explicit conversion functions were needed often. Now conversions happen automatically once a String's dynamic codepage is set right.
In fact the new system looks amazingly Delphi compatible if you only deal with Unicode. Typical code only seldom deals with individual codepoints beyond lower ASCII area.
« Last Edit: February 05, 2016, 03:26:43 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Espectr0

  • Full Member
  • ***
  • Posts: 218
Re: Help reading an ANSI text file correctly...?
« Reply #17 on: February 05, 2016, 01:11:19 pm »
I found a solution (in my case)!

My original code works great for all text files except ANSI:

Code: Pascal
...
var
  Stream   : TStream;
  Size     : Integer;
  Buffer   : TBytes;
  FileName : String;
  Encoding : TEncoding;
begin
  Encoding := nil;  // nil lets GetBufferEncoding pick the encoding
  FileName := 'd:\textfile.txt';

  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // Read the rest of the file into a byte buffer
    Size := Stream.Size - Stream.Position;
    SetLength(Buffer, Size);
    Stream.Read(Buffer[0], Size);
    // Detect the encoding from a BOM; the third argument is the fallback.
    // The return value is the number of BOM bytes to skip.
    Size := TEncoding.GetBufferEncoding(Buffer, Encoding, TEncoding.ANSI);
    // Decode everything after the BOM
    Memo1.Text := Encoding.GetString(Buffer, Size, Length(Buffer) - Size);
  finally
    Stream.Free;
  end;
...

but if I change the line:

Code: Pascal
Size := TEncoding.GetBufferEncoding(Buffer, Encoding, TEncoding.ANSI);

to

Code: Pascal
Size := TEncoding.GetBufferEncoding(Buffer, Encoding, TEncoding.GetEncoding(1252));

it works great for all the text files I tested (ANSI, UTF-8 and Unicode)! :D

wp

  • Hero Member
  • *****
  • Posts: 11916
Re: Help reading an ANSI text file correctly...?
« Reply #18 on: February 05, 2016, 03:28:02 pm »
As noted above, it is also confusing that it is not possible to read the file directly into the lines of the memo by calling Memo.Lines.LoadFromFile() and do the conversion afterwards. I tried all codepages defined in LConvEncoding, and none of them produces the correct result. Only if I read the file into a separate string, convert it and then assign it to the memo lines is the encoding correct. If I assume that the string is UTF8 I get the same (wrong) result in both cases.

Therefore, I'd speculate that TMemo implicitly converts a string to UTF8. In the sources of TMemo, I see that its Lines are of type TTextStrings, which use a TFileStreamUtf8 for reading and writing - but I thought that the "utf8" in this stream name refers to the file name only, not to the contents.

Could somebody please explain?

If someone wants to repeat these experiments please see the attached demo.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Help reading an ANSI text file correctly...?
« Reply #19 on: February 05, 2016, 03:49:00 pm »
Quote
Work great for all text files tested for me (ansi, utf8 and unicode)! :D

Well, UTF-8 is part of Unicode, one of its encodings.
Your code is overly complex for such a simple task, reading a text file.
The wiki link I gave earlier says:
Code: Pascal
SetCodePage(RawByteString(StrIn), 1252, false);  // 1252 always !! (or Windows.GetACP())
and it is actually all you need. Then read text into the string and it will be converted automatically to UTF-8 when needed.
You could also test how well Windows.GetACP() works.

Quote
Therefore, I'd speculate that TMemo implicitly converts a string to UTF8. In the sources of TMemo, I see that its Lines are of type TTextStrings, which use a TFileStreamUtf8 for reading and writing - but I thought that the "utf8" in this stream name refers to the file name only, not to the contents.

Could somebody please explain?

No, I cannot explain, but TMemo maps to a native system GUI component. Anything can happen.
Just use a String with the right encoding and you are good.

Quote
If someone wants to repeat these experiments please see the attached demo.

I don't even have Windows currently.
Other people should take initiative and test + improve the wiki if needed.
« Last Edit: February 05, 2016, 03:52:08 pm by JuhaManninen »

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: Help reading an ANSI text file correctly...?
« Reply #20 on: February 05, 2016, 06:53:04 pm »
Quote
Therefore, I'd speculate that TMemo implicitly converts a string to UTF8.

Don't speculate, look at the sources.
By default LCL treats all strings as having a UTF8 encoding.
It does not check this.
It's how it was designed (before cp aware strings).

If you supply strings to the LCL that are not UTF8 you'll have to encode them to UTF8 first.
This is why we have SysToUtf8 (and other conversion routines).

When reading text from a file (LoadFromFile), the LCL does not know what encoding is used.
It's your responsibility to make sure it'll be handled correctly.
You can happily do a TStrings.LoadFromFile on a binary file.
The contents will make no sense of course, but still you can.

When dealing with the OS/WidgetSet (e.g. when putting the text in a Memo and showing it on a form) the LCL will offer strings to the Windows API as WideStrings.
It therefore explicitly converts the strings (which, as I pointed out earlier, are treated as UTF8) to WideString using Utf8ToUTF16 or Utf8Decode.

And this will give unexpected results if the strings you provide (read from the file) are not UTF8.
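
A minimal sketch of the "encode to UTF8 first" step described above, assuming the file is known to be Windows-1252 and that the LazUtils package (unit LConvEncoding) is available to the project. The function name LoadCP1252FileAsUTF8 is hypothetical. It reads the raw bytes and converts them with the explicit CP1252ToUTF8 routine, so the result can be handed to the LCL as UTF-8.

```pascal
program ReadCp1252;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils,
  LConvEncoding;  // part of the LazUtils package (assumed to be on the unit path)

// Hypothetical helper: read the whole file as raw bytes, then convert
// those bytes from Windows-1252 to UTF-8 with an explicit call.
function LoadCP1252FileAsUTF8(const AFileName: string): string;
var
  Stream: TFileStream;
  Raw: string;
begin
  Raw := '';
  Stream := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Raw, Stream.Size);
    if Length(Raw) > 0 then
      Stream.ReadBuffer(Raw[1], Length(Raw));
  finally
    Stream.Free;
  end;
  Result := CP1252ToUTF8(Raw);
end;

begin
  if ParamCount < 1 then
    Halt(1);
  // In a GUI application the result would be assigned to e.g. Memo1.Lines.Text
  WriteLn(LoadCP1252FileAsUTF8(ParamStr(1)));
end.
```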

Quote
In the sources of TMemo, I see that its Lines are type TTextStrings which use a TFileStreamUtf8 for reading and writing - but I thought that the "utf8" in this stream name refers to the file name only, not to the contents.

That is correct.

Quote
Could somebody please explain?

Hope I did.

Bart

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Help reading an ANSI text file correctly...?
« Reply #21 on: February 05, 2016, 07:32:42 pm »
Quote
If you supply strings to the LCL that are not UTF8 you'll have to encode them to UTF8 first.
This is why we have SysToUtf8 (and other conversion routines).

That is not true any more with the new "better" UTF-8 system.

You don't need to explicitly convert encodings. You cannot even do it with SysToUtf8 any more.
AnsiToUTF8, UTF8ToAnsi, SysToUTF8, UTF8ToSys are all practically no-ops now.
You only must make sure a String's dynamic encoding is correct. Conversions happen automatically then.
The earlier example:
Code: Pascal
SetCodePage(RawByteString(StrIn), 1252, false);  // 1252 always !! (or Windows.GetACP())
does just that. It sets the string's encoding to match with the actual data.
Conversion will happen when the string is assigned to another string with different encoding.
So, after one call to SetCodePage() everything just magically works.

Can somebody verify that Windows.GetACP() works and makes it possible to write code that works for any Windows locale?

wp

  • Hero Member
  • *****
  • Posts: 11916
Re: Help reading an ANSI text file correctly...?
« Reply #22 on: February 05, 2016, 08:41:24 pm »
Juha, I read about this SetCodePage many times. But how do I apply it if I want to read a non-UTF8 file into a stringlist or a memo? The straightforward way is to call StringList.LoadFromFile or even Memo.Lines.LoadFromFile (ok - as Bart explained above the latter one converts internally to UTF8 and in my opinion should NEVER be used with non-UTF8 files).

As I understand from the wiki, SetCodePage has to be called "in advance", i.e. before reading the file into the string. But with the standard StringList.LoadFromFile, the file content is available only as StringList.Text, and I can't call SetCodePage on that. Copy StringList.Text to a string which has been pre-processed by "SetCodePage"? I tried - not converted.

The only thing which is working for me is explicit conversion, like the CP1252ToUTF8 call above.

I had thought I understood the new string system. But obviously I did not...

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Help reading an ANSI text file correctly...?
« Reply #23 on: February 05, 2016, 10:52:47 pm »
@wp, you cannot use StringList.LoadFromFile in that case.
You must read the file contents to a string variable (StrIn), make sure its encoding is right and then do:
Code: Pascal
StringList.Text := StrIn;
Besides, reading first to StringList, then converting StringList.Text and finally assigning back to StringList.Text would be very inefficient because the Text property splits/concatenates lines and allocates potentially many strings without any benefit.
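
The steps above can be sketched roughly as follows, using only the FPC 3.x RTL. This is a variation rather than the exact recipe from the wiki: after labelling the bytes with SetCodePage(..., False), it transcodes explicitly with SetCodePage(..., CP_UTF8, True) instead of relying on an implicit conversion during assignment. The function name LoadFileAsUTF8 is hypothetical, and the code page 1252 is an assumption about the file. On Unix the cwstring unit is included so that a conversion backend is installed.

```pascal
program SetCodePageDemo;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // installs a code page conversion backend on Unix
  Classes, SysUtils;

// Hypothetical helper: read a file whose bytes are in ACodePage and
// return the text as UTF-8.
function LoadFileAsUTF8(const AFileName: string;
  ACodePage: TSystemCodePage): UTF8String;
var
  Stream: TFileStream;
  Raw: RawByteString;
begin
  Raw := '';
  Stream := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Raw, Stream.Size);
    if Length(Raw) > 0 then
      Stream.ReadBuffer(Raw[1], Length(Raw));
  finally
    Stream.Free;
  end;
  // Label only: the bytes stay untouched. Done after reading, on the filled string.
  SetCodePage(Raw, ACodePage, False);
  // Explicit transcoding from ACodePage to UTF-8.
  SetCodePage(Raw, CP_UTF8, True);
  Result := Raw;
end;

var
  List: TStringList;
begin
  if ParamCount < 1 then
    Halt(1);
  List := TStringList.Create;
  try
    // String uses DefaultSystemCodePage (UTF-8 in a Lazarus application with
    // the new UTF-8 mode); the compiler handles any remaining difference here.
    List.Text := LoadFileAsUTF8(ParamStr(1), 1252);
    WriteLn(List.Count, ' lines loaded');
  finally
    List.Free;
  end;
end.
```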

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: Help reading an ANSI text file correctly...?
« Reply #24 on: February 05, 2016, 10:59:18 pm »
Quote
The earlier example:
Code: Pascal
SetCodePage(RawByteString(StrIn), 1252, false);  // 1252 always !! (or Windows.GetACP())
does just that.

Note: this only works (in this case) after you read the file contents into StrIn.
If you do it before reading (when StrIn is empty) the SetCodePage() is a no-op.

Bart

wp

  • Hero Member
  • *****
  • Posts: 11916
Re: Help reading an ANSI text file correctly...?
« Reply #25 on: February 05, 2016, 11:09:19 pm »
That's clearly a step back -- TStringList.LoadFromFile was so easy! I think it's time that someone modifies this method so that it accepts an optional encoding parameter (similar to Delphi). And it should also work for TMemo.Lines, Listbox.Items etc.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Help reading an ANSI text file correctly...?
« Reply #26 on: February 05, 2016, 11:51:25 pm »
Quote
That's clearly a step back -- TStringList.LoadFromFile was so easy! I think it's time that someone modifies this method so that it accepts an optional encoding parameter (similar to Delphi). And it should also work for TMemo.Lines, Listbox.Items etc.

Does Delphi's TStringList.LoadFromFile have such a parameter? OK, then the FPC project may add it.
In general, they may not want to adjust FPC libs for problems caused by the Lazarus UTF-8 "hack".
Lazarus classes and components can be modified more easily.
Also a class helper for TStringList could be made. I have not used class helpers myself yet but patches are welcome.
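
One possible shape for such a class helper, as a sketch only. The helper type TStringsEncodingHelper and its method LoadFromFileEncoded are hypothetical names that do not exist in FPC or the LCL. The sketch assumes the LazUtils unit LConvEncoding is available and uses its ConvertEncoding function with the encoding name constants EncodingCP1252 and EncodingUTF8.

```pascal
program StringsHelperDemo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils,
  LConvEncoding;  // part of the LazUtils package (assumed to be on the unit path)

type
  // Hypothetical helper, for illustration only
  TStringsEncodingHelper = class helper for TStrings
    procedure LoadFromFileEncoded(const AFileName, AEncodingName: string);
  end;

procedure TStringsEncodingHelper.LoadFromFileEncoded(
  const AFileName, AEncodingName: string);
var
  Stream: TFileStream;
  Raw: string;
begin
  Raw := '';
  Stream := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Raw, Stream.Size);
    if Length(Raw) > 0 then
      Stream.ReadBuffer(Raw[1], Length(Raw));
  finally
    Stream.Free;
  end;
  // Convert once, then let the Text setter split the result into lines
  Text := ConvertEncoding(Raw, AEncodingName, EncodingUTF8);
end;

var
  List: TStringList;
begin
  if ParamCount < 1 then
    Halt(1);
  List := TStringList.Create;
  try
    List.LoadFromFileEncoded(ParamStr(1), EncodingCP1252);
    WriteLn(List.Count, ' lines loaded');
  finally
    List.Free;
  end;
end.
```

Because the helper is declared for TStrings, the same call should also be usable on Memo1.Lines or ListBox1.Items wherever the helper's unit is in scope.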

Still, the step back is rather small. UTF-8 is the most common encoding in text files. Other encodings can be seen as a special case, yet they can be used by adding just a few lines of code. With proper helper functions / classes it may become just one line in the future.
No worries ...

wp

  • Hero Member
  • *****
  • Posts: 11916
Re: Help reading an ANSI text file correctly...?
« Reply #27 on: February 06, 2016, 12:16:07 am »
Quote
Does Delphi's TStringList.LoadfromFile have such a parameter?

Quoted from the Delphi XE2 help file of TStrings (the overloads were not there yet at the times of D7):

Code: Text
procedure LoadFromFile(const FileName: string); overload; virtual;
procedure LoadFromFile(const FileName: string; Encoding: TEncoding); overload; virtual;

procedure SaveToFile(const FileName: string); overload; virtual;
procedure SaveToFile(const FileName: string; Encoding: TEncoding); overload; virtual;

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
Re: Help reading an ANSI text file correctly...?
« Reply #28 on: February 09, 2016, 10:52:29 am »
Quote
The earlier example:
Code: Pascal
SetCodePage(RawByteString(StrIn), 1252, false);  // 1252 always !! (or Windows.GetACP())
does just that.

Note: this only works (in this case) after you read the file contents into StrIn.
If you do it before reading (when StrIn is empty) the SetCodePage() is a no-op.

Wrong. Setting dynamic codepage for an empty string should work. It is a bug if it does not work.

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: Help reading an ANSI text file correctly...?
« Reply #29 on: February 09, 2016, 06:19:16 pm »
Quote
Wrong. Setting dynamic codepage for an empty string should work. It is a bug if it does not work.

Wrong  >:D

Code: Pascal
procedure SetCodePage(var s : RawByteString; CodePage : TSystemCodePage; Convert : Boolean = True);
  var
    TranslatedCodePage,
    TranslatedCurrentCodePage: TSystemCodePage;
  begin
    if (S='') then
      exit;
  ....

I distinctly remember this, because I made such a mistake when implementing the Utf8Insert/Utf8Delete with Utf8String parameters, and got the comment: "the code in this revision makes no sense at all"  :-[
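
The early exit in the excerpt above can be checked with a small test program like the sketch below, assuming FPC 3.x. Code page 1251 is an arbitrary choice for the test. If the early exit applies, the first line is expected to show the default code page instead of 1251, so the test is only meaningful on a system whose default code page is not 1251.

```pascal
program EmptyStringCodePage;
{$mode objfpc}{$H+}

var
  S: RawByteString;
begin
  // Case 1: label an empty string
  S := '';
  SetCodePage(S, 1251, False);
  WriteLn('empty string:     ', StringCodePage(S));

  // Case 2: label a string that already has content
  S := 'abc';
  SetCodePage(S, 1251, False);
  WriteLn('non-empty string: ', StringCodePage(S));
end.
```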

Bart

 
