Author Topic: Readln and UTF16 TextFiles  (Read 1507 times)

HomeBoy38

  • New Member
  • *
  • Posts: 20
Readln and UTF16 TextFiles
« on: September 15, 2019, 05:07:49 pm »
Hi,

I am trying some simple code to read textfiles:
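 var
  MyF, MyDestFile: TextFile;
  MyLine: WideString;
 begin
  AssignFile(MyF, 'c:\temp\test.txt');
  AssignFile(MyDestFile, 'c:\temp\result.txt');
  Reset(MyF);
  Rewrite(MyDestFile);
  try
   // Copy the source file to the destination line by line
   while not EOF(MyF) do
   begin
    ReadLn(MyF, MyLine);
    WriteLn(MyDestFile, MyLine);
   end;
  finally
   // Close both files even if reading or writing fails
   CloseFile(MyDestFile);
   CloseFile(MyF);
  end;
 end;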

If my file is ANSI, I read into a String variable and do "MyLine := UTF8ToUTF16(WinCPToUTF8(MyString));" to get it into my WideString.
If my file is UTF-8, no explicit conversion is needed. Honestly, I would have expected my code to fail in that case.
If my file is Unicode (UTF-16, little or big endian), ReadLn seems to fail (for instance, I cannot display the line in a memo or write it out).

I used to work with a TFileStream, but since reading the file line by line is important in my case, ReadLn made my code simpler, clearer and faster.

What am I missing?

Thanks
« Last Edit: September 30, 2019, 06:09:21 pm by HomeBoy38 »

jamie

  • Hero Member
  • *****
  • Posts: 1997
Re: Readln and UTF16 TextFiles
« Reply #1 on: September 15, 2019, 05:35:49 pm »
When dealing with 2-byte wide chars, you need to convert down to single-byte strings for any of the LCL controls that I know of to work.

UTF-8 seems to be the norm these days for the LCL.

Of course, there is always a chance of translation losses.


HomeBoy38

  • New Member
  • *
  • Posts: 20
Re: Readln and UTF16 TextFiles
« Reply #2 on: September 15, 2019, 05:54:30 pm »
I agree I need to do some conversion, but my issue seems to occur at ReadLn time, as I am not able to convert MyLine to UTF-8.

Bart

  • Hero Member
  • *****
  • Posts: 3518
    • Bart en Mariska's Webstek
Re: Readln and UTF16 TextFiles
« Reply #3 on: September 15, 2019, 10:44:24 pm »
See the fpc-pascal ML topic: Read lines into UnicodeString variable from UCS2 (UTF-16) encoded text file

My stupid and lazy workaround, probably not suitable for larger files.

Code: Pascal
{$mode objfpc}
{$h+}
uses
  sysutils;

type
  TUCS2TextFile = file of WideChar;

// Reads one line of UTF-16 LE text, dropping CR characters and the BOM
procedure ReadLine(var F: TUCS2TextFile; out S: UnicodeString);
var
  WC: WideChar;
begin
  // Assumes the file is already opened for reading
  S := '';
  while not Eof(F) do
  begin
    Read(F, WC);
    if WC = WideChar(#$000A) then
      exit  // LF ends the line
    else
      if (WC <> WideChar(#$000D)) and (WC <> WideChar(#$FEFF {Unicode LE BOM})) then
        S := S + WC;
  end;
end;

var
  UFile: TUCS2TextFile;
  US: UnicodeString;
begin
  AssignFile(UFile, 'ucs2.txt');
  Reset(UFile);
  while not Eof(UFile) do
  begin
    ReadLine(UFile, US);
    writeln('US = ',US);
  end;
  CloseFile(UFile);
end.

Outputs
Code:
US = Line1
US = Line2
US = Line3
which is correct for my test file (Unicode LE encoding created with Notepad).

Bart

winni

  • Sr. Member
  • ****
  • Posts: 339
Re: Readln and UTF16 TextFiles
« Reply #4 on: September 15, 2019, 11:04:43 pm »
For easier hacking:

The BOMs are constants in unit LConvEncoding:

Code: Pascal
const
  UTF8BOM = #$EF#$BB#$BF;
  UTF16BEBOM = #$FE#$FF;
  UTF16LEBOM = #$FF#$FE;
  UTF32BEBOM = #0#0#$FE#$FF;
  UTF32LEBOM = #$FE#$FF#0#0;

Winni
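
As a possible illustration of how such BOM values could be used, here is a minimal sketch that reads the first bytes of a file through a TFileStream and reports which BOM, if any, it starts with. The function name DetectBomName and the labels it returns are hypothetical and not part of LConvEncoding. UTF-32 is deliberately ignored in this sketch, so a UTF-32 LE file would be reported as UTF-16 LE.

```pascal
program DetectBom;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

// Hypothetical helper: returns a short label for the BOM at the start of the file
function DetectBomName(const FileName: string): string;
var
  FS: TFileStream;
  Buf: array[0..3] of Byte;
  N: Integer;
begin
  Result := 'none';
  FillChar(Buf, SizeOf(Buf), 0);
  FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // N is the number of bytes actually read (may be < 4 for tiny files)
    N := FS.Read(Buf, SizeOf(Buf));
  finally
    FS.Free;
  end;
  if (N >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF) then
    Result := 'UTF-8'
  else if (N >= 2) and (Buf[0] = $FE) and (Buf[1] = $FF) then
    Result := 'UTF-16 BE'
  else if (N >= 2) and (Buf[0] = $FF) and (Buf[1] = $FE) then
    Result := 'UTF-16 LE'; // assumption: UTF-32 LE is not expected here
end;

begin
  if ParamCount >= 1 then
    WriteLn(DetectBomName(ParamStr(1)))
  else
    WriteLn('usage: detectbom <file>');
end.
```

A file without a BOM yields 'none', which says nothing about its real encoding; ANSI and BOM-less UTF-8 files both fall into that case.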

HomeBoy38

  • New Member
  • *
  • Posts: 20
Re: Readln and UTF16 TextFiles
« Reply #5 on: September 16, 2019, 12:23:20 pm »
I used to load the whole file into memory, but because of large files (around 150 MB) I had to find a more suitable solution. I then read the file with a TFileStream, and when I tried ReadLn it was, as I said, a good solution until I tested UTF-16.

My intention is not to reimplement ReadLn, but rather to understand why it does not fit, despite what I thought it was doing.

Thanks for your suggestions; they might help later if that is the only valid option.

Note: I do not have UTF-32 files and none of my programs handle them. Do you see such files in your environments?

HomeBoy38

  • New Member
  • *
  • Posts: 20
Re: Readln and UTF16 TextFiles
« Reply #6 on: September 28, 2019, 01:38:29 pm »
Anyone have an idea why I cannot use ReadLN with Unicode files?

winni

  • Sr. Member
  • ****
  • Posts: 339
Re: Readln and UTF16 TextFiles
« Reply #7 on: September 28, 2019, 01:46:17 pm »
Just NO !

Show us some code please.

wp

  • Hero Member
  • *****
  • Posts: 6235
Re: Readln and UTF16 TextFiles
« Reply #8 on: September 28, 2019, 01:51:29 pm »
Just guessing: maybe ReadLn searches for the line end as the bytes #13#10, but in a Unicode (UTF-16) file these are the 16-bit characters #$000D#$000A, so an extra zero byte accompanies each of them.
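
To make that byte layout concrete, here is a small sketch that dumps the in-memory bytes of a UnicodeString containing a CR/LF pair. It is only an illustration of UTF-16 encoding, not of what ReadLn does internally, and the program name is made up.

```pascal
program Utf16Bytes;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  S: UnicodeString;
  P: PByte;
  I: Integer;
begin
  S := 'A'#13#10'B';          // 'A', CR, LF, 'B' as UTF-16 code units
  P := PByte(PWideChar(S));   // view the same memory byte by byte
  for I := 0 to Length(S) * SizeOf(WideChar) - 1 do
    Write(IntToHex(P[I], 2), ' ');
  WriteLn;
end.
```

On a little-endian machine this should print 41 00 0D 00 0A 00 42 00, which is the same layout a UTF-16 LE file uses on disk. A reader that works byte by byte therefore meets a zero byte next to every CR and LF.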

I agree with winni: Please post the file that you want to read and some compilable demo code that you use to read the file. Otherwise there are useless misunderstandings and endless discussions.
Lazarus trunk / fpc 3.0.4 / all 32-bit on Win-10

winni

  • Sr. Member
  • ****
  • Posts: 339
Re: Readln and UTF16 TextFiles
« Reply #9 on: September 28, 2019, 01:54:25 pm »
Perhaps your first line is empty?

Test it that way:

Code: Pascal
 var
  MyF: TextFile;
  MyLine: String;
  i: Integer = 0;
 begin
  AssignFile(MyF, 'c:\temp\test.txt');
  Reset(MyF);
  while not EOF(MyF) do
  begin
   ReadLn(MyF, MyLine);
   // Show every 100th line so the GUI stays responsive
   if i mod 100 = 0 then
   begin
    Label1.Caption := MyLine;
    Application.ProcessMessages;
   end;
   Inc(i);
  end;
  CloseFile(MyF);
 end;

As you see, you need to put a label on your form.

Winni

Bart

  • Hero Member
  • *****
  • Posts: 3518
    • Bart en Mariska's Webstek
Re: Readln and UTF16 TextFiles
« Reply #10 on: September 28, 2019, 05:06:20 pm »
Quote from HomeBoy38: "Anyone have an idea why I cannot use ReadLN with Unicode files?"

Because this is not supported by the compiler.
And before you complain: Delphi does not support that either.

Bart

Thaddy

  • Hero Member
  • *****
  • Posts: 8952
Re: Readln and UTF16 TextFiles
« Reply #11 on: September 28, 2019, 06:42:08 pm »
OTOH, streams DO support UTF-16 in trunk (and possibly in 3.2.0 too). It seems that Read/WriteAnsiString also supports the other Pascal string types supported by the compiler. See classesh.inc and streams.inc.
« Last Edit: September 28, 2019, 07:00:57 pm by Thaddy »
Most people that want to use threading should learn to patch their jeans first: use a needle.

HomeBoy38

  • New Member
  • *
  • Posts: 20
Re: Readln and UTF16 TextFiles
« Reply #12 on: September 30, 2019, 06:43:00 pm »
winni: I did not get the point of your i counter, but I tried it, without any more luck. My first line was indeed empty, but removing it was not the solution. I tried both String and WideString without success.

Thaddy: I did not understand your advice either; can you be a little more explicit?

wp: I slightly modified the source code in my first post and kept only what I think is relevant. The file I am reading is 95 MB. It is an ANSI file that I saved as Unicode with Notepad. You should be able to reproduce my case easily. If not, let me know your code :)
After reading your post, instead of trying to display or write the string, I decided to echo its length: I get a line with length = 1, then a line with the correct length - 1.

Any suggestions are welcome, hopefully this will help other people too...

wp

  • Hero Member
  • *****
  • Posts: 6235
Re: Readln and UTF16 TextFiles
« Reply #13 on: September 30, 2019, 07:26:33 pm »
Is the file in the attachment one which you cannot read? It is Unicode/Little Endian with BOM.

HomeBoy38

  • New Member
  • *
  • Posts: 20
Re: Readln and UTF16 TextFiles
« Reply #14 on: September 30, 2019, 07:35:12 pm »
I tried your file and I have the same issue.
I also tried loading my file into a TFileStream and then using ReadLine from TStreamReader; I have the same problem.
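
For large files, one possible direction is to combine the TFileStream approach with the character-by-character splitting shown earlier in the thread, but reading the file in blocks. The following is only a sketch under stated assumptions: the file is UTF-16 LE, the program runs on a little-endian machine, and the stream only returns a short read at the end of the file. The program name and the block size BufChars are arbitrary choices.

```pascal
program ReadUtf16Lines;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

const
  BufChars = 32768; // UTF-16 code units per block; an arbitrary choice

var
  FS: TFileStream;
  Buf: array[0..BufChars - 1] of WideChar;
  Line: UnicodeString;
  BytesRead, I: Integer;
  LineCount: Int64;
begin
  if ParamCount < 1 then
  begin
    WriteLn('usage: readutf16lines <file>');
    Halt(1);
  end;
  Line := '';
  LineCount := 0;
  FS := TFileStream.Create(ParamStr(1), fmOpenRead or fmShareDenyWrite);
  try
    repeat
      BytesRead := FS.Read(Buf, SizeOf(Buf));
      for I := 0 to BytesRead div SizeOf(WideChar) - 1 do
        if Buf[I] = WideChar($000A) then
        begin
          // Line holds one complete line here; UTF8Encode(Line) would
          // give a UTF-8 copy for LCL controls
          Inc(LineCount);
          Line := '';
        end
        else if (Buf[I] <> WideChar($000D)) and (Buf[I] <> WideChar($FEFF)) then
          Line := Line + Buf[I]; // skip CR and the BOM, keep everything else
    until BytesRead = 0;
    // Count a last line that has no trailing line feed
    if Line <> '' then
      Inc(LineCount);
  finally
    FS.Free;
  end;
  WriteLn('Lines read: ', LineCount);
end.
```

Because a line may span two blocks, Line is carried over between reads instead of being reset per block. Appending one character at a time is simple but not the fastest option; it is kept here only to stay close to the ReadLine workaround above.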