
Author Topic: Readln and UTF16 TextFiles  (Read 25366 times)

HomeBoy38

  • Jr. Member
  • **
  • Posts: 59
Re: Readln and UTF16 TextFiles
« Reply #75 on: July 24, 2020, 03:03:55 pm »
I created https://bugs.freepascal.org/view.php?id=37416
I hope I made the report useful enough...

Handoko, have a good night. MarkMLl, enjoy the day

MarkMLl

  • Hero Member
  • *****
  • Posts: 8317
Re: Readln and UTF16 TextFiles
« Reply #76 on: July 24, 2020, 03:13:49 pm »
Yes, definitely right to raise a bug report.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Logitech, TopSpeed & FTL Modula-2 on bare metal (Z80, '286 protected mode).
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

Handoko

  • Hero Member
  • *****
  • Posts: 5396
  • My goal: build my own game engine using Lazarus
Re: Readln and UTF16 TextFiles
« Reply #77 on: July 25, 2020, 09:02:55 pm »
I did more tests, and this is what I found. I have also provided the source code and some screenshots to better describe the issues. I hope it is useful for anyone who wants to fix or improve UTF-16 ReadLn and WriteLn.

Note:
Previously I used an array of char to read the buffer directly. That was wrong; it should be an array of word. You can replace it in my code below with an array of byte (or char) and see the weird results.
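A minimal sketch of the same idea without the overlay array: a UnicodeString can be indexed directly, and each element is one 16-bit code unit. The program name and the sample text are made up for illustration; the last character is built from its code point so that the source file encoding does not matter.

```pascal
program dumpunits;

{$mode objfpc}{$H+}

uses
  SysUtils;

var
  S: UnicodeString;
  i: Integer;
begin
  // Illustrative sample text: 'caf' followed by U+00E9.
  S := 'caf';
  S := S + WideChar($00E9);

  // Each S[i] is one 16-bit code unit, so print four hex digits per element.
  for i := 1 to Length(S) do
    Write(IntToHex(Ord(S[i]), 4), ' ');
  WriteLn;
end.
```

Reading the same memory as an array of byte would split every code unit into two values, which is why the byte and char variants look odd.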

So far, the problems we found are:

1. WriteLn does not correctly write EOL characters for UTF-16 data
2. ReadLn does not give correct result when reading UTF-16 text file
3. ReadLn result is not consistent on Console Mode vs GUI and Linux vs Windows

For issue #1, MarkMLl suggested some possible solutions:
https://forum.lazarus.freepascal.org/index.php/topic,46759.msg370898.html#msg370898
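One way to sidestep issue #1 entirely, sketched here as an assumption and not as one of the solutions from the linked post, is to bypass WriteLn and write the 16-bit code units through a stream, including an explicit CR LF pair. The helper name WriteLineUTF16LE and the output file name are hypothetical, and the code assumes a little-endian CPU so that the in-memory layout is already UTF-16LE.

```pascal
program writeutf16;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: writes one line followed by a 16-bit CR LF pair.
procedure WriteLineUTF16LE(Stream: TStream; const Line: UnicodeString);
const
  EOL: array[0..1] of Word = ($000D, $000A);
begin
  if Length(Line) > 0 then
    Stream.WriteBuffer(Line[1], Length(Line) * SizeOf(WideChar));
  Stream.WriteBuffer(EOL, SizeOf(EOL));
end;

const
  BOM: Word = $FEFF;  // stored as the bytes FF FE on a little-endian machine
var
  Stream: TFileStream;
begin
  Stream := TFileStream.Create('out-utf16le.txt', fmCreate);
  try
    Stream.WriteBuffer(BOM, SizeOf(BOM));
    WriteLineUTF16LE(Stream, 'first line');
    WriteLineUTF16LE(Stream, 'second line');
  finally
    Stream.Free;
  end;
end.
```

The resulting file can be checked in a hex viewer: every line should end in 0D 00 0A 00.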

For issue #2, see the images Sample Data (GHex n LibreOffice).png and ReadLn-Console vs ReadLn-GUI (Linux).png. The original sample data is shown in the GHex binary editor/viewer and in LibreOffice Writer, and the data read into memory by ReadLn is shown in console and GUI modes. They are different. The original sample.txt has 4 lines, but ReadLn reads the file as 9 lines, and the hex values are not exactly the same. The Linux console version is pretty accurate. For easier comparison, I marked every second hex value in green.

For issue #3, in ReadLn-Console vs ReadLn-GUI (Linux).png we can see that the hex values are not exactly the same. In ReadLn-GUI Linux vs Windows.png we can see that the original $FFFE becomes $FFFDFFFD on Windows (you can modify the code to see it clearly) but $3F3F on Linux. Also, the original $00E9 becomes $0000FFFD on Windows and $003F on Linux.

The tests were performed with Lazarus 2.0.10 on Linux (GTK) and on Windows 7. The sample.txt and the source code (readlnconsole and readlngui) are provided for download.

This is ReadLnConsole:
Code: Pascal
program readlnconsole;

{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}{$IFDEF UseCThreads}
  cthreads,
  {$ENDIF}{$ENDIF}
  Classes, sysutils, Crt
  { you can add units after this };

var
  Buffer:  WideString;
  // Overlay that exposes the string payload as 16-bit elements
  Data:    array of Word absolute Buffer;
  aFile:   TextFile;
  Line:    Integer;
  i:       Integer;

begin
  Line := 0;
  AssignFile(aFile, 'sample.txt');
  Reset(aFile);
  while not EOF(aFile) do
  begin
    ReadLn(aFile, Buffer);
    Inc(Line);
    TextColor(White);
    Write(Line.ToString);
    // Dump the line as hex, colouring every second value green
    for i := 0 to Length(Buffer)-1 do
    begin
      case Odd(i) of
        True:
          begin
            TextColor(Green);
          end;
        False:
          begin
            Write(' ');
            TextColor(White);
          end;
      end;
      // Prints at least two hex digits per 16-bit element
      Write(IntToHex(Data[i], 2));
    end;
    WriteLn;
  end;
  CloseFile(aFile);
end.

And this is ReadLnGUI:
Code: Pascal
unit Unit1;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, Controls, Graphics, StdCtrls;

type

  { TForm1 }

  TForm1 = class(TForm)
    Button1: TButton;
    procedure Button1Click(Sender: TObject);
  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

{ TForm1 }

procedure TForm1.Button1Click(Sender: TObject);
var
  Buffer:  WideString;
  // Overlay that exposes the string payload as 16-bit elements
  Data:    array of Word absolute Buffer;
  aFile:   TextFile;
  Line:    Integer;
  XShift:  Integer;
  i:       Integer;
begin
  Line := 0;
  AssignFile(aFile, 'sample.txt');
  Reset(aFile);
  while not EOF(aFile) do
  begin
    ReadLn(aFile, Buffer);
    Inc(Line);
    Canvas.Brush.Color := clDefault;
    Canvas.Font.Color := clBlack;
    Canvas.TextOut(20, Line*25, 'Line-' + Line.ToString);
    // Dump the line as hex, colouring every second value green
    for i := 0 to Length(Buffer)-1 do
    begin
      case Odd(i) of
        True:
          begin
            Canvas.Font.Color := clGreen;
            XShift := 70;
          end;
        False:
          begin
            Canvas.Font.Color := clBlack;
            XShift := 73;
          end;
      end;
      Canvas.Brush.Color := clWhite;
      Canvas.TextOut(i*20+XShift, Line*25, IntToHex(Data[i], 2));
    end;
  end;
  CloseFile(aFile);
end;

end.


HomeBoy38

  • Jr. Member
  • **
  • Posts: 59
Re: Readln and UTF16 TextFiles
« Reply #78 on: July 25, 2020, 09:26:10 pm »
Thanks Handoko for updating this post and the bug report

MarkMLl

  • Hero Member
  • *****
  • Posts: 8317
Re: Readln and UTF16 TextFiles
« Reply #79 on: July 25, 2020, 10:10:15 pm »
Quote
3. ReadLn result is not consistent on Console Mode vs GUI and Linux vs Windows

Does Lazarus still attempt to load a system-defined Unicode manager?

MarkMLl

Handoko

  • Hero Member
  • *****
  • Posts: 5396
  • My goal: build my own game engine using Lazarus
Re: Readln and UTF16 TextFiles
« Reply #80 on: July 26, 2020, 07:01:50 am »
Quote
3. ReadLn result is not consistent on Console Mode vs GUI and Linux vs Windows
Does Lazarus still attempt to load a system-defined Unicode manager?

Maybe. But if so, how should we write code that gives uniform results across platforms, in both console and GUI?
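One approach that avoids the platform-dependent text-file layer, sketched here as an assumption and not as a confirmed answer, is to read the raw bytes with a stream and split the lines on 16-bit code units. The helper name LoadUTF16LE is hypothetical; the code assumes the file is UTF-16LE (optionally with a BOM) and that the CPU is little-endian.

```pascal
program readutf16;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: loads a UTF-16LE file and drops a leading BOM.
function LoadUTF16LE(const FileName: string): UnicodeString;
var
  Stream: TFileStream;
  Units: SizeInt;
begin
  Result := '';
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    Units := Stream.Size div SizeOf(WideChar);
    SetLength(Result, Units);
    if Units > 0 then
      Stream.ReadBuffer(Result[1], Units * SizeOf(WideChar));
  finally
    Stream.Free;
  end;
  if (Length(Result) > 0) and (Ord(Result[1]) = $FEFF) then
    Delete(Result, 1, 1);
end;

var
  Content: UnicodeString;
  i, Start, LineNo: Integer;
begin
  Content := LoadUTF16LE('sample.txt');
  LineNo := 0;
  Start := 1;
  i := 1;
  while i <= Length(Content) do
  begin
    if (Content[i] = #13) or (Content[i] = #10) then
    begin
      Inc(LineNo);
      WriteLn(LineNo, ': ', i - Start, ' code units');
      // Treat a CR LF pair as a single line break.
      if (Content[i] = #13) and (i < Length(Content)) and (Content[i + 1] = #10) then
        Inc(i);
      Start := i + 1;
    end;
    Inc(i);
  end;
  // A final line without a trailing line break.
  if Start <= Length(Content) then
  begin
    Inc(LineNo);
    WriteLn(LineNo, ': ', Length(Content) - Start + 1, ' code units');
  end;
end.
```

Because nothing here goes through ReadLn or a Unicode manager, the same bytes should produce the same code units in console and GUI programs alike.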

MarkMLl

  • Hero Member
  • *****
  • Posts: 8317
Re: Readln and UTF16 TextFiles
« Reply #81 on: July 26, 2020, 09:19:10 am »
Quote
3. ReadLn result is not consistent on Console Mode vs GUI and Linux vs Windows
Does Lazarus still attempt to load a system-defined Unicode manager?
Maybe. But if yes, then how should we code to get uniform results on multi platform both console and GUI?

Not knowing, can't say. I've just got some vague recollections from when UTF-8 started to make its way into the projects.

MarkMLl

paweld

  • Hero Member
  • *****
  • Posts: 1323
Re: Readln and UTF16 TextFiles
« Reply #82 on: July 26, 2020, 02:18:31 pm »
@Handoko - if you define the Buffer as String, then everything is OK:
gui:
Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
var
  Buffer:  String;
  // Overlay that exposes the string payload as raw bytes
  Data:    array of Byte absolute Buffer;
  aFile:   TextFile;
  Line:    Integer;
  XShift:  Integer;
  i:       Integer;
begin
  Line := 0;
  AssignFile(aFile, 'sample.txt');
  Reset(aFile);
  while not EOF(aFile) do
  begin
    ReadLn(aFile, Buffer);
    Inc(Line);
    Canvas.Brush.Color := clDefault;
    Canvas.Font.Color := clBlack;
    Canvas.TextOut(20, Line*25, 'Line-' + Line.ToString);
    // Dump the line byte by byte, colouring every second value green
    for i := 0 to Length(Buffer)-1 do
    begin
      case Odd(i) of
        True:
          begin
            Canvas.Font.Color := clGreen;
            XShift := 70;
          end;
        False:
          begin
            Canvas.Font.Color := clBlack;
            XShift := 73;
          end;
      end;
      Canvas.Brush.Color := clWhite;
      Canvas.TextOut(i*20+XShift, Line*25, IntToHex(Data[i], 2));
    end;
  end;
  CloseFile(aFile);
end;

console:
Code: Pascal
program readlnconsole;

{$mode objfpc}{$H+}

uses
  SysUtils, Crt;

var
  Buffer:  String;
  // Overlay that exposes the string payload as raw bytes
  Data:    array of Byte absolute Buffer;
  aFile:   TextFile;
  Line:    Integer;
  i:       Integer;

begin
  Line := 0;
  AssignFile(aFile, 'sample.txt');
  Reset(aFile);
  while not EOF(aFile) do
  begin
    ReadLn(aFile, Buffer);
    Inc(Line);
    TextColor(White);
    Write(Line.ToString);
    // Dump the line byte by byte, colouring every second value green
    for i := 0 to Length(Buffer)-1 do
    begin
      case Odd(i) of
        True:
          begin
            TextColor(Green);
          end;
        False:
          begin
            Write(' ');
            TextColor(White);
          end;
      end;
      Write(IntToHex(Data[i], 2));
    end;
    WriteLn;
  end;
  CloseFile(aFile);
end.

Then we need to convert the encoding to UTF-8 (GUI):
Code: Pascal
// Needs LConvEncoding (LazUtils) in the uses clause for GuessEncoding
// and the conversion functions and constants.
procedure TForm1.Button1Click(Sender: TObject);
var
  Buffer:  String;
  Data:    array of Byte absolute Buffer;
  aFile:   TextFile;
  Line:    Integer;
  XShift:  Integer;
  TextX:   Integer;
  i:       Integer;
  enc:     String;
begin
  Line := 0;
  enc := '';
  AssignFile(aFile, 'sample.txt');
  Reset(aFile);
  while not EOF(aFile) do
  begin
    ReadLn(aFile, Buffer);
    // Guess the encoding once, from the first line (which carries the BOM)
    if Line = 0 then
      enc := GuessEncoding(Buffer);
    Inc(Line);
    Canvas.Brush.Color := clDefault;
    Canvas.Font.Color := clBlack;
    Canvas.TextOut(20, Line*25, 'Line-' + Line.ToString);
    // Remember where the hex dump ends; the loop variable is undefined after the loop
    TextX := Length(Buffer)*20 + 73;
    for i := 0 to Length(Buffer)-1 do
    begin
      case Odd(i) of
        True:
          begin
            Canvas.Font.Color := clGreen;
            XShift := 70;
          end;
        False:
          begin
            Canvas.Font.Color := clBlack;
            XShift := 73;
          end;
      end;
      Canvas.Brush.Color := clWhite;
      Canvas.TextOut(i*20+XShift, Line*25, IntToHex(Data[i], 2));
    end;
    //encoding
    if enc = EncodingUTF8BOM then
    begin
      if Length(Buffer) > 0 then
      begin
        if Copy(Buffer, 1, Length(UTF8BOM)) <> UTF8BOM then
          Buffer := UTF8BOM + Buffer;
        Buffer := UTF8BOMToUTF8(Buffer);
      end;
    end
    else if enc = EncodingUCS2LE then
    begin
      // Drop the stray high byte left over from the previous line break
      if (Length(Buffer) > 0) and (Buffer[1] = #0) then
        Delete(Buffer, 1, 1);
      if Length(Buffer) > 0 then
      begin
        if Copy(Buffer, 1, Length(UTF16LEBOM)) <> UTF16LEBOM then
          Buffer := UTF16LEBOM + Buffer;
        Buffer := UCS2LEToUTF8(Buffer);
      end;
    end
    else if enc = EncodingUCS2BE then
    begin
      // Drop the stray high byte that precedes the next line break
      if (Length(Buffer) > 0) and (Buffer[Length(Buffer)] = #0) then
        Delete(Buffer, Length(Buffer), 1);
      if Length(Buffer) > 0 then
      begin
        if Copy(Buffer, 1, Length(UTF16BEBOM)) <> UTF16BEBOM then
          Buffer := UTF16BEBOM + Buffer;
        Buffer := UCS2BEToUTF8(Buffer);
      end;
    end;
    //print string
    Canvas.Font.Color := clRed;
    Canvas.Font.Style := [fsBold];
    Canvas.TextOut(TextX, Line*25, Buffer);
    Canvas.Font.Style := [];
  end;
  CloseFile(aFile);
end;
Best regards / Pozdrawiam
paweld

Handoko

  • Hero Member
  • *****
  • Posts: 5396
  • My goal: build my own game engine using Lazarus
Re: Readln and UTF16 TextFiles
« Reply #83 on: July 26, 2020, 02:39:29 pm »
Thank you paweld for the suggestion.

Yes, defining the Buffer as String makes the results look okay. But why? The sample.txt is in UTF-16 encoding. Are WideString and UTF16String not meant for storing UTF-16 data? Is there any explanation?

Quote
UTF16String

The type UTF16String is an alias to the type WideString. In the LCL unit lclproc it is an alias to UnicodeString.

WideString

Variables of type WideString (used to represent unicode character strings in COM applications) resemble those of type UnicodeString, but unlike them they are not reference-counted. On Windows they are allocated with a special windows function which allows them to be used for OLE automation.

WideStrings consist of COM compatible UTF16 encoded bytes on Windows machines (UCS2 on Windows 2000), and they are encoded as plain UTF16 on Linux, Mac OS X and iOS.
Source: https://wiki.freepascal.org/Character_and_string_types#UTF16String

Your code relies heavily on encoding functions that I'm not familiar with. I need more time to understand them. They seem complicated; I wish it were easier.

A question for you: do you think that what we were discussing isn't actually a bug, and that we simply failed to understand how to work with UTF-16 strings?

The code you modified still shows 9 lines read by ReadLn.
« Last Edit: July 26, 2020, 02:41:56 pm by Handoko »

paweld

  • Hero Member
  • *****
  • Posts: 1323
Re: Readln and UTF16 TextFiles
« Reply #84 on: July 26, 2020, 03:07:03 pm »
Quote
... But why? The sample.txt is in UTF-16 encoding. Are Widestring and UTF16String not for storing UTF-16 data? Any explanation?
TextFile is essentially an extended version of a file of byte/char (not widechar), I think :-)

Quote
The code modified by you still showing 9 lines result from ReadLn.
Each CR byte and each LF byte is taken as the end of a line.
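A small sketch of why that inflates the line count: in UTF-16LE a CR LF pair is stored as the bytes 0D 00 0A 00, so a byte-wise reader sees the CR and the LF separated by a zero byte and counts two breaks. The sample bytes below are made up for illustration. If each of the 4 lines in sample.txt ends in CR LF (an assumption), that gives 8 breaks plus one trailing zero byte, which would be consistent with the 9 lines reported.

```pascal
program countbreaks;

{$mode objfpc}{$H+}

const
  // 'A', CR, LF, 'B', CR, LF as UTF-16LE bytes (no BOM); illustrative data.
  Sample: array[0..11] of Byte =
    ($41, $00, $0D, $00, $0A, $00, $42, $00, $0D, $00, $0A, $00);
var
  i, ByteBreaks, UnitBreaks: Integer;
  W: Word;
begin
  ByteBreaks := 0;
  UnitBreaks := 0;

  // Byte-wise view: every $0D and every $0A byte counts as a break.
  for i := Low(Sample) to High(Sample) do
    if (Sample[i] = $0D) or (Sample[i] = $0A) then
      Inc(ByteBreaks);

  // 16-bit view: combine two bytes (low byte first) and count LF code units.
  i := Low(Sample);
  while i < High(Sample) do
  begin
    W := Sample[i] or (Sample[i + 1] shl 8);
    if W = $000A then
      Inc(UnitBreaks);
    Inc(i, 2);
  end;

  WriteLn('byte-wise breaks:   ', ByteBreaks);
  WriteLn('16-bit line breaks: ', UnitBreaks);
end.
```

For this sample the byte-wise count is 4 and the 16-bit count is 2, i.e. the byte-wise reader sees twice as many line ends as the text really has.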
Best regards / Pozdrawiam
paweld

TRon

  • Hero Member
  • *****
  • Posts: 4139
Re: Readln and UTF16 TextFiles
« Reply #85 on: August 03, 2020, 04:40:27 am »
When this topic started, I did some research in the RTL documentation but, dumb as I am, I did not note down all my findings, which explain some of the issues (one of those links had something specific to say about ReadLn/WriteLn and line endings in combination with ASCII/UTF, depending on the platform).

However, I was able to re-find this link (https://www.freepascal.org/docs-html/rtl/system/tfiletextrecchar.html):
Quote
TFileTextRecChar is the type of character used in TextRec or FileRec file types. It is an alias type, depending on platform and RTL compilation flags. No assumptions should be made on the actual character type.
So in case you are wondering why this whole thing doesn't work out of the box as expected: it depends on how the RTL was compiled and on which platform you are running. That makes perfect sense to me, and as such it could IMHO only be considered a feature request and not a bug.
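A quick way to check what a given installation uses, as a minimal sketch assuming FPC 3.0 or later (where TFileTextRecChar is declared in the System unit); the program name is made up:

```pascal
program textrecchar;

{$mode objfpc}{$H+}

begin
  // 1 means a byte-sized character type, 2 a 16-bit one.
  WriteLn('SizeOf(TFileTextRecChar) = ', SizeOf(TFileTextRecChar));
end.
```

As the documentation quoted above says, the result should not be relied upon in portable code; this is only meant for inspecting one particular RTL build.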
Today is tomorrow's yesterday.
