I did more tests, and here is what I found. I am also providing the source code and some screenshots to better describe the issues. I hope this is useful for anyone who wants to fix or improve UTF-16 ReadLn and WriteLn.
Note: Previously I used array of Char to read the buffer directly. That was wrong; it should be array of Word. You can replace array of Word in my code below with array of Byte (or Char) to see the weird result.
So far, the problems we found are:
1. WriteLn does not correctly write EOL characters for UTF-16 data
2. ReadLn does not give correct result when reading UTF-16 text file
3. ReadLn results are not consistent between console and GUI modes, or between Linux and Windows
For issue #1, MarkMLI suggested some possible solutions:
https://forum.lazarus.freepascal.org/index.php/topic,46759.msg370898.html#msg370898
For issue #2, see the images Sample Data (GHex n LibreOffice).png and ReadLn-Console vs ReadLn-GUI (Linux).png. The original sample data is shown in the GHex binary editor/viewer and in LibreOffice Writer, and the data read into memory by ReadLn is shown in console and GUI modes. They are different: the original sample.txt has 4 lines, but ReadLn reads the file as 9 lines, and the hex values are not exactly the same. The Linux console version is fairly accurate. To make comparison easier, I marked every second hex value green.
For issue #3, ReadLn-Console vs ReadLn-GUI (Linux).png shows that the console and GUI hex values are not exactly the same. In ReadLn-GUI Linux vs Windows.png, the original $FFFE becomes $FFFDFFFD on Windows (you can modify the code to see it clearly) but $3F3F on Linux. Likewise, the original $00E9 becomes $0000FFFD on Windows and $003F on Linux.
The tests were performed with Lazarus 2.0.10 on Linux (GTK) and on Windows 7. The sample.txt and the source code (readlnconsole & readlngui) are provided for download.
This is ReadLnConsole:
program readlnconsole;

{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}{$IFDEF UseCThreads}
  cthreads,
  {$ENDIF}{$ENDIF}
  Classes, sysutils, Crt
  { you can add units after this };

var
  Buffer: WideString;
  // Overlay that exposes the characters of Buffer as raw 16-bit values
  Data: array of Word absolute Buffer;
  aFile: TextFile;
  Line: Integer;
  i: Integer;
begin
  Line := 0;
  AssignFile(aFile, 'sample.txt');
  Reset(aFile);
  while not EOF(aFile) do
  begin
    ReadLn(aFile, Buffer);
    Inc(Line);
    TextColor(White);
    Write(Line.ToString);
    // Dump each value in hex; every second value is shown in green
    for i := 0 to Length(Buffer)-1 do
    begin
      case Odd(i) of
        True:
          begin
            TextColor(Green);
          end;
        False:
          begin
            Write(' ');
            TextColor(White);
          end;
      end;
      Write(IntToHex(Data[i], 2));
    end;
    WriteLn;
  end;
  CloseFile(aFile);
end.
And this is ReadLnGUI:
unit Unit1;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, Controls, Graphics, StdCtrls;

type

  { TForm1 }

  TForm1 = class(TForm)
    Button1: TButton;
    procedure Button1Click(Sender: TObject);
  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

{ TForm1 }

procedure TForm1.Button1Click(Sender: TObject);
var
  Buffer: WideString;
  // Overlay that exposes the characters of Buffer as raw 16-bit values
  Data: array of Word absolute Buffer;
  aFile: TextFile;
  Line: Integer;
  XShift: Integer;
  i: Integer;
begin
  Line := 0;
  AssignFile(aFile, 'sample.txt');
  Reset(aFile);
  while not EOF(aFile) do
  begin
    ReadLn(aFile, Buffer);
    Inc(Line);
    Canvas.Brush.Color := clDefault;
    Canvas.Font.Color := clBlack;
    Canvas.TextOut(20, Line*25, 'Line-' + Line.ToString);
    // Draw each value in hex; every second value is shown in green
    for i := 0 to Length(Buffer)-1 do
    begin
      case Odd(i) of
        True:
          begin
            Canvas.Font.Color := clGreen;
            XShift := 70;
          end;
        False:
          begin
            Canvas.Font.Color := clBlack;
            XShift := 73;
          end;
      end;
      Canvas.Brush.Color := clWhite;
      Canvas.TextOut(i*20+XShift, Line*25, IntToHex(Data[i], 2));
    end;
  end;
  CloseFile(aFile);
end;

end.
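For comparison with the two ReadLn programs above, here is a minimal sketch of a reference dump that does not use ReadLn at all. It reads the raw bytes of sample.txt through a TFileStream and treats them as UTF-16 code units, so its output can be checked against what GHex shows. It is only a sketch under assumptions: the file is assumed to be UTF-16LE, the host is assumed to be little-endian, and the program name readutf16sketch and the helper LoadUtf16LE are hypothetical names of mine, not part of the RTL. It is not a fix for ReadLn itself.

```pascal
program readutf16sketch;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: loads a file as UTF-16LE code units.
// Assumes a little-endian host; a trailing odd byte is ignored.
function LoadUtf16LE(const FileName: string): UnicodeString;
var
  Stream: TFileStream;
  Units: SizeInt;
begin
  Result := '';
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    Units := Stream.Size div SizeOf(WideChar);
    SetLength(Result, Units);
    if Units > 0 then
      Stream.ReadBuffer(Result[1], Units * SizeOf(WideChar));
  finally
    Stream.Free;
  end;
  // Drop the byte order mark (U+FEFF) if the file starts with one
  if (Length(Result) > 0) and (Ord(Result[1]) = $FEFF) then
    Delete(Result, 1, 1);
end;

var
  Content: UnicodeString;
  Hex: string;
  LineNo, i: Integer;
begin
  Content := LoadUtf16LE('sample.txt');
  LineNo := 0;
  Hex := '';
  for i := 1 to Length(Content) do
    case Ord(Content[i]) of
      13: ; // CR: skipped, the following LF ends the line
      10:
        begin
          // LF: print the collected hex values as one line
          Inc(LineNo);
          WriteLn(LineNo, ':', Hex);
          Hex := '';
        end;
    else
      // Any other code unit is shown as 4 hex digits
      Hex := Hex + ' ' + IntToHex(Ord(Content[i]), 4);
    end;
  // Print the last line if the file does not end with a line break
  if Hex <> '' then
  begin
    Inc(LineNo);
    WriteLn(LineNo, ':', Hex);
  end;
end.
```

Because the line break is detected on whole 16-bit code units here, a UTF-16 CR/LF pair counts as one line end. If this dump reports 4 lines for sample.txt while ReadLn reports 9, that would suggest ReadLn is splitting on the individual CR and LF bytes, but I have not confirmed that in the RTL source.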
Here is the source code: