
Author Topic: Readln and UTF16 TextFiles  (Read 25645 times)

Handoko

  • Hero Member
  • *****
  • Posts: 5399
  • My goal: build my own game engine using Lazarus
Re: Readln and UTF16 TextFiles
« Reply #60 on: July 23, 2020, 07:07:13 pm »
My tests showed that ReadLn fails to correctly read a UTF16 little endian file, and that WriteLn does not write correct line endings for UTF16 little endian data. I'm not skilled enough to debug into the code; I hope some more experienced members will check it. If the data you provided is in a legitimate encoding, then it is surely a bug in ReadLn/WriteLn. Maybe you can consider reporting it to the bugtracker.

It is midnight here, I'm going to sleep now. I will test your sample3.txt tomorrow.

MarkMLl

  • Hero Member
  • *****
  • Posts: 8364
Re: Readln and UTF16 TextFiles
« Reply #61 on: July 23, 2020, 08:02:09 pm »
Quote
- WriteLn does not write correctly for widestring little endian, so I used Write only

I suspect that WriteLn() with no parameter is a major problem since without a hint associated with the file/stream it won't know how to encode the EOL.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Logitech, TopSpeed & FTL Modula-2 on bare metal (Z80, '286 protected mode).
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

Handoko

  • Hero Member
  • *****
  • Posts: 5399
  • My goal: build my own game engine using Lazarus
Re: Readln and UTF16 TextFiles
« Reply #62 on: July 24, 2020, 08:23:54 am »
Quote
- WriteLn does not write correctly for widestring little endian, so I used Write only

I suspect that WriteLn() with no parameter is a major problem since without a hint associated with the file/stream it won't know how to encode the EOL.

I did think about it. I even wrote a function to check it: before WriteLn writes the EOL, it checks whether the UTF16 data is big or little endian, then it writes the proper EOL. The pseudocode is something like this:

procedure WriteLnUTF16(var F: TextFile; Buffer: WideString);
var
  Index, Count:   Integer;
  isLittleEndian: Boolean;
  CRLF:           string;
begin

  // Count the null bytes found at even (1-based) positions
  Count := 0;
  for Index := 1 to Length(Buffer) do
    if not(Odd(Index)) and (Buffer[Index] = #0) then
      Inc(Count);
  isLittleEndian := Count > (Length(Buffer) div 4);

  // Pick the line ending that matches the detected byte order
  case isLittleEndian of
    True:  CRLF := #13#0#10#0;
    False: CRLF := #0#13#0#10;
  end;

  // .... do the writing

end;


Although the solution isn't very accurate, it should work for most cases, because:

Quote
It is also reliable to detect endianness by looking for null bytes, on the assumption that characters less than U+0100 are very common. If more even bytes (starting at 0) are null, then it is big-endian.
Source: https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes
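As a rough sketch of the null-byte heuristic quoted above, here is a version that works on raw bytes rather than on a WideString. The name LooksLittleEndian and the sample byte arrays are made up for illustration, and the heuristic only holds when most characters are below U+0100.

```pascal
program GuessEndian;
{$mode objfpc}{$H+}

// Null-byte heuristic (assumes most characters are below U+0100):
// big-endian data has its zero bytes at even offsets (0, 2, 4, ...),
// little-endian data has them at odd offsets.
function LooksLittleEndian(const Data: array of Byte): Boolean;
var
  i, EvenNulls, OddNulls: Integer;
begin
  EvenNulls := 0;
  OddNulls := 0;
  for i := 0 to High(Data) do
    if Data[i] = 0 then
    begin
      if Odd(i) then
        Inc(OddNulls)
      else
        Inc(EvenNulls);
    end;
  Result := OddNulls > EvenNulls;
end;

const
  // "Bo" encoded both ways (hypothetical sample data, no BOM)
  SampleLE: array[0..3] of Byte = ($42, $00, $6F, $00);
  SampleBE: array[0..3] of Byte = ($00, $42, $00, $6F);

begin
  WriteLn('SampleLE little-endian? ', LooksLittleEndian(SampleLE));
  WriteLn('SampleBE little-endian? ', LooksLittleEndian(SampleBE));
end.
```

When a BOM is present (FF FE or FE FF), checking it first is more reliable than counting nulls; the heuristic is only a fallback for data without a BOM.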

Unfortunately the code didn't work (on OP's Sample.txt), because of the issues with ReadLn. Another possible solution is to introduce a new WriteLn with an Endian parameter.

I said ReadLn has issues; here is a simple test. Sample.txt has 4 lines, which you can verify by opening the file in a text editor that supports UTF16 (Writer, Pluma, etc). But with the code below, ReadLn parses the file as 9 lines:

Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
var
  Buffer: WideString;
  inText: Text;
  Count: Integer;
begin
  AssignFile(inText, 'Sample.txt');
  Reset(inText);
  Count := 0;
  while not Eof(inText) do begin
    ReadLn(inText, Buffer);
    Inc(Count);
  end;
  ShowMessage(Count.ToString);
  CloseFile(inText);
end;

So I am sure there are bugs in ReadLn. I heard FreeBASIC supports both UTF-16BE and UTF-16LE. And what about Delphi? Does anybody here use FreeBASIC or Delphi? Can anyone please test this case on them?

Quote
Hi Handoko,
Finally, I again adapted your code to a GUI version, the only difference is that when the form is loaded, it runs your procedure (and push the output to a memo instead of the console output). Attached sample3.txt which contains garbage characters instead of the accentuated ones, memo output is corrupted too (only "��B" is displayed). This is my main concern about what can explain such difference with same source code?

I tried to open it using Writer, and it shows too many garbage characters. I tried to open it using online file viewers, and most of them refuse to open it. As you said, it contains garbage characters, so what result did you expect?

Garbage in, garbage out - the basic theory any programmer should know.
https://en.wikipedia.org/wiki/Garbage_in%2C_garbage_out
« Last Edit: July 24, 2020, 09:25:22 am by Handoko »

HomeBoy38

  • Jr. Member
  • **
  • Posts: 59
Re: Readln and UTF16 TextFiles
« Reply #63 on: July 24, 2020, 08:55:59 am »
Hi Handoko,

The interest was not in the output I posted, but in the fact that your code behaves totally differently in a form than in a console application (it is even worse than just the EOL issue). I was expecting the same results.
I can understand the Readln/Writeln not being utf16 compliant, I cannot understand why the kind of project has an impact on these functions (readln does not even read the bom correctly).

I tried Visual Basic a long time ago; I was disappointed because of DLL dependencies and weird behaviour depending on the computer running my programs. I turned to Delphi 7 (for which I installed the Tnt Unicode) to make some simple unicode programs for my job, and finally I switched to Lazarus some years ago, trying to port some of my programs (personal use only).
As we are talking about context, if I understood correctly, you are in Asia? Which encoding is mainly used there: unicode or ansi codepage? In this thread, UTF32 was mentioned, but I am not ready to handle it...

MarkMLl

  • Hero Member
  • *****
  • Posts: 8364
Re: Readln and UTF16 TextFiles
« Reply #64 on: July 24, 2020, 09:01:53 am »
Quote
Other possible solution is to introduce new WriteLn with an Endian parameter.

A better solution might be something slightly more general: a way of specifying whether a WideString constant is explicitly BE or LE. If that included being able to specify the endianness of an empty string, then that empty string could be passed to WriteLn() as a hint.

Alternatively if an explicit BOM hasn't been written to a UTF-16 file/stream then it implicitly takes the ordering "natural" to the host computer, with no overrides. That would work even for the tricky case of a file containing a single blank line, which couldn't take previous lines as a hint and couldn't defer blank-line output until something non-blank was written.
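To make the idea of an explicit byte-order hint concrete, here is a minimal sketch that writes to a TStream instead of a Text file. WriteUnit and WriteLnUTF16 are hypothetical names, not existing RTL routines; the point is only that the EOL gets encoded in the same byte order as the text, which also covers the single blank line case.

```pascal
program WriteLnEndian;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Write one 16-bit code unit in the requested byte order.
procedure WriteUnit(Stream: TStream; CodeUnit: Word; BigEndian: Boolean);
var
  B: array[0..1] of Byte;
begin
  if BigEndian then
  begin
    B[0] := Hi(CodeUnit);
    B[1] := Lo(CodeUnit);
  end
  else
  begin
    B[0] := Lo(CodeUnit);
    B[1] := Hi(CodeUnit);
  end;
  Stream.WriteBuffer(B, SizeOf(B));
end;

// Write S followed by CR LF, all in the same byte order.
procedure WriteLnUTF16(Stream: TStream; const S: UnicodeString;
  BigEndian: Boolean);
var
  i: Integer;
begin
  for i := 1 to Length(S) do
    WriteUnit(Stream, Ord(S[i]), BigEndian);
  WriteUnit(Stream, 13, BigEndian);
  WriteUnit(Stream, 10, BigEndian);
end;

var
  MS: TMemoryStream;
begin
  MS := TMemoryStream.Create;
  try
    WriteUnit(MS, $FEFF, False);   // BOM, stored as FF FE for little endian
    WriteLnUTF16(MS, '', False);   // a file containing a single blank line
    // Dump the result as hex to check the encoding of the EOL
    MS.Position := 0;
    while MS.Position < MS.Size do
      Write(IntToHex(MS.ReadByte, 2), ' ');
    WriteLn;                       // prints: FF FE 0D 00 0A 00
  finally
    MS.Free;
  end;
end.
```

Passing True instead of False would give FE FF 00 0D 00 0A for the same calls.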

Quote
Hi Handoko,
Finally, I again adapted your code to a GUI version, the only difference is that when the form is loaded, it runs your procedure (and push the output to a memo instead of the console output). Attached sample3.txt which contains garbage characters instead of the accentuated ones, memo output is corrupted too (only "��B" is displayed). This is my main concern about what can explain such difference with same source code?

I tried to open it using Writer, it show too many garbage characters. Tried to open it using online file viewer, most of them refuse to open it. As you said it contains garbage characters, so what result did you hope?

Garbage in, garbage out - the basic theory any programmer should know.
https://en.wikipedia.org/wiki/Garbage_in%2C_garbage_out

Again, OP isn't cooperating by checking hex data, so it's impossible to separate any issues caused by bad rendering in GUI components from the underlying ones.

MarkMLl

Handoko

  • Hero Member
  • *****
  • Posts: 5399
  • My goal: build my own game engine using Lazarus
Re: Readln and UTF16 TextFiles
« Reply #65 on: July 24, 2020, 09:17:01 am »
Quote
I turned to Delphi 7 (for which I installed the Tnt Unicode) to make some simple unicode programs for my job

So Delphi 7 + extra module can work with UTF16 correctly.

Quote
Hi Handoko,
As we are talking about context, if I understood correctly, you are in Asia? Which encoding is mainly used there: unicode or ansi codepage?

Yes, I am Chinese, but my ancestors moved to Indonesia a long time ago, before I was born. Here we only use a-z and 0-9. That's easy, no headache.

Quote
Again, OP isn't cooperating ...

Maybe he does not know how to analyze and manipulate the hex data (in memory), or maybe he is writing a proprietary program and is not willing to share too much information. Anyway, it is clear FPC does not fully support UTF16; we can still make it work, but some workarounds are needed.

HomeBoy38

  • Jr. Member
  • **
  • Posts: 59
Re: Readln and UTF16 TextFiles
« Reply #66 on: July 24, 2020, 09:34:23 am »
Can you please try to open in Lazarus the Project2 I posted in #53 and just run it (it does not contain Handoko's fixes, but I can add them if needed)?

Yes, I do not know how to debug the way you are suggesting, but just have a look at the BOM as read and (for example) the first accented character (ignore the EOL at this stage).
It is the same code as in Project1 (the command line one, which works better).

Handoko

  • Hero Member
  • *****
  • Posts: 5399
  • My goal: build my own game engine using Lazarus
Re: Readln and UTF16 TextFiles
« Reply #67 on: July 24, 2020, 10:10:51 am »
Good to know you want to learn debugging.
Here is how I did it.

1.
I added some code in the loop so that the result is shown in pairs of 2 bytes (with a space between pairs). This is useful because (I checked Wikipedia) UTF16 data always comes in 2-byte units.

Code: Pascal
    begin
       S := S + IntToHex(BA[i], 2);
       if Odd(i) then S := S + ' ';
    end;

2.
What can the image tell us? See the image below. ReadLn parsed the file as 9 lines, but we know it should be only 4 lines. So it is clear ReadLn does not work the way we want.

3.
Where is the problem? There are 5 stray $00 characters, each read as a line of its own. That's why my workaround is:

  // Skip previous line ending remain (fix readln issue)
  if (Length(Buffer) = 1) then Exit;


4.
It should be okay now. But no, the byte order looks flipped. The data on the first line is $4200 $6F00 ..., but the other lines start with $00, for example line #3: $0044 $003F

5.
So I wrote code to shift the data by one byte if it starts with $00.

  // Shift buffer (fix readln issue)
  if Buffer[1] = #0 then
  begin
    for i := 0 to Length(Buffer)-2 do
      Buffer[i+1] := Buffer[i+2];
    SetLength(Buffer, Length(Buffer)-1);
  end;


6.
Basically, we have solved the ReadLn problems. But I later found that WriteLn does not correctly write the line ending characters; I found this by examining the hex code of the output file. So I manually write the line ending using Write.

  // Add line ending
  NewBuffer := Buffer + #13#0#10#0;
  // Save
  Write(F, NewBuffer);
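The observations in the steps above (9 lines instead of 4, five one-character $00 lines, later lines starting with $00) are what one would expect if the reader works on single bytes and treats each CR and LF byte as a line end. That is an assumption about the cause, not something confirmed from the RTL source. The sketch below models such a reader on an ASCII-only stand-in for Sample.txt; AsciiToUTF16LE and CountByteLines are made-up helper names.

```pascal
program CountByteLinesDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Encode an ASCII-only string as UTF-16LE bytes, preceded by the FF FE BOM.
function AsciiToUTF16LE(const S: AnsiString): TBytes;
var
  i: Integer;
begin
  Result := nil;
  SetLength(Result, 2 + 2 * Length(S));
  Result[0] := $FF;
  Result[1] := $FE;
  for i := 1 to Length(S) do
  begin
    Result[2 * i] := Ord(S[i]);  // low byte first
    Result[2 * i + 1] := 0;      // high byte is zero for ASCII
  end;
end;

// Model of a byte-oriented reader: CR, LF and CR+LF (adjacent bytes only)
// each end a line. NullOnly counts lines consisting of a single $00 byte.
procedure CountByteLines(const Data: TBytes; out Lines, NullOnly: Integer);
var
  i, LineStart: Integer;
  AtBreak: Boolean;
begin
  Lines := 0;
  NullOnly := 0;
  LineStart := 0;
  i := 0;
  // Run one step past the end so trailing bytes form a final line.
  while i <= Length(Data) do
  begin
    AtBreak := (i = Length(Data)) or (Data[i] = 13) or (Data[i] = 10);
    if AtBreak then
    begin
      // At end of data, count a line only if bytes remain after the last break
      if (i < Length(Data)) or (i > LineStart) then
      begin
        Inc(Lines);
        if (i - LineStart = 1) and (Data[LineStart] = 0) then
          Inc(NullOnly);
      end;
      if (i + 1 < Length(Data)) and (Data[i] = 13) and (Data[i + 1] = 10) then
        Inc(i);
      LineStart := i + 1;
    end;
    Inc(i);
  end;
end;

var
  Lines, NullOnly: Integer;
begin
  // Four lines of text, ASCII approximations of the original sample
  CountByteLines(
    AsciiToUTF16LE('Bonjour'#13#10'Decembre'#13#10'Chaque dejeuner'#13#10'Pele-Mele !'#13#10),
    Lines, NullOnly);
  WriteLn('Lines: ', Lines, ', null-only lines: ', NullOnly);
  // prints: Lines: 9, null-only lines: 5
end.
```

In UTF16 little endian, CR LF is stored as 0D 00 0A 00, so the $00 between CR and LF becomes a line of its own, and the $00 after LF ends up at the start of the next line. That matches both workarounds: skipping buffers of length 1 and shifting buffers that start with #0.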


Quote
... but just have a look at the read bom

The BOM is not very useful here. I simply ignored it because the formation of the data provides more useful information.
« Last Edit: July 24, 2020, 10:36:22 am by Handoko »

HomeBoy38

  • Jr. Member
  • **
  • Posts: 59
Re: Readln and UTF16 TextFiles
« Reply #68 on: July 24, 2020, 11:27:57 am »
I see the value of using HEX values to display what is happening. I had started trying to fix the EOL issue before your proposal, and after reading your code I can see I was still far from the solution...

If you compare your results in the screenshot of your last post to MarkMLl's command line version:
Quote
$ ./homeboy38
$ xxd Sample.txt
00000000: fffe 4200 6f00 6e00 6a00 6f00 7500 7200  ..B.o.n.j.o.u.r.
00000010: 0d00 0a00 4400 e900 6300 6500 6d00 6200  ....D...c.e.m.b.
00000020: 7200 6500 0d00 0a00 4300 6800 e800 7100  r.e.....C.h...q.
00000030: 7500 6500 2000 6400 e900 6a00 6500 7500  u.e. .d...j.e.u.
00000040: 6e00 6500 7200 0d00 0a00 5000 ea00 6c00  n.e.r.....P...l.
00000050: 6500 2d00 4d00 ea00 6c00 6500 2000 e000  e.-.M...l.e. ...
00000060: 2000 ac20 2000 2100 0d00 0a00             ..  .!.....

The first line is similar, no accent, same hex values.
The second line is different: the D is ok ("0044"), but the "é" is "00E9" in the command line version, whereas in the form it is "003F" (see the "?").
I might have messed up my code, because the "é" is "00EF BFBD" in my form (the rest of the line is the same as yours)!
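For what it is worth, both byte patterns have a recognisable shape: $3F is the ASCII question mark, and EF BF BD is the UTF-8 encoding of U+FFFD, the Unicode replacement character. Both commonly show up when some conversion step could not map a character, although this does not by itself show where in the form code that conversion happens. A minimal check, assuming FPC 3.x (Utf8Hex is a made-up helper):

```pascal
program ReplacementChar;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Return the UTF-8 bytes of U as space-separated hex pairs.
function Utf8Hex(const U: UnicodeString): string;
var
  S: RawByteString;
  i: Integer;
begin
  S := UTF8Encode(U);
  Result := '';
  for i := 1 to Length(S) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[i]), 2);
  end;
end;

begin
  WriteLn(Utf8Hex(WideChar($FFFD)));  // U+FFFD REPLACEMENT CHARACTER: EF BF BD
  WriteLn(IntToHex(Ord('?'), 2));     // the question mark: 3F
end.
```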

Handoko

  • Hero Member
  • *****
  • Posts: 5399
  • My goal: build my own game engine using Lazarus
Re: Readln and UTF16 TextFiles
« Reply #69 on: July 24, 2020, 12:24:40 pm »
Yes, I saw the 003F you mentioned, but it does not surprise me. As we already know, FPC does not fully support UTF16. The more you test, the more strange things may happen.

Quote
UTF-16 is used internally by systems such as Microsoft Windows, the Java programming language and JavaScript/ECMAScript. It is also often used for plain text and for word-processing data files on MS Windows. It is rarely used for files on Unix-like systems. As of May 2019, Microsoft seems to have reversed course and now supports and recommends using UTF-8.[2]

UTF-16 never gained popularity on the web, where it is used by under 0.01% (1 hundredth of 1 percent) of web pages,[3] and UTF-8 dominates.[4] The Web Hypertext Application Technology Working Group (WHATWG) considers UTF-8 "the mandatory encoding for all [text]" and that for security reasons browser apps should not use UTF-16.[5] It is also the only web-encoding incompatible with ASCII.[6]
Source: https://en.wikipedia.org/wiki/UTF-16

Microsoft recommends UTF-8, and UTF-16 is used by under 0.01% of web pages. So I don't care much about UTF-16 unless I'm writing a paid project that requires UTF-16 support.

If you really need to process UTF-16 files, I think it will be better to use BlockRead instead of ReadLn.
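A minimal sketch of that approach, here using TFileStream rather than BlockRead: read the raw bytes, pick the byte order from the BOM, and only then look for line breaks on 16-bit code units. LoadBytes, DecodeUTF16, the file name and the sample bytes are all made up for illustration; surrogate pairs are passed through untouched, and little endian is assumed when there is no BOM.

```pascal
program ReadUtf16Lines;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  DemoFile = 'utf16_demo.txt';  // hypothetical file name
  // BOM, "Hi", CR LF, "é!", CR LF in UTF-16LE (made-up test data)
  Sample: array[0..17] of Byte = (
    $FF, $FE, $48, $00, $69, $00, $0D, $00, $0A, $00,
    $E9, $00, $21, $00, $0D, $00, $0A, $00);

// Read the whole file into memory as raw bytes.
function LoadBytes(const FileName: string): TBytes;
var
  FS: TFileStream;
begin
  Result := nil;
  FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, FS.Size);
    if Length(Result) > 0 then
      FS.ReadBuffer(Result[0], Length(Result));
  finally
    FS.Free;
  end;
end;

// Decode UTF-16 bytes. A BOM selects the byte order; without one this
// sketch assumes little endian.
function DecodeUTF16(const Data: TBytes): UnicodeString;
var
  i, Start: Integer;
  BigEndian: Boolean;
  B0, B1: Word;
begin
  BigEndian := (Length(Data) >= 2) and (Data[0] = $FE) and (Data[1] = $FF);
  if BigEndian or
     ((Length(Data) >= 2) and (Data[0] = $FF) and (Data[1] = $FE)) then
    Start := 2
  else
    Start := 0;
  Result := '';
  SetLength(Result, (Length(Data) - Start) div 2);
  for i := 1 to Length(Result) do
  begin
    B0 := Data[Start + 2 * (i - 1)];
    B1 := Data[Start + 2 * (i - 1) + 1];
    if BigEndian then
      Result[i] := WideChar(Word((B0 shl 8) or B1))
    else
      Result[i] := WideChar(Word((B1 shl 8) or B0));
  end;
end;

var
  FS: TFileStream;
  Text16: UnicodeString;
  i, Lines: Integer;
begin
  // Write the demo file first so the sketch is self-contained.
  FS := TFileStream.Create(DemoFile, fmCreate);
  try
    FS.WriteBuffer(Sample, SizeOf(Sample));
  finally
    FS.Free;
  end;

  Text16 := DecodeUTF16(LoadBytes(DemoFile));
  // Count line breaks on 16-bit code units, not on bytes.
  Lines := 0;
  for i := 1 to Length(Text16) do
    if Text16[i] = #10 then
      Inc(Lines);
  WriteLn('Code units: ', Length(Text16), ', lines: ', Lines);
  WriteLn('Second line starts with $', IntToHex(LongInt(Ord(Text16[5])), 4));
end.
```

With the sample data this reports 8 code units and 2 lines, and the second line starts with $00E9, so the accented character survives because no codepage conversion is involved.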
« Last Edit: July 24, 2020, 12:31:35 pm by Handoko »

HomeBoy38

  • Jr. Member
  • **
  • Posts: 59
Re: Readln and UTF16 TextFiles
« Reply #70 on: July 24, 2020, 12:39:57 pm »
This was more or less my question to you about which encoding is used; the drawback of utf16 is the size (doubled).
So the conclusion of the thread is FPC does not support utf16...

I intended to use ReadLn, but in the end I might end up with TFileStream, as I used when my program was under Delphi.

Thanks again for everyone's contributions.

Handoko

  • Hero Member
  • *****
  • Posts: 5399
  • My goal: build my own game engine using Lazarus
Re: Readln and UTF16 TextFiles
« Reply #71 on: July 24, 2020, 12:42:11 pm »
Quote
I can understand the Readln/Writeln not being utf16 compliant, I cannot understand why the kind of project has an impact on these functions (readln does not even read the bom correctly).

I now understand what you meant. But to answer the question we need to look at the code of how ReadLn works, and that is beyond my skill.

But, hey, your finding is really valuable for improving FPC's UTF16 support. If you are willing to sacrifice your time, keep testing. The more bugs are reported, the better Lazarus/FPC gets.

HomeBoy38

  • Jr. Member
  • **
  • Posts: 59
Re: Readln and UTF16 TextFiles
« Reply #72 on: July 24, 2020, 01:25:37 pm »
I do not have a problem spending time to improve the product; my problem is more my skill level. This thread shows how difficult the communication and my understanding of things were. I am afraid to be more pain than help.

Handoko

  • Hero Member
  • *****
  • Posts: 5399
  • My goal: build my own game engine using Lazarus
Re: Readln and UTF16 TextFiles
« Reply #73 on: July 24, 2020, 01:48:01 pm »
Quote
I am afraid to be more pain than help.

Let's think positively. Your posts really made us better understand how bad the FPC UTF-16 support is.

If you want others to better understand the issue, next time try to provide:
- Compilable source code, a stripped-down version is okay
- The data that can be used for showing the issue
- Screenshots
- The steps to do it
- The environments you used for the test (OS, Lazarus/FPC version, etc)

MarkMLl

  • Hero Member
  • *****
  • Posts: 8364
Re: Readln and UTF16 TextFiles
« Reply #74 on: July 24, 2020, 02:50:08 pm »
Quote
Let's think the positive thing. Your posts really made us now better understand how bad the FPC UTF-16 support is.

I wouldn't say /bad/, so much, as would have benefited from much more aggressive testing.

I echo what's been said about test programs and files. Also in this case one of the potential culprits might have been any Unicode-management DLLs or SO libraries, which would have meant looking very carefully at the operating environment.

MarkMLl

 
