Author Topic: Readln and UTF16 TextFiles  (Read 25652 times)

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Readln and UTF16 TextFiles
« Reply #45 on: July 22, 2020, 11:08:55 pm »
Hi!

If you save the text as UTF-16 in LibreOffice, you have to activate the checkbox for the BOM!

Attached is a UTF-16 text with BOM, made with LibreOffice.
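
To confirm whether a saved file really starts with a BOM, the first bytes can be inspected directly. The following is only a minimal sketch: the function name DescribeBom is hypothetical, and the file name Sample.txt is taken from this thread.

```pascal
program CheckBom;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Returns a short description of the byte-order mark found at the start
// of the file. DescribeBom is a hypothetical helper name for this sketch.
function DescribeBom(const FileName: string): string;
var
  Stream: TFileStream;
  Head: array[0..2] of Byte;
  Count: LongInt;
begin
  Head[0] := 0;
  Head[1] := 0;
  Head[2] := 0;
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // Read returns the number of bytes actually read (may be fewer than 3)
    Count := Stream.Read(Head, SizeOf(Head));
  finally
    Stream.Free;
  end;
  if (Count >= 2) and (Head[0] = $FF) and (Head[1] = $FE) then
    Result := 'UTF-16 little-endian BOM (FF FE)'
  else if (Count >= 2) and (Head[0] = $FE) and (Head[1] = $FF) then
    Result := 'UTF-16 big-endian BOM (FE FF)'
  else if (Count = 3) and (Head[0] = $EF) and (Head[1] = $BB) and (Head[2] = $BF) then
    Result := 'UTF-8 BOM (EF BB BF)'
  else
    Result := 'no BOM recognised';
end;

begin
  WriteLn(DescribeBom('Sample.txt'));
end.
```

Because the file is opened as a binary stream, the result does not depend on how ReadLn or a GUI control interprets the text.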

Winni

HomeBoy38

  • Jr. Member
  • **
  • Posts: 59
Re: Readln and UTF16 TextFiles
« Reply #46 on: July 23, 2020, 08:22:58 am »
Thanks, I learned something in LibreOffice.
The produced file is the same as mine, so MarkMLl's procedure still has the issue (it works in a console app, not in a GUI one).

MarkMLl

  • Hero Member
  • *****
  • Posts: 8364
Re: Readln and UTF16 TextFiles
« Reply #47 on: July 23, 2020, 08:35:42 am »
Quote from: HomeBoy38 on July 23, 2020, 08:22:58 am
Thanks, I learned something in LibreOffice.
The produced file is the same as mine, and so MarkMLl's procedure still has the issue (works under a console app, not a gui one)

One word of clarification. In my case the same program appeared to work identically whether the program was compiled from the command line or built by Lazarus. So I agree with your "console vs GUI" distinction, and you need to look at the buffer variable as binary rather than relying on GUI facilities to see what's in it.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Logitech, TopSpeed & FTL Modula-2 on bare metal (Z80, '286 protected mode).
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

MarkMLl

  • Hero Member
  • *****
  • Posts: 8364
Re: Readln and UTF16 TextFiles
« Reply #48 on: July 23, 2020, 08:39:26 am »
Quote
I just did some quick tests.

- MarkMLI code (post #39) didn't work on OP's Sample.txt (post #35)
- MarkMLI code worked correctly after resaving Sample.txt using LibreOffice Writer

So I believe, FPC works correctly with UTF-16 characters. It should be some uncorrect format in the original Sample.txt. Tested on Lazarus 2.0.10 Linux, LibreOffice Writer 6.4.3.2.

Note that the Sample.txt I was using has a BOM at the start; I did post a hex display of it.

OP, please check this:

$ file Sample.txt
Sample.txt: Little-endian UTF-16 Unicode text, with CRLF line terminators
$ cksum Sample.txt
2710770100 108 Sample.txt
$ md5sum Sample.txt
1421caf78a17dd9900692eb82a2c3814  Sample.txt

MarkMLl

HomeBoy38

  • Jr. Member
  • **
  • Posts: 59
Re: Readln and UTF16 TextFiles
« Reply #49 on: July 23, 2020, 08:47:24 am »
MarkMLl,

The difference is not how I build the program (I built both using the Lazarus GUI), but how the project is defined in Lazarus (File / New / Project / Application or Program). The code is slightly different in the header:
Code: Pascal
unit unit2;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, Controls, Graphics, Dialogs;

type
  TForm1 = class(TForm)
  private

  public

  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

var
  buffer, buffer2: widestring;
  inText, outText: Text;
// Unit initialization section: runs once when the program starts
begin
  DefaultTextLineBreakStyle := tlbsCRLF;
  AssignFile(inText, 'Sample.txt');
  Reset(inText);
  AssignFile(outText, 'Sample2.txt');
  Rewrite(outText);
  // Copy the input to the output line by line
  while not Eof(inText) do begin
    ReadLn(inText, buffer);
    WriteLn(outText, buffer);
  end;
  CloseFile(outText);
  CloseFile(inText);
end.

MarkMLl

  • Hero Member
  • *****
  • Posts: 8364
Re: Readln and UTF16 TextFiles
« Reply #50 on: July 23, 2020, 09:34:35 am »
Quote from: HomeBoy38 on July 23, 2020, 08:47:24 am
The difference is not how I build the program (I built both using Lazarus GUI), but how the project is defined (File / New / Project / Application or Program) in Lazarus, the code is slightly different in the header:

I know, I was making the point for the avoidance of all possible doubt.

Again, for the avoidance of all doubt, please check the checksum of your test file in case it's getting mangled by the download. And I don't think you've posted a complete program demonstrating the problem yet.

MarkMLl

HomeBoy38

  • Jr. Member
  • **
  • Posts: 59
Re: Readln and UTF16 TextFiles
« Reply #51 on: July 23, 2020, 10:11:19 am »
MarkMLl,

I double checked the md5sum of my sample and I have the same value as you.

I am now working on 2 separate projects (not my application), and I have no more code than what I have published:
- one is your exact code as provided (command line version)
- one is what I posted 2 posts ago (GUI application)
I do not have more.

Note: so far I am trying to understand the end-of-line issue.

MarkMLl

  • Hero Member
  • *****
  • Posts: 8364
Re: Readln and UTF16 TextFiles
« Reply #52 on: July 23, 2020, 10:24:55 am »
You still need to look at the hex content of the strings you're reading.

You have still not posted a full program. I think by now that it's fairly obvious what's going on, but unless you do that it's not possible to verify it.

MarkMLl

HomeBoy38

  • Jr. Member
  • **
  • Posts: 59
Re: Readln and UTF16 TextFiles
« Reply #53 on: July 23, 2020, 10:56:28 am »
Maybe you are asking for the whole files? I attached both projects (I removed executables because of the size):
#1/ project1 is the command line version (based on your proposal), output is:
Code:
FFFE42006F006E006A006F0075007200
00
004400E900630065006D00620072006500
00
0043006800E80071007500650020006400E9006A00650075006E0065007200
00
005000EA006C0065002D004D00EA006C0065002000E0002000AC2020002100
00
00

#2/ project2 is the GUI version (almost identical to project1), memo1 output is:
Code:
EFBFBDEFBFBD42006F006E006A006F0075007200
00
004400EFBFBD00630065006D00620072006500
00
0043006800EFBFBD0071007500650020006400EFBFBD006A00650075006E0065007200
00
005000EFBFBD006C0065002D004D00EFBFBD006C0065002000EFBFBD002000EFBFBD2020002100
00
00

I get different hex values for the non-ASCII characters.
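
A note on the two dumps above: wherever the console output has a byte of 80 hex or above (FF, FE, E9, E8, EA, E0, AC), the GUI output has the three bytes EF BF BD. That sequence is the UTF-8 encoding of U+FFFD, the Unicode replacement character. A plausible reading, which is an assumption and not something confirmed in this thread, is that the GUI project treats the bytes as UTF-8 and substitutes U+FFFD for every byte that is not valid UTF-8. The sketch below only demonstrates the first, certain part: decoding EF BF BD yields a single code unit.

```pascal
program ReplacementChar;

{$mode objfpc}{$H+}

var
  Utf8Bytes: RawByteString;
  Decoded: UnicodeString;
begin
  // The three bytes that keep appearing in the GUI output
  SetLength(Utf8Bytes, 3);
  Utf8Bytes[1] := Chr($EF);
  Utf8Bytes[2] := Chr($BF);
  Utf8Bytes[3] := Chr($BD);

  // Decode them as UTF-8 into UTF-16 code units
  Decoded := UTF8Decode(Utf8Bytes);

  WriteLn('code units: ', Length(Decoded));
  if Length(Decoded) > 0 then
    WriteLn('first code unit: U+', HexStr(Ord(Decoded[1]), 4));
end.
```

This should report one code unit with the value FFFD. If that reading is right, the original byte values are already lost by the time the hex is produced in the GUI project, which would explain why the two dumps differ only at those positions.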

MarkMLl

  • Hero Member
  • *****
  • Posts: 8364
Re: Readln and UTF16 TextFiles
« Reply #54 on: July 23, 2020, 12:01:32 pm »
No I am NOT asking for the files. I am TELLING YOU that when you're debugging in a GUI environment you need to look at the hex data in memory, rather than relying on the dialog(ue) boxes etc. to render the characters properly. Otherwise you've got two- possibly three- distinct problems, and very little chance of disentangling them.

Look, I'm sorry if I'm being curt but I've got paying work I'm trying to do here which is rather more pressing than trying to sort things out for you particularly if you don't listen to what's being said.

MarkMLl

HomeBoy38

  • Jr. Member
  • **
  • Posts: 59
Re: Readln and UTF16 TextFiles
« Reply #55 on: July 23, 2020, 01:54:53 pm »
I am listening, but maybe I am not understanding well. It may be my English, it may be that I do not know how to display the hex data in memory as suggested, or it may be that even if the data in memory is [in]correct, it would not help me, as I do not have that level of development skill.

I am sorry if you lost time on my problem. I was not looking for a quick answer (which explains the age of the topic) and I do not expect dedicated support. I just hope someone more qualified than me can understand/reproduce the issue, and that this will help someone later if it happens to her/him.
I am willing to run all the tests (this is why I tried both on a physical Windows 10 PC and in a Linux VM), as far as I understand them. At this point, I do not understand why the same piece of code produces different hex values under different OSes when used in a different project type, and I have no clue how to go about understanding it.

If you think I am not doing things well, maybe I should stop bothering people here and close this thread?

MarkMLl

  • Hero Member
  • *****
  • Posts: 8364
Re: Readln and UTF16 TextFiles
« Reply #56 on: July 23, 2020, 03:16:57 pm »
If I use a test file like this

When I was young, I did frequent
Scholar and sage, and heard much argument
About it and about...
But always came I out
By the same door as in I went.


prepared with CRLF line endings, and run it through the same textbook program as yesterday, I get identical output:

Code:
$ cksum Sample-ascii*
977717508 155 Sample-ascii2.txt
977717508 155 Sample-ascii.txt

$ xxd Sample-ascii2.txt
00000000: 5768 656e 2049 2077 6173 2079 6f75 6e67  When I was young
00000010: 2c20 4920 6469 6420 6672 6571 7565 6e74  , I did frequent
00000020: 0d0a 5363 686f 6c61 7220 616e 6420 7361  ..Scholar and sa
00000030: 6765 2c20 616e 6420 6865 6172 6420 6d75  ge, and heard mu
00000040: 6368 2061 7267 756d 656e 740d 0a41 626f  ch argument..Abo
00000050: 7574 2069 7420 616e 6420 6162 6f75 742e  ut it and about.
00000060: 2e2e 0d0a 4275 7420 616c 7761 7973 2063  ....But always c
00000070: 616d 6520 4920 6f75 740d 0a42 7920 7468  ame I out..By th
00000080: 6520 7361 6d65 2064 6f6f 7220 6173 2069  e same door as i
00000090: 6e20 4920 7765 6e74 2e0d 0a              n I went...

If you compare that with yesterday's output, you will see that if the test file is UTF-16 the EOLs don't get deleted by ReadLn():

Code:
00000020: 6200 7200 6500 0d0a 000d 0a00 4300 6800  b.r.e.......C.h.
------------------------------^^^^^^^^^

That's the first problem. The second problem is that WriteLn() is emitting 8-bit CRLF rather than 16-bit:

Code:
00000020: 6200 7200 6500 0d0a 000d 0a00 4300 6800  b.r.e.......C.h.
-------------------------^^^^
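
The way a byte-oriented reader splits UTF-16 data can be reproduced without any file at all. The sketch below is a simplified model, not FPC's actual ReadLn implementation: it assumes a reader that ends a line at CR (swallowing an LF that follows immediately) or at LF, which is consistent with the hex output posted earlier in this thread. The sample data is a made-up two-line text, "A" and "B", in UTF-16LE with CRLF endings.

```pascal
program SplitDemo;

{$mode objfpc}{$H+}

const
  // "A" CR LF "B" CR LF encoded as UTF-16LE, without a BOM (made-up sample)
  Data: array[0..11] of Byte =
    ($41, $00, $0D, $00, $0A, $00, $42, $00, $0D, $00, $0A, $00);

var
  i: Integer;
  HexLine: string;
begin
  HexLine := '';
  i := Low(Data);
  while i <= High(Data) do
  begin
    if Data[i] = $0D then
    begin
      // 8-bit CR ends the line; swallow an LF only if it follows directly
      WriteLn(HexLine);
      HexLine := '';
      if (i < High(Data)) and (Data[i + 1] = $0A) then
        Inc(i);
    end
    else if Data[i] = $0A then
    begin
      // A lone 8-bit LF also ends the line
      WriteLn(HexLine);
      HexLine := '';
    end
    else
      HexLine := HexLine + HexStr(Data[i], 2);
    Inc(i);
  end;
  // Whatever is left after the last terminator forms a final line
  if HexLine <> '' then
    WriteLn(HexLine);
end.
```

In this model the 00 high byte of each 16-bit CR and LF is never consumed as part of the line ending, so it shows up either as a one-byte line of its own or as a leading 00 on the next line. That is the same pattern as the alternating "00" lines in the project1 output.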

The potential third problem is that you're relying on a dialog(ue) box to render 16-bit Unicode properly. I think the easiest thing would be to set a watchpoint on your buffer variable and then use the Properties dialogue to select Memory Dump. An alternative would be to iterate explicitly over the string, dumping its content to the OS-level console. DO NOT rely on the console output window (also under View -> Debug Windows) for that, since the hex+ASCII display there is too much at the mercy of the underlying debugger.

So you've potentially got three problems. Since you've consistently refused to post a concise example of a complete program to demonstrate what's going on, that's about all that can be said. Oh, and you do need to bear in mind that the third problem (GUI display) will be OS and widget set specific even if the other two (file handling) appear not to be.
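
Iterating over the string and dumping its content, as suggested above, could look roughly like this. It is a minimal sketch: CodeUnitsToHex is a hypothetical helper name, and the three code units in the test string are chosen only for illustration.

```pascal
program DumpBuffer;

{$mode objfpc}{$H+}

// Returns every 16-bit code unit of S as four hex digits, separated by
// spaces. CodeUnitsToHex is a hypothetical helper name for this sketch.
function CodeUnitsToHex(const S: WideString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + HexStr(Ord(S[i]), 4);
  end;
end;

var
  Buffer: WideString;
begin
  // Build the test string from explicit code units so that no source-file
  // or console encoding is involved: "D", "e" with acute accent, "c"
  SetLength(Buffer, 3);
  Buffer[1] := WideChar($0044);
  Buffer[2] := WideChar($00E9);
  Buffer[3] := WideChar($0063);
  WriteLn(CodeUnitsToHex(Buffer));
end.
```

The point of the helper is that it never converts the string to another encoding, so what it prints is what is in memory. In a GUI project the returned string contains only ASCII hex digits and spaces, so it can be shown in any control without depending on how that control renders Unicode.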

MarkMLl

HomeBoy38

  • Jr. Member
  • **
  • Posts: 59
Re: Readln and UTF16 TextFiles
« Reply #57 on: July 23, 2020, 03:53:40 pm »
The first and second problems are what you had already figured out; you were better than me at finding the cause. I cannot judge whether it is a bug in readln/writeln or normal behavior, as I do not have enough skill for that. Thanks to you, I clearly see what is happening for this specific problem.

The outputs I produced were hex characters like yours (less readable, yes), and they show that both the BOM and the accented characters are not read the same way. I am converting the widestring to hex values and no longer displaying the widestring itself in those tests, so I should not have the issues you are referring to?

There is a communication issue between us, and I am sure it comes from me. I am not refusing to post anything. I posted both complete versions of my test project (containing only the part where I have the problem, based on your suggestion) and the UTF16 source test file; the projects produce the two hex results to compare (I expected them to be equal, but they are not).
I am really grateful for the help you are offering me; I really do not understand what you are expecting  :-[

Handoko

  • Hero Member
  • *****
  • Posts: 5399
  • My goal: build my own game engine using Lazarus
Re: Readln and UTF16 TextFiles
« Reply #58 on: July 23, 2020, 06:20:05 pm »
Hello, I'm back with my not so good workaround.

Tested using Sample.txt posted on #35 using Lazarus 2.0.10 Linux. It works. It is not optimized for performance and may not work on big endian data.

What I found:
- ReadLn does not work correctly on OP's data, it has 2 issues.
- WriteLn does not write correctly for widestring little endian, so I used Write only

Code: Pascal
program prjNewWriteLn;

{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}
  cthreads,
  {$ENDIF}
  Classes
  { you can add units after this };

// Writes one line as UTF-16LE bytes, compensating for ReadLn having split
// the input on 8-bit CR and LF. Each element of Buffer holds one file byte.
procedure WriteLnUTF16(var F: TextFile; Buffer: WideString);
var
  NewBuffer: WideString;
  i:         Integer;
begin
  // Skip previous line ending remain (fix readln issue);
  // "<= 1" also guards the Buffer[1] access below against an empty string
  if (Length(Buffer) <= 1) then Exit;
  // Shift buffer (fix readln issue)
  if Buffer[1] = #0 then
  begin
    for i := 0 to Length(Buffer)-2 do
      Buffer[i+1] := Buffer[i+2];
    SetLength(Buffer, Length(Buffer)-1);
  end;
  // Add line ending (CR and LF as 16-bit little-endian values)
  NewBuffer := Buffer + #13#0#10#0;
  // Save
  Write(F, NewBuffer);
end;

var
  buffer: widestring;
  inText, outText: Text;

begin
  AssignFile(inText, 'Sample.txt');
  Reset(inText);
  AssignFile(outText, 'Sample2.txt');
  Rewrite(outText);
  while not Eof(inText) do begin
    ReadLn(inText, buffer);
    WriteLn(buffer);
    WriteLnUTF16(outText, buffer);
//    WriteLn(outText, buffer)
  end;
  CloseFile(outText);
  CloseFile(inText)
end.
« Last Edit: July 23, 2020, 06:33:01 pm by Handoko »
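
For comparison with the workaround above, a different approach is to avoid ReadLn and WriteLn altogether and treat the file as binary. This is only a sketch under stated assumptions: LoadUtf16LE and SaveUtf16LE are hypothetical helper names, the file is assumed to be UTF-16 little-endian, and the machine is assumed to be little-endian so that the file bytes map directly onto 16-bit characters.

```pascal
program ReadUtf16;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Loads a UTF-16LE file into a UnicodeString and drops a leading BOM.
// Assumes a little-endian host; a trailing odd byte, if any, is ignored.
function LoadUtf16LE(const FileName: string): UnicodeString;
var
  Stream: TFileStream;
  Units: SizeInt;
begin
  Result := '';
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    Units := SizeInt(Stream.Size div SizeOf(WideChar));
    SetLength(Result, Units);
    if Units > 0 then
      Stream.ReadBuffer(Result[1], Units * SizeOf(WideChar));
  finally
    Stream.Free;
  end;
  if (Length(Result) > 0) and (Result[1] = WideChar($FEFF)) then
    Delete(Result, 1, 1);
end;

// Writes S as UTF-16LE, preceded by a BOM (same little-endian assumption).
procedure SaveUtf16LE(const FileName: string; const S: UnicodeString);
var
  Stream: TFileStream;
  Bom: WideChar;
begin
  Stream := TFileStream.Create(FileName, fmCreate);
  try
    Bom := WideChar($FEFF);
    Stream.WriteBuffer(Bom, SizeOf(Bom));
    if Length(S) > 0 then
      Stream.WriteBuffer(S[1], Length(S) * SizeOf(WideChar));
  finally
    Stream.Free;
  end;
end;

var
  Content: UnicodeString;
  i, LineFeeds: SizeInt;
begin
  Content := LoadUtf16LE('Sample.txt');
  // Line ends are now real 16-bit characters, so counting them is trivial
  LineFeeds := 0;
  for i := 1 to Length(Content) do
    if Content[i] = WideChar($000A) then
      Inc(LineFeeds);
  WriteLn('Line feeds found: ', LineFeeds);
  SaveUtf16LE('Sample2.txt', Content);
end.
```

Since the CR and LF characters stay inside the string as 16-bit values, no line-ending repair is needed on output. The trade-off is that the whole file is held in memory and big-endian input is not handled.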

HomeBoy38

  • Jr. Member
  • **
  • Posts: 59
Re: Readln and UTF16 TextFiles
« Reply #59 on: July 23, 2020, 06:49:21 pm »
Hi Handoko,

First, thanks: I tried it on my side and you fixed the end-of-line issue (both the file output and the console output are as expected). You might be right about big endian; it might need a little tweaking, but there is no need to work on it, as it should be straightforward.

Secondly, what is your opinion on this: is it an issue with readln and writeln, or is it normal behavior that the user needs to adapt his code to?

Finally, I again adapted your code to a GUI version. The only difference is that it runs your procedure when the form is loaded (and pushes the output to a memo instead of the console). The attached sample3.txt contains garbage characters instead of the accented ones, and the memo output is corrupted too (only "��B" is displayed). My main concern is this: what can explain such a difference with the same source code?

thanks again

 
