
Author Topic: Read file as UTF-8  (Read 7130 times)

GreatCorn

  • New Member
  • *
  • Posts: 40
    • GreatCorn
Read file as UTF-8
« on: April 28, 2020, 12:21:01 pm »
I've been trying to make this work for too long and I'm out of ideas. All I need is to read a UTF-8 text file (for example, one whose contents are "█©"). Of course, reading it as a file of char results in gibberish ("тЦИ┬й"), and reading it as a file of widechar results in question marks.
My code is:
Code: Pascal
var
  openFile: file of widechar;
  fileChar: widechar;
  command: widestring;

begin
  Assign(openFile, 'UTF8.txt');
  Reset(openFile);
  command := '';
  Reset(openFile);
  while not EOF(openFile) do
  begin
    Read(openFile, fileChar);
    command := command+fileChar;
  end;
  Close(openFile);
  WriteLn(command);
  ReadLn(command);
end.
Output: ??
Are there any ways to properly read and display a UTF-8 file? I'm using FreePascal.
« Last Edit: April 28, 2020, 08:38:51 pm by GreatCorn »

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: Read file as UTF-8
« Reply #1 on: April 28, 2020, 12:29:42 pm »
You could try this approach:
Code: Pascal
program project1;

{$mode objfpc}{$H+}

uses
  Classes;

var
  sl: TStringList;
  s: String;

begin
  sl := TStringList.Create;
  try
    sl.LoadFromFile('utf8.txt');
    for s in sl do
      WriteLn(s);
    ReadLn;
  finally
    sl.Free;
  end;
end.

GreatCorn

Re: Read file as UTF-8
« Reply #2 on: April 28, 2020, 12:55:23 pm »
Thanks, but I'm trying not to use objfpc mode or the Classes unit for compatibility reasons (also, Classes weighs 200+ KB).

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12645
  • FPC developer.
Re: Read file as UTF-8
« Reply #3 on: April 28, 2020, 01:44:00 pm »
Quote from GreatCorn:
Thanks, but I'm trying not to use objfpc mode or the Classes unit for compatibility reasons (also, Classes weighs 200+ KB).

You can do without classes, maybe, using

Code: Pascal
assignfile(f,name,65001);

But that still requires mode objfpc or delphi.

GreatCorn

Re: Read file as UTF-8
« Reply #4 on: April 28, 2020, 05:09:03 pm »
Quote from marcov:
Code: Pascal
assignfile(f,name,65001);
What is the third parameter though? Couldn't find any reference and even Lazarus doesn't recognize it.

Awkward

  • Full Member
  • ***
  • Posts: 154
Re: Read file as UTF-8
« Reply #5 on: April 28, 2020, 05:36:48 pm »
objpas.pp (OBJPAS mode)

Procedure AssignFile(out t:Text;p:pchar; aCodePage : TSystemCodePage);

GreatCorn

Re: Read file as UTF-8
« Reply #6 on: April 28, 2020, 06:27:19 pm »
Quote from Awkward:
objpas.pp (OBJPAS mode)

Procedure AssignFile(out t:Text;p:pchar; aCodePage : TSystemCodePage);

Strangely, my objpas doesn't contain an overload with aCodePage. Is this OS- or version-specific? I've got Win7, Lazarus 2.0.8.
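
For release compilers whose objpas lacks that overload, a possible alternative is SetTextCodePage from the System unit (available since FPC 3.0, as far as I can tell), called between Assign and Reset. The sketch below is an assumption rather than something confirmed in this thread, and it has not been checked against Lazarus 2.0.8 specifically. It assumes the file name UTF8.txt from the first post and needs neither objfpc mode nor Classes. It reports only the byte length of each line, so the result does not depend on what the console can display.

```pascal
program ReadUtf8Text;

var
  f: Text;
  line: UTF8String;   // declared as UTF-8, so UTF-8 input needs no conversion

begin
  Assign(f, 'UTF8.txt');          // assumed file name, taken from the first post
  SetTextCodePage(f, CP_UTF8);    // tag the file's contents as UTF-8 (65001)
  Reset(f);
  while not EOF(f) do
  begin
    ReadLn(f, line);
    // Print the byte count only; no non-ASCII text reaches the console here.
    WriteLn(Length(line), ' bytes read');
  end;
  Close(f);
end.
```

For reference, "█©" is 5 bytes in UTF-8 (E2 96 88 C2 A9), plus 3 more if a BOM is present, so a reported length of 5 or 8 would suggest that no conversion took place while reading.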

flowCRANE

  • Hero Member
  • *****
  • Posts: 937
Re: Read file as UTF-8
« Reply #7 on: April 28, 2020, 06:28:53 pm »
Code: Pascal
program Project1;
var
  Input: TextFile;
  Line: String;
begin
  Assign(Input, 'foo.txt');
  Reset(Input);

  while not EoF(Input) do
  begin
    ReadLn(Input, Line); // just read the line
    {..}
  end;

  Close(Input);
end.

Works for me: Line always contains the proper data, encoded as UTF-8.

Quote from GreatCorn:
(also, Classes weighs 200+ KB)

So what?
« Last Edit: April 28, 2020, 06:31:45 pm by furious programming »
Lazarus 4.2 with FPC 3.2.2, Windows 11 — all 64-bit

Working solo on a top-down retro-style action/adventure game (pixel art), programming the engine from scratch, using Free Pascal and SDL3.

lucamar

  • Hero Member
  • *****
  • Posts: 4217
Re: Read file as UTF-8
« Reply #8 on: April 28, 2020, 07:03:22 pm »
Try like this (note: untested):

Code: Pascal
var
  openFile: File of Char;
  fileChar: Char;
  tempStr: RawByteString;
  command: UTF8String;

begin
  Assign(openFile, 'UTF8.txt');
  Reset(openFile);
  tempStr := '';
  while not EOF(openFile) do
  begin
    Read(openFile, fileChar);
    tempStr := tempStr + fileChar;
  end;
  Close(openFile);
  command := tempStr;
  WriteLn(command);
  ReadLn(command);
end.

A little convoluted, but it should ensure that no conversion is applied either when appending chars to tempStr or when assigning tempStr to command.

But frankly, for such a task and unless you're very much constrained by memory/disk considerations I would simply use ReadFileToString() (from unit FileUtil of LazUtils). I'm as lazy as that :)
« Last Edit: April 28, 2020, 07:09:30 pm by lucamar »
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

GreatCorn

Re: Read file as UTF-8
« Reply #9 on: April 28, 2020, 07:20:51 pm »
Using
Code: Pascal
Input: TextFile;
{...}
ReadLn(Input, Line);
didn't work but gave a different result: п>їв-?Вс;
using
Code: Pascal
tempStr: RawByteString;
command: UTF8String;
{...}
command := tempStr;
also didn't work but gave yet another result: я╗┐тЦИ┬й, which is closer to the first one I got  :(

marcov

Re: Read file as UTF-8
« Reply #10 on: April 28, 2020, 07:46:48 pm »
But is the problem in the reading or in the displaying? The Windows console is terrible at UTF-8.
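
One way to answer that question without trusting the console is to dump the file's raw bytes as hexadecimal, since hex digits are plain ASCII and look the same in any code page. A minimal sketch, assuming the file name from the first post; HexStr comes from the System unit, so no extra units are needed.

```pascal
program DumpBytes;

var
  f: file of Byte;
  b: Byte;

begin
  Assign(f, 'UTF8.txt');   // assumed file name, taken from the first post
  Reset(f);
  while not EOF(f) do
  begin
    Read(f, b);
    Write(HexStr(b, 2), ' ');   // two hex digits per byte, ASCII only
  end;
  Close(f);
  WriteLn;
end.
```

For a UTF-8 file containing "█©", the dump should contain E2 96 88 C2 A9, preceded by EF BB BF if the editor wrote a BOM. If those bytes are there, reading is fine and the remaining problem is on the display side.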

lucamar

Re: Read file as UTF-8
« Reply #11 on: April 28, 2020, 07:49:58 pm »
Quote from GreatCorn:
using
Code: Pascal
tempStr: RawByteString;
command: UTF8String;
{...}
command := tempStr;
also didn't work but gave yet another result: я╗┐тЦИ┬й, which is closer to the first one I got  :(

:o
Are you completely sure the file is coded as UTF8? Because I just tested something like that here with a guaranteed UTF8 text file I use for this kind of tests and it worked as it should...

Quote from marcov:
But is the problem in the reading or in the displaying? The Windows console is terrible at UTF-8.

I didn't think of that; it might well be it.

Awkward

Re: Read file as UTF-8
« Reply #12 on: April 28, 2020, 07:56:32 pm »
Quote from GreatCorn:
Quote from Awkward:
objpas.pp (OBJPAS mode)

Procedure AssignFile(out t:Text;p:pchar; aCodePage : TSystemCodePage);

Strangely, my objpas doesn't contain an overload with aCodePage. Is this OS- or version-specific? I've got Win7, Lazarus 2.0.8.

I don't know; I copied it from the FPC trunk sources.

GreatCorn

Re: Read file as UTF-8
« Reply #13 on: April 28, 2020, 08:38:22 pm »
Quote from lucamar:
Are you completely sure the file is coded as UTF8?

The file was saved with Notepad as UTF-8. Notepad++ said it had a BOM, so I set it to plain UTF-8 and resaved it, which gave yet another result (в-?Вс), but not the one I'm trying to achieve.
However, I tried just importing the Clipbrd unit from LCLBase (or even LCLClasses, for testing purposes only) and it actually worked: everything was displayed perfectly. Importing the Crt unit changed the result too (but to another mess of gibberish). Weird.
As for
Quote from marcov:
But is the problem in the reading or in the displaying? The Windows console is terrible at UTF-8.

I made the program write whatever it read, char by char, to an output file. The saved file had everything properly encoded and was pretty much identical to the file it read. But my main concern is displaying the text in the console.
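
On the BOM mentioned above: the я╗┐ prefix in the earlier output is what the three UTF-8 BOM bytes (EF BB BF) look like when shown in code page 866, so a program that reads the file byte by byte has to drop them itself if they are unwanted. A small sketch of that step; StripUtf8Bom is a hypothetical helper name, not an RTL routine.

```pascal
program BomDemo;

{ Hypothetical helper: returns s without a leading UTF-8 BOM (EF BB BF). }
function StripUtf8Bom(const s: RawByteString): RawByteString;
begin
  if (Length(s) >= 3) and (s[1] = #$EF) and (s[2] = #$BB) and (s[3] = #$BF) then
    StripUtf8Bom := Copy(s, 4, Length(s) - 3)
  else
    StripUtf8Bom := s;
end;

var
  raw: RawByteString;

begin
  // BOM followed by the UTF-8 bytes of the two test characters
  raw := #$EF#$BB#$BF#$E2#$96#$88#$C2#$A9;
  WriteLn(Length(raw));                 // 8
  WriteLn(Length(StripUtf8Bom(raw)));   // 5
end.
```

RawByteString is used here so that the RTL does not attempt any code page conversion on the bytes.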
« Last Edit: April 28, 2020, 08:40:50 pm by GreatCorn »

lucamar

Re: Read file as UTF-8
« Reply #14 on: April 28, 2020, 09:12:37 pm »
Quote from GreatCorn:
I made the program write whatever it read, char by char, to an output file. The saved file had everything properly encoded and was pretty much identical to the file it read. But my main concern is displaying the text in the console.

So it is a console problem, then. I'm not much versed anymore in Windows console idiosyncrasies, sorry, but this topic comes up with surprising regularity in the forum; search for "Windows console UTF-8" and you'll probably find some tips to solve or mitigate the problem.

Or wait for an answer from someone more knowledgeable than I am.
:-[
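
A commonly suggested mitigation, sketched here under the assumption of a Windows target and FPC 3.0 or later: switch the console's output code page to UTF-8 with the Win32 call SetConsoleOutputCP, and tell the RTL that Output carries UTF-8 via SetTextCodePage, so that the bytes pass through unconverted. This has not been verified on the Windows 7 setup described in this thread, and whether the glyphs actually render still depends on the console font (raster fonts typically cannot show them).

```pascal
program Utf8Console;

uses
  Windows;   // Windows-only: provides SetConsoleOutputCP

var
  s: RawByteString;

begin
  SetConsoleOutputCP(CP_UTF8);        // console interprets output bytes as UTF-8
  SetTextCodePage(Output, CP_UTF8);   // RTL: do not convert UTF-8 strings on write

  // UTF-8 bytes of the two test characters from the first post.
  s := #$E2#$96#$88#$C2#$A9;
  SetCodePage(s, CP_UTF8, False);     // label the bytes as UTF-8 without converting
  WriteLn(s);
end.
```

The same two calls could precede the file-reading loop from the earlier posts, as long as the string read from the file is labelled or declared as UTF-8. The Windows unit is part of the RTL, so this does not pull in Classes.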

 
