Author Topic: Read file as UTF-8  (Read 7138 times)

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Read file as UTF-8
« Reply #15 on: April 28, 2020, 09:51:35 pm »
Hi!

It is a console problem.

To write the UTF8 chars on the console you have to convert them to ANSI:

Code: Pascal
var
  UTF8file: TextFile;
  s: string;

...
ReadLn(UTF8file, s);
s := UTF8ToAnsi(s);
WriteLn(s);

The Windows Console is too dumb for UTF8.
Don't know if that changed in Win10.

Win7/64

Winni

eljo

  • Sr. Member
  • ****
  • Posts: 468
Re: Read file as UTF-8
« Reply #16 on: April 28, 2020, 10:03:46 pm »
Quote from: winni
To write the UTF8 chars on the console you have to convert them to ANSI:
[...]
The Windows Console is too dumb for UTF8.

Use SetConsoleOutputCP to make the console UTF-8 compatible.
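
A minimal sketch of that suggestion, assuming Windows and FPC 3.0 or later (where SetTextCodePage and code-page-aware strings are available). The file name input.txt is hypothetical. UTF8ToAnsi is deliberately not called here, because the console is expected to receive UTF-8 bytes directly. The console font also has to contain the glyphs; raster fonts lack most of them.

```pascal
program PrintUtf8File;
{$mode objfpc}{$H+}

uses
  Windows; // SetConsoleOutputCP, GetConsoleOutputCP

var
  F: TextFile;
  Line: UTF8String;
  OldCP: UINT;
begin
  // Remember the current console code page so it can be restored on exit.
  OldCP := GetConsoleOutputCP;
  SetConsoleOutputCP(CP_UTF8);

  // Tell the RTL that Output takes UTF-8, so it does not transcode the text.
  SetTextCodePage(Output, CP_UTF8);

  Assign(F, 'input.txt'); // hypothetical file name
  Reset(F);
  // Declare the file contents as UTF-8 so ReadLn does not reinterpret them.
  SetTextCodePage(F, CP_UTF8);

  while not Eof(F) do
  begin
    ReadLn(F, Line);
    WriteLn(Line);
  end;
  Close(F);

  SetConsoleOutputCP(OldCP);
end.
```

Whether this displays every character still depends on the console font and the Windows version, so treat it as a starting point rather than a guaranteed fix.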

GreatCorn

  • New Member
  • *
  • Posts: 40
    • GreatCorn
Re: Read file as UTF-8
« Reply #17 on: April 30, 2020, 10:21:13 am »
Quote from: winni
Code: Pascal
var
  UTF8file: TextFile;
  s: string;

...
ReadLn(UTF8file, s);
s := UTF8ToAnsi(s);
WriteLn(s);
Thank you, this almost works, but symbols like %u2588 (█) or "%u03BF%u03CD%u03B6%u03BF" (ούζο) get turned into dashes or question marks, though it's better than nothing. Also, a UTF-8 file with a BOM puts a question mark at the start.
Quote from: eljo
Use SetConsoleOutputCP to make the console UTF-8 compatible.
This somehow makes it understand even fewer characters, as a lot of them get turned into whitespace.
Edit: it seems this forum's message edit feature also doesn't understand special characters.
« Last Edit: April 30, 2020, 10:24:34 am by GreatCorn »
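
The question mark at the start is most likely the UTF-8 byte order mark (bytes EF BB BF), which has no ANSI equivalent. A small sketch of one way to drop it from the first line before printing, assuming FPC 3.0 or later; the helper name StripUtf8Bom and the file name input.txt are hypothetical:

```pascal
program StripBomDemo;
{$mode objfpc}{$H+}

// Removes a leading UTF-8 byte order mark (EF BB BF), if present.
procedure StripUtf8Bom(var S: UTF8String);
begin
  if (Length(S) >= 3) and
     (Ord(S[1]) = $EF) and (Ord(S[2]) = $BB) and (Ord(S[3]) = $BF) then
    Delete(S, 1, 3);
end;

var
  F: TextFile;
  Line: UTF8String;
begin
  Assign(F, 'input.txt'); // hypothetical file name
  Reset(F);
  // Treat the file contents as UTF-8 so the bytes are read unchanged.
  SetTextCodePage(F, CP_UTF8);

  // Only the first line of a file can carry the BOM.
  if not Eof(F) then
  begin
    ReadLn(F, Line);
    StripUtf8Bom(Line);
    WriteLn(Line);
  end;

  while not Eof(F) do
  begin
    ReadLn(F, Line);
    WriteLn(Line);
  end;
  Close(F);
end.
```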

GreatCorn

  • New Member
  • *
  • Posts: 40
    • GreatCorn
Re: Read file as UTF-8
« Reply #18 on: April 30, 2020, 12:49:49 pm »
Importing the Crt unit breaks UTF8ToAnsi and it starts outputting wrong characters or pseudographics again. I need the Crt unit for stuff like ClrScr, so I can't leave it out. I was willing to accept UTF8ToAnsi not getting all of the characters, but this just brings the whole thing back down to a zero.

Bart

  • Hero Member
  • *****
  • Posts: 5689
    • Bart en Mariska's Webstek
Re: Read file as UTF-8
« Reply #19 on: April 30, 2020, 02:34:07 pm »
The crt unit doesn't break AnsiToUtf8.
The crt unit, however, resets the console's codepage upon each and every write, so that is a bit problematic.

Bart

lucamar

  • Hero Member
  • *****
  • Posts: 4217
Re: Read file as UTF-8
« Reply #20 on: April 30, 2020, 04:03:15 pm »
Quote from: GreatCorn
Importing the Crt unit breaks UTF8ToAnsi and it starts outputting wrong characters or pseudographics again.

As the documentation states:
Quote from: Reference for unit 'Crt' (#rtl)
There are some caveats when using the CRT unit:
  [...]
  * The CRT unit stems from the TP/Dos area. It is designed to work with single-byte character sets, where 1 char = 1 byte. That means that widestrings or UTF-8 encoded (ansi)strings will not work correctly.

Quote
I need the Crt unit for stuff like ClrScr, so I can't leave it out.

I don't know whether it still works in Free Pascal or how portable it is, but there was an old trick in TP times that let you use (most of) the functions of CRT without sacrificing interoperability with the OS:
Code: Pascal
Assign(Input, ''); Reset(Input);
Assign(Output, ''); Rewrite(Output);
That allowed you to, for example, use IO redirection from the command line while still being able to use GotoXY, ClrScr, Sound, Delay, etc.
« Last Edit: April 30, 2020, 08:27:24 pm by lucamar »
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

GreatCorn

  • New Member
  • *
  • Posts: 40
    • GreatCorn
Re: Read file as UTF-8
« Reply #21 on: April 30, 2020, 07:15:39 pm »
Quote from: lucamar
Code: Pascal
Assign(Input, '');
Assign(Output, '');
Assigning Output to '' and calling Rewrite (without Rewrite I get runtime error 103) sadly has the same effect as importing Crt, so when the unit is used it makes no difference. When the unit isn't used, the procedure makes the output pretty much the same as it would be with Crt in use.
File: rosé, водка and ούζο█
No Crt, no Assign+Rewrite: ?rose, водка and ????-
Crt in use, no Assign+Rewrite: ?rose, тюфър and ????-
Assign+Rewrite, no Crt: ?rose, тюфър and ????-
Couldn't assign Input as it would give Runtime error 103 no matter what...

lucamar

  • Hero Member
  • *****
  • Posts: 4217
Re: Read file as UTF-8
« Reply #22 on: April 30, 2020, 08:35:43 pm »
I forgot to add Reset(Input) and Rewrite(Output), sorry. :-[

The question is whether, after doing that, SetConsoleOutputCP or UTF8ToAnsi or whatever work as they should.

Note that if your console's codepage doesn't have the proper characters (say, using Cyrillic characters in a WIN-1252 console), the results will be as you describe no matter what. The "?" just means that there is no equivalent character in that codepage.
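
To see which codepages are actually in play, a small diagnostic sketch along these lines may help (Windows only; DefaultSystemCodePage assumes FPC 3.0 or later). As a hint, "водка" showing up as "тюфър" is consistent with Windows-1251 bytes being displayed by a console that uses OEM codepage 866.

```pascal
program ShowCodePages;
{$mode objfpc}{$H+}

uses
  Windows; // GetConsoleOutputCP, GetConsoleCP, GetACP, GetOEMCP

begin
  // Code page the console currently uses to interpret bytes written to it.
  WriteLn('Console output code page:  ', GetConsoleOutputCP);
  // Code page the console uses for keyboard input.
  WriteLn('Console input code page:   ', GetConsoleCP);
  // System-wide ANSI and OEM code pages.
  WriteLn('ANSI code page (GetACP):   ', GetACP);
  WriteLn('OEM code page (GetOEMCP):  ', GetOEMCP);
  // Code page the FPC RTL assumes for plain "string" data.
  WriteLn('RTL DefaultSystemCodePage: ', DefaultSystemCodePage);
end.
```

If the console output code page and the code page of the bytes you write differ, some translation step (UTF8ToAnsi, SetConsoleOutputCP, or the RTL's own conversion) has to bridge the two, and only one of them should be doing so.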

eljo

  • Sr. Member
  • ****
  • Posts: 468
Re: Read file as UTF-8
« Reply #23 on: April 30, 2020, 08:50:38 pm »
Quote from: GreatCorn
Assigning Output to '' and rewriting [...] sadly has the same effect as importing Crt [...]
Couldn't assign Input as it would give Runtime error 103 no matter what...

Are you on Windows? If yes, then you should call SetConsoleOutputCP and set the output to UTF-8. Keep in mind that the OEM character set the console uses by default supports neither Unicode nor ANSI; it only supports ASCII plus a second code page, in the upper half of the byte range, for a single language. If you keep the default code page, you will only be able to use English and one extra language, depending on your country.

I remember seeing a console class somewhere that could be used instead of crt on Windows, but I can't remember where or who the author was, just that it was written for Delphi, not FPC.

 
