Author Topic: Text file: ANSI vs UTF-8  (Read 22050 times)

asdf

  • Sr. Member
  • ****
  • Posts: 310
Text file: ANSI vs UTF-8
« on: November 13, 2010, 04:28:25 am »
Why does my UTF-8 text file have 3 hidden characters in the first 3 positions of the first line?
They are not there if the file is converted to ANSI.

Lazarus 1.2.4 / Win 32 / THAILAND

asdf

Re: Text file: ANSI vs UTF-8
« Reply #1 on: November 13, 2010, 04:32:30 am »
That's why I also can't use 'ansirightstr' in LZR/FPC.
http://www.lazarus.freepascal.org/index.php/topic,11054.0.html

Does LZR/FPC use only UTF-8?

OpenLieroXor

  • New Member
  • *
  • Posts: 38
Re: Text file: ANSI vs UTF-8
« Reply #2 on: November 13, 2010, 07:27:42 am »
I suppose that's the byte order mark (BOM): in UTF-8 it is the three bytes EF BB BF at the very start of the file (http://en.wikipedia.org/wiki/Byte_order_mark).

> Does LZR/FPC use only UTF-8?

AFAIK the whole LCL uses UTF-8 encoding, but if you need an ANSI representation of a string, you can use the Utf8ToAnsi and AnsiToUtf8 functions.
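To check whether a file really starts with a BOM, something like the following could be used. It is only a minimal sketch: the function name HasUtf8Bom and the file name export.txt are placeholders, not anything from this thread, and the file is assumed to exist.

```pascal
program DetectBom;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: returns True when the file starts with
// the UTF-8 BOM bytes EF BB BF.
function HasUtf8Bom(const FileName: string): Boolean;
var
  fs: TFileStream;
  buf: array[0..2] of Byte;
begin
  Result := False;
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // Files shorter than 3 bytes cannot contain the BOM.
    if fs.Read(buf, SizeOf(buf)) = SizeOf(buf) then
      Result := (buf[0] = $EF) and (buf[1] = $BB) and (buf[2] = $BF);
  finally
    fs.Free;
  end;
end;

begin
  // 'export.txt' is a placeholder file name.
  if HasUtf8Bom('export.txt') then
    WriteLn('File starts with a UTF-8 BOM (3 extra bytes on line 1).')
  else
    WriteLn('No UTF-8 BOM found.');
end.
```

Reading the raw bytes through a stream avoids any string conversion, so the check does not depend on how the text is interpreted later.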

asdf

Re: Text file: ANSI vs UTF-8
« Reply #3 on: November 13, 2010, 11:33:00 am »
Thank you so much :)

asdf

Re: Text file: ANSI vs UTF-8
« Reply #4 on: December 17, 2010, 04:21:31 pm »
In LZR, I have exported data into a text file ...

<<PRODUCT[00000001]PAGE[0109]PARAGRAPH[01]>>aaa
<<PRODUCT[00000001]PAGE[0109]PARAGRAPH[01]>>bbb
<<PRODUCT[00000001]PAGE[0109]PARAGRAPH[01]>>ccc
<<PRODUCT[00000001]PAGE[0109]PARAGRAPH[01]>>ddd
<<PRODUCT[00000001]PAGE[0109]PARAGRAPH[01]>>eee
.
.
.

In another procedure, I used the above text file ...
 
reset(tf1);
while not eof(tf1) do
begin
  readln(tf1, ln);
  // show characters 25..28 (page) and 40..41 (paragraph) of the line
  showmessage(ansirightstr(ansileftstr(trimleft(utf8toansi(ln)), 28), 4)
            + ' / ' + ansirightstr(ansileftstr(trimleft(utf8toansi(ln)), 41), 2));
end;

Why?
1. The result for the first line was "[010 / [0".
2. But for the second line it was "0109 / 01", as needed.

asdf

Re: Text file: ANSI vs UTF-8
« Reply #5 on: December 17, 2010, 04:45:12 pm »
I opened the file in MS Word and saw a hidden character.
How can I write the code without a special case like "if it is the first line, then shift by 1 character"?

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: Text file: ANSI vs UTF-8
« Reply #6 on: December 17, 2010, 05:01:29 pm »
How did you write it to the file? Do the characters appear in Notepad?

Zoran

  • Hero Member
  • *****
  • Posts: 1949
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: Text file: ANSI vs UTF-8
« Reply #7 on: December 17, 2010, 08:35:18 pm »
Add the unit LConvEncoding to your uses clause. It contains many functions for converting between encodings. Among them is UTF8BOMToUTF8, which should solve your problem.
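A rough sketch of how that could look in a read loop, assuming the file is known to start with a BOM. The file name export.txt and the FirstLine flag are placeholders for illustration. The program needs the Lazarus LazUtils package for LConvEncoding.

```pascal
program StripBomFirstLine;
{$mode objfpc}{$H+}

uses
  SysUtils, LConvEncoding;  // LConvEncoding ships with Lazarus (LazUtils)

var
  tf1: TextFile;
  ln: string;
  FirstLine: Boolean;
begin
  AssignFile(tf1, 'export.txt');  // placeholder file name
  Reset(tf1);
  try
    FirstLine := True;
    while not Eof(tf1) do
    begin
      ReadLn(tf1, ln);
      if FirstLine then
      begin
        // Only the first line of a file can carry the BOM,
        // so the conversion is applied exactly once.
        ln := UTF8BOMToUTF8(ln);
        FirstLine := False;
      end;
      WriteLn(ln);
    end;
  finally
    CloseFile(tf1);
  end;
end.
```

Calling the function on the first line only keeps the remaining lines untouched, whatever the function does with strings that have no BOM.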
Swan, ZX Spectrum emulator https://github.com/zoran-vucenovic/swan

asdf

Re: Text file: ANSI vs UTF-8
« Reply #8 on: December 18, 2010, 01:27:51 am »
> How did you write it to the file? Do the characters appear in Notepad?

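assignfile(tfl, fl);
if fileexists(fl) then
  Append(tfl)
else
  rewrite(tfl);
// (the loop over the checklist items is not shown in the post;
//  the index i below is assumed, it was probably eaten by the forum markup)
writeln(tfl, ECheckListBox.Items[i]);
end;
end;
closefile(tfl);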
.
.
.
And there's nothing strange in Notepad.                      

eny

  • Hero Member
  • *****
  • Posts: 1646
Re: Text file: ANSI vs UTF-8
« Reply #9 on: December 18, 2010, 12:14:23 pm »
> And there's nothing strange in Notepad.

IIRC Notepad suppresses the BOM and will not display it.
All posts based on: Win10 (Win64); Lazarus 3_4  (x64) 25-05-2024 (unless specified otherwise...)

asdf

Re: Text file: ANSI vs UTF-8
« Reply #10 on: December 18, 2010, 02:30:38 pm »
while not eof(tf1) do
begin
  readln(tf1, ln);
  ln := utf8bomtoutf8(ln);
.
.
.
writeln(tf2,StuffString(ln, 1, 54, ''));

It worked correctly only for the first line.
Why did it remove 57 characters instead of 54 in every line from the second line to the end?
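One possible explanation, which is an assumption and not confirmed in this thread: 57 is exactly 54 + 3, which is what you would see if three bytes were removed from every line before StuffString ran, for example if utf8bomtoutf8 in this version drops the first three bytes without checking that they are a BOM. A guarded helper avoids that. The sketch below is plain FPC; StripUtf8Bom is a hypothetical name, not a library function.

```pascal
program GuardedBomStrip;
{$mode objfpc}{$H+}

const
  Utf8Bom = #$EF#$BB#$BF;  // the three BOM bytes

// Hypothetical helper: removes the BOM only when it is really there,
// so it is safe to call on every line.
function StripUtf8Bom(const s: string): string;
begin
  if Copy(s, 1, 3) = Utf8Bom then
    Result := Copy(s, 4, Length(s))
  else
    Result := s;
end;

var
  WithBom, WithoutBom: string;
begin
  WithBom := Utf8Bom + '<<PRODUCT[00000001]';
  WithoutBom := '<<PRODUCT[00000001]';
  // Both calls leave the same visible text behind.
  WriteLn(Length(StripUtf8Bom(WithBom)));
  WriteLn(Length(StripUtf8Bom(WithoutBom)));
  // An unguarded strip eats 3 real characters of a line without a BOM.
  WriteLn(Length(Copy(WithoutBom, 4, Length(WithoutBom))));
end.
```

The first two lengths are equal, while the last one is 3 shorter, which matches the 54 versus 57 difference described above.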

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12256
  • FPC developer.
Re: Text file: ANSI vs UTF-8
« Reply #11 on: December 18, 2010, 04:07:18 pm »
> That's why I also can't use 'ansirightstr' in LZR/FPC.
> http://www.lazarus.freepascal.org/index.php/topic,11054.0.html
>
> Does LZR/FPC use only UTF-8?

FPC uses the encoding of the system. There is a small problem under Windows, because there are three default encodings:

- OEM encoding for the textmode console
- ansi encoding for most 1-byte API functions (can be UTF-8, but typically isn't)
- UTF-16 for the 2-byte API functions (the so-called "wide" or -W functions)

In general FPC uses the ansi encoding on Windows, and inserts oem2ansi conversions where necessary. It also usually calls the 1-byte -A API functions.

Lazarus assumes all 1-byte strings (ansistring) contain UTF-8. This means ansi input from the system needs to be converted. (The same goes for OEM, but the textmode console is less important for Lazarus, and the FPC units already abstract that.)

Lazarus (probably) tries to use the 2-byte -W functions as much as possible; that is not a problem, because UTF-8 can be converted into 2-byte UTF-16 without information loss.
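The lossless conversion can be tried with the RTL functions UTF8Encode and UTF8Decode. This is only a small sketch in plain FPC, with a Thai letter picked as an arbitrary test character. It does not show how Lazarus does the conversion internally.

```pascal
program Utf8RoundTrip;
{$mode objfpc}{$H+}

var
  w, w2: WideString;
  u8: UTF8String;
begin
  // Build a UTF-16 string: Thai letter KO KAI (U+0E01) followed by 'a'.
  SetLength(w, 2);
  w[1] := WideChar($0E01);
  w[2] := WideChar(Ord('a'));

  u8 := UTF8Encode(w);   // UTF-16 -> UTF-8
  w2 := UTF8Decode(u8);  // UTF-8 -> UTF-16

  // The Thai letter takes 3 bytes in UTF-8 and 'a' takes 1.
  WriteLn('UTF-16 code units: ', Length(w));
  WriteLn('UTF-8 bytes      : ', Length(u8));
  WriteLn('Round trip equal : ', w2 = w);
end.
```

The byte count changes between the two forms, but the text that comes back is the same as the text that went in.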

Older Delphi versions (up to and including D2007) are like FPC: they assume the default ansi encoding.

Newer Unicode Delphi versions (Delphi 2009 and later) are almost entirely 2-byte throughout. Moreover, they allow an encoding (UTF-8, default ansi or OEM) to be attached to ansistrings, so that conversions are done automatically when the strings are properly typed.


A project to implement such functionality (as in D2009+) in FPC has started but is stalled.


asdf

Re: Text file: ANSI vs UTF-8
« Reply #12 on: December 18, 2010, 06:06:22 pm »
while not eof(tf1) do
begin
readln(tf1,ln);
ln:=utf8tocp874(ln);

Using utf8tocp874, it worked very well :D
Could anybody explain this function to me?
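For what it is worth, here is a small sketch of what the function does: UTF8ToCP874 from LConvEncoding converts UTF-8 text to the single-byte Windows Thai code page 874, and CP874ToUTF8 goes the other way. A Thai letter takes 3 bytes in UTF-8 but only 1 byte in CP874, so byte-based functions such as ansileftstr count Thai letters correctly after the conversion. The test character is an arbitrary choice, and the program needs the Lazarus LazUtils package.

```pascal
program Cp874Demo;
{$mode objfpc}{$H+}

uses
  LConvEncoding;  // part of Lazarus (LazUtils package)

var
  u8, thai, back: string;
begin
  // Thai letter KO KAI (U+0E01): bytes E0 B8 81 in UTF-8.
  u8 := #$E0#$B8#$81;
  thai := UTF8ToCP874(u8);    // UTF-8 -> Windows Thai code page 874
  back := CP874ToUTF8(thai);  // and back to UTF-8

  // Only lengths are printed, so the console encoding does not matter.
  WriteLn('UTF-8 length : ', Length(u8));
  WriteLn('CP874 length : ', Length(thai));
  WriteLn('Back to UTF-8: ', Length(back));
end.
```

After the conversion one Thai letter is one byte, so fixed positions in a line mean the same thing for Thai text as for plain ASCII.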

 
