Author Topic: to read an UTF8 text file  (Read 16780 times)

jormik

  • New member
  • *
  • Posts: 7
    • parolescritte
to read an UTF8 text file
« on: July 14, 2014, 06:00:36 pm »
I can't correctly read characters from a text file encoded in UTF-8. I have made many attempts with the functions of FPC (for example with LazUTF8), but without success.
Must I write my own function to handle the variable number of bytes in the UTF-8 encoding?
Thanks.

skalogryz

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2770
    • havefunsoft.com
Re: to read an UTF8 text file
« Reply #1 on: July 14, 2014, 06:06:00 pm »
I think you're reading the file just fine. The problem is probably with outputting the results.

What are you trying to achieve?

jormik

Re: to read an UTF8 text file
« Reply #2 on: July 14, 2014, 06:53:15 pm »
I am trying to port my old Firebird + Delphi 6 project to a new UTF-8 Firebird + Lazarus project. Everything has gone fine except reading the text files.

I must scan text files (formerly ANSI, now UTF-8) character by character, and then build the appropriate strings to populate the db. The problem is the variable number of bytes per character in UTF-8, which calls for a dedicated procedure. LazUTF8 does this job, but not for files; it only helps to obtain, for example, Unicode code points (which I use in other parts of the project).

skalogryz

Re: to read an UTF8 text file
« Reply #3 on: July 14, 2014, 07:05:46 pm »
hmm... could you please provide an example?

jormik

Re: to read an UTF8 text file
« Reply #4 on: July 14, 2014, 09:41:07 pm »
This is the skeleton of a procedure that inserts words into a table of the database.

============================================
procedure example;
var
  t: integer;
  doc: TextFile;
  character: char;
  word: string;
begin
  // Stop if the user cancels the dialog
  if not OpenDialog.Execute then
    Exit;
  AssignFile(doc, OpenDialog.FileName);
  Reset(doc);
  word := '';
  for t := 1 to 100 do
  begin
    // Each read returns one byte, not one Unicode character
    Read(doc, character);
    if character = ' ' then
    begin
      // write word in the database
      word := '';
    end
    else
      word := word + character;
  end;
  CloseFile(doc);
end;
============================================

It works fine with ANSI text files, where a character always corresponds to one byte. How can I obtain the same result with UTF-8 text files, where a character does NOT always correspond to one byte?

skalogryz

Re: to read an UTF8 text file
« Reply #5 on: July 14, 2014, 10:01:14 pm »
is there a chance that the issue is with database collation?

no wait. I'm being stupid here.

actually, yes. it still should work. what's the collation?
« Last Edit: July 14, 2014, 10:06:12 pm by skalogryz »

jormik

Re: to read an UTF8 text file
« Reply #6 on: July 14, 2014, 11:27:09 pm »
I'm not sure I have understood, but I don't think there is a problem with the database collation.

I need a function that recognizes, while reading a UTF-8 text file, whether to load a group of one, two, three or four bytes from the file to represent the corresponding Unicode character.
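
For reference, the kind of function asked for here can be sketched from the UTF-8 bit patterns alone: the first byte of a sequence says how many bytes belong to it. The name Utf8SeqLen is hypothetical and is not taken from LazUTF8 or from the uploaded project; it is only a minimal illustration and does not reject overlong or otherwise malformed sequences.

```pascal
program Utf8LeadByte;
{$mode objfpc}{$H+}

// Hypothetical helper (not a LazUTF8 routine): returns how many bytes
// the UTF-8 sequence starting with lead byte B occupies, or 0 if B
// cannot start a sequence.
function Utf8SeqLen(B: Byte): Integer;
begin
  if B < $80 then
    Result := 1                    // 0xxxxxxx: plain ASCII
  else if (B and $E0) = $C0 then
    Result := 2                    // 110xxxxx
  else if (B and $F0) = $E0 then
    Result := 3                    // 1110xxxx
  else if (B and $F8) = $F0 then
    Result := 4                    // 11110xxx
  else
    Result := 0;                   // continuation byte (10xxxxxx) or invalid
end;

begin
  WriteLn(Utf8SeqLen(Ord('a')));   // prints 1
  WriteLn(Utf8SeqLen($C3));        // prints 2
  WriteLn(Utf8SeqLen($E2));        // prints 3
  WriteLn(Utf8SeqLen($F0));        // prints 4
  WriteLn(Utf8SeqLen($A0));        // prints 0 (continuation byte)
end.
```

A reader loop would read one byte, call the helper, then read the remaining bytes of the sequence before treating the group as one character.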

skalogryz

Re: to read an UTF8 text file
« Reply #7 on: July 15, 2014, 05:06:17 am »
Looking at the code of your function, I don't see a specific need for you to recognize multi-byte characters, because you're only looking for the space character. The space character is a single byte in UTF-8 as well.
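
A small sketch of why this holds: every byte of a multi-byte UTF-8 sequence has its high bit set, so the byte $20 can only ever be a real space. The sample text and the variable names below are assumptions made for illustration; the non-ASCII letters are written as raw bytes so the result does not depend on the source file encoding.

```pascal
program SplitUtf8Words;
{$mode objfpc}{$H+}

var
  Line, Token: string;
  i: Integer;
begin
  // 'città è bella' as raw UTF-8 bytes: à = C3 A0, è = C3 A8
  Line := 'citt'#$C3#$A0' '#$C3#$A8' bella';
  Token := '';
  // Walk the string byte by byte, exactly like reading Char by Char
  for i := 1 to Length(Line) do
    if Line[i] = ' ' then
    begin
      WriteLn(Length(Token));      // byte length of the finished word
      Token := '';
    end
    else
      Token := Token + Line[i];    // multi-byte sequences stay intact
  if Token <> '' then
    WriteLn(Length(Token));
  // prints 6, 2 and 5: the words are split correctly and keep all their bytes
end.
```

The words come out whole because the bytes are only ever appended in order; nothing in the loop needs to know where one character ends and the next begins.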

Now, back to your problem. What I suspect is that the issue is either in the db collation OR in the db component that you're using.

I don't know how you actually push the word to the database, but I'd suggest you try the following:
Code:
  for t := 1 to 100 do
  begin
    read(doc, character);
    if character = ' ' then
    begin
      // Convert the complete UTF-8 word, not the individual bytes
      word := Utf8ToAnsi(word);
      // write word in the database
      word := ''
    end
    else
      word := word + character;
  end;

jormik

Re: to read an UTF8 text file
« Reply #8 on: July 15, 2014, 10:27:03 am »
The db is OK. I converted it to UTF-8 encoding, so now, when I put a field with non-ANSI characters from a Firebird table into a Lazarus text edit, and vice versa, everything works fine.

The problem arises with UTF-8/Unicode files.

Please take a look at the uploaded project (summarized here), compile it and read the two encoded text files (exUTF8.txt and exUnicode.txt). I followed your suggestion (using the Utf8ToAnsi function), but the result is wrong.

Code:
procedure TForm1.ButtonClick(Sender: TObject);
var
  doc: TextFile;
  character: Char;
begin
  Edit1.Text:='';
  Edit2.Text:='';
  memo.Text:='';
  OpenDialog.Execute;
  AssignFile(doc, OpenDialog.FileName);
  reset(doc);
  while not EOF(doc) do
    begin
      read(doc,character);
      Edit1.Text:=Edit1.Text+character;
      Edit2.Text:=Edit2.Text+Utf8ToAnsi(character);
      memo.Text:=memo.Text+IntToStr(ord(character))+chr(13);
    end;
  CloseFile(doc);
end;

skalogryz

Re: to read an UTF8 text file
« Reply #9 on: July 15, 2014, 02:20:45 pm »
Quote:
Please take a look at the uploaded project (summarized here), compile it and read the two encoded text files (exUTF8.txt and exUnicode.txt). I followed your suggestion (using the Utf8ToAnsi function), but the result is wrong.
...
yes, they're wrong, because you've changed the code completely!

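Originally, you were building a word out of individual characters and then pushing the complete word (which keeps the whole UTF-8 encoding intact).
In this project, you're forcibly breaking the UTF-8 encoded word into separate single-byte characters and converting each one on its own, so Utf8ToAnsi never sees a complete multi-byte sequence.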

I've updated the project (pretty much returned to the original code you posted).
As you can see, UTF8 now loads fine. (I've also added BOM skipping to the code).
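
The updated project itself is only in the attachment, so the following is not that code, just a minimal sketch of what BOM skipping can look like. The helper name StripUtf8Bom is hypothetical; the only fact it relies on is that a UTF-8 byte order mark is the three bytes EF BB BF at the very start of the file.

```pascal
program SkipBom;
{$mode objfpc}{$H+}

// Hypothetical helper: drops a leading UTF-8 byte order mark
// (bytes EF BB BF) and returns everything else unchanged.
function StripUtf8Bom(const S: string): string;
begin
  if (Length(S) >= 3) and (S[1] = #$EF) and (S[2] = #$BB) and (S[3] = #$BF) then
    Result := Copy(S, 4, Length(S) - 3)
  else
    Result := S;
end;

begin
  WriteLn(StripUtf8Bom(#$EF#$BB#$BF'abc'));  // prints abc
  WriteLn(StripUtf8Bom('abc'));              // prints abc
end.
```

When reading Char by Char, the same check can be applied to the first three bytes read before any word is started.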

Unicode loading will require a different approach.
You will either need to read "WideChars" and build them into a "WideString" (which is straightforward),
or read "char" by "char" and build a buffer that is then turned into a "WideString".
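
A rough sketch of the first option, under several assumptions that the thread does not confirm: that the "Unicode" file is UTF-16 little-endian (what Windows editors usually mean by that label), that the program runs on a little-endian machine, and that the whole file fits in memory. The function name LoadUtf16LEAsUtf8 is hypothetical, and UnicodeString is used here in place of WideString.

```pascal
program ReadUtf16;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: loads a UTF-16 LE file into a UnicodeString
// and returns its contents converted to UTF-8.
function LoadUtf16LEAsUtf8(const FileName: string): UTF8String;
var
  fs: TFileStream;
  ws: UnicodeString;
  n: Integer;
begin
  ws := '';
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // Number of complete 16-bit code units in the file
    n := Integer(fs.Size div SizeOf(WideChar));
    SetLength(ws, n);
    if n > 0 then
      fs.ReadBuffer(ws[1], n * SizeOf(WideChar));
  finally
    fs.Free;
  end;
  // Drop the byte order mark U+FEFF if the file starts with one
  if (Length(ws) > 0) and (ws[1] = WideChar($FEFF)) then
    Delete(ws, 1, 1);
  Result := UTF8Encode(ws);
end;

begin
  // Usage: pass the file name as the first command-line argument
  if ParamCount >= 1 then
    WriteLn(LoadUtf16LEAsUtf8(ParamStr(1)));
end.
```

Once the text is UTF-8, the same space-splitting loop used for the UTF-8 file can be reused unchanged.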

jormik

Re: to read an UTF8 text file
« Reply #10 on: July 15, 2014, 06:07:04 pm »
Great!
Thank you.
