CSVDocument and UTF-8

Leledumbo:
The CSVDocument wiki says that the library can handle UTF-8 encoding. However, when I load UTF-8 (Urdu) text, I get broken text instead.

for instance: ‏خبریں
got loaded as: ‏خبریں

Example program:

--- Code: ---program TestCSVDocument;
 
{$mode objfpc}{$H+}
 
uses
  {$IFDEF UNIX}{$IFDEF UseCThreads}
  cthreads,
  {$ENDIF}{$ENDIF}
  Classes, SysUtils, CsvDocument
  { you can add units after this };
 
{$R *.res}
 
begin
  with TCSVDocument.Create do
    try
      Delimiter := ';';
      LoadFromFile(ParamStr(1));
      SaveToFile(ChangeFileExt(ParamStr(1),'.out.csv'));
    finally
      Free;
    end;
end.
--- End code ---
The resulting .out.csv file is not identical to the given .csv.

BigChimp:
Shot in the dark, but shouldn't you add a widestring manager (cwstring) to your uses clause to enable Unicode support?
I seem to remember that the requirements for this differ between Linux and Windows...

I have managed to stay away from Unicode up to now, so this might not make sense, but I mention it just in case I'm right and you forgot it.
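
For reference, here is a minimal sketch of what the cwstring suggestion would look like when applied to the test program above. The program name is made up for illustration, and nothing in this thread confirms that the change fixes the problem; it only shows where the unit would go. The cwstring unit is Unix-specific, hence the IFDEF.

```pascal
program TestCSVDocumentCW; // hypothetical name, otherwise the test program from the first post

{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}
  cwstring, // installs a widestring manager on Unix-like systems; list it early
  {$ENDIF}
  Classes, SysUtils, CsvDocument;

begin
  with TCSVDocument.Create do
    try
      Delimiter := ';';
      // Round-trip the file named on the command line, as in the original test.
      LoadFromFile(ParamStr(1));
      SaveToFile(ChangeFileExt(ParamStr(1), '.out.csv'));
    finally
      Free;
    end;
end.
```

On Windows no extra unit is involved, which matches the remark that the requirements differ between the two platforms.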

Zoran:

--- Quote from: Leledumbo on February 28, 2012, 05:21:24 am ---The CSVDocument wiki says that the library can handle UTF-8 encoding. However, when I load UTF-8 (Urdu) text, I get broken text instead.
[...]
The resulting .out.csv file is not identical to the given .csv.

--- End quote ---

Here it works perfectly with the code I copied from your post.
I tested on Windows 7 with FPC 2.6.0, Lazarus 0.9.31 from trunk (updated this morning), and CSVDocument from trunk (updated this morning).

I'm attaching the file I tested with. It is UTF-8 encoded and contains Cyrillic letters, East European and West European Latin letters, and the (Arabic?) word I copied from your post. Does it work for you with this file?

Leledumbo:

--- Quote ---Does it work for you with this file?
--- End quote ---
Hmm... yes, your file works. However, I found something else:

this (1): ‏خبریں
is how this (2): ‏خبریں

gets displayed in my system code page (I don't know which one it is, but I guess cp1252). So it looks like (2) was read as (1), i.e. interpreted in that code page, and then (1) was written out as UTF-8, so the Urdu word is gone.

Any idea how to solve this?
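
The "read in a single-byte code page, written back as UTF-8" effect described above can be reproduced without any library. The sketch below is an illustration only: the function name is made up, and it uses Latin-1 as a simplified stand-in for cp1252 (the two differ in the $80..$9F range, and cp1252 itself is only a guess in the post). It treats every byte of the UTF-8 input as one character and encodes that character as UTF-8 again.

```pascal
program DoubleEncodeDemo;

{$mode objfpc}{$H+}

// Hypothetical helper: treat each byte of S as a Latin-1 code point
// and encode that code point as UTF-8.
function Latin1BytesToUTF8(const S: AnsiString): AnsiString;
var
  i, n: Integer;
  b: Byte;
begin
  SetLength(Result, 2 * Length(S)); // worst case: every byte becomes two
  n := 0;
  for i := 1 to Length(S) do
  begin
    b := Ord(S[i]);
    if b < $80 then
    begin
      // ASCII bytes are unchanged in UTF-8.
      Inc(n);
      Result[n] := Chr(b);
    end
    else
    begin
      // Code points U+0080..U+00FF need two bytes in UTF-8.
      Result[n + 1] := Chr($C0 or (b shr 6));
      Result[n + 2] := Chr($80 or (b and $3F));
      Inc(n, 2);
    end;
  end;
  SetLength(Result, n);
end;

const
  // UTF-8 bytes of the five Urdu letters from the first post
  // (the leading right-to-left mark is left out).
  Original: AnsiString = #$D8#$AE#$D8#$A8#$D8#$B1#$DB#$8C#$DA#$BA;
var
  Mangled: AnsiString;
begin
  Mangled := Latin1BytesToUTF8(Original);
  WriteLn('original bytes: ', Length(Original)); // 10
  WriteLn('mangled bytes:  ', Length(Mangled));  // 20
end.
```

Every byte of the Urdu word is above $7F, so each one turns into two bytes. The mangled file is still valid UTF-8, but it now contains ten Latin characters instead of five Urdu letters.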

Leledumbo:
OK, I think I found the problem in my program. Attached is the self-contained source code. It looks like XMLWrite is causing the problem: when I write each cell into another TStringList and call SaveToFile, the encoding is still not UTF-8, so the original value is preserved, whereas WriteXMLFile changes the encoding and loses the original value.
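
One possible explanation, offered as an assumption because the attached source is not shown here: the FPC DOM classes store text as DOMString (a wide string type), so a UTF-8 encoded AnsiString passed to them directly is converted implicitly using the system code page. The sketch below avoids the implicit conversion by calling UTF8Decode before the text reaches the DOM. The element names, the output file name, and the hard-coded cell value are hypothetical.

```pascal
program CellsToXml;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, XMLWrite;

var
  Doc: TXMLDocument;
  Root, Cell: TDOMElement;
  CellUtf8: AnsiString;
begin
  // Stand-in for a value read from TCSVDocument: the UTF-8 bytes of the Urdu word.
  CellUtf8 := #$D8#$AE#$D8#$A8#$D8#$B1#$DB#$8C#$DA#$BA;

  Doc := TXMLDocument.Create;
  try
    Root := Doc.CreateElement('row');   // hypothetical element name
    Doc.AppendChild(Root);

    Cell := Doc.CreateElement('cell');  // hypothetical element name
    // Decode UTF-8 explicitly instead of relying on the implicit
    // AnsiString-to-DOMString conversion.
    Cell.AppendChild(Doc.CreateTextNode(UTF8Decode(CellUtf8)));
    Root.AppendChild(Cell);

    WriteXMLFile(Doc, 'cells.xml');     // hypothetical output file
  finally
    Doc.Free;
  end;
end.
```

If the attached program assigns the cell strings to DOM nodes without such a decoding step, that would be consistent with TStringList.SaveToFile preserving the bytes while WriteXMLFile does not.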
