CSVDocument and UTF-8
Leledumbo:
The CSVDocument wiki says that the library can use UTF-8 encoding. However, when I load UTF-8 (Urdu) text, I get broken text instead.
for instance: خبریں
got loaded as: â€Ø®Ø¨Ø±ÛŒÚº
Example program:
--- Code: ---program TestCSVDocument;
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX}{$IFDEF UseCThreads}
  cthreads,
  {$ENDIF}{$ENDIF}
  Classes, SysUtils, CsvDocument
  { you can add units after this };
{$R *.res}
begin
  // Round trip: load the CSV file given as the first argument, then save
  // it unchanged next to the input with a .out.csv extension.
  with TCSVDocument.Create do
  try
    Delimiter := ';';
    LoadFromFile(ParamStr(1));
    SaveToFile(ChangeFileExt(ParamStr(1), '.out.csv'));
  finally
    Free;
  end;
end.
--- End code ---
The resulting .out.csv file is not identical to the given .csv.
BigChimp:
This is a shot in the dark, but shouldn't you add a widestring manager (cwstring) to your uses clause to enable Unicode support?
I seem to remember that the requirements for that differ between Linux and Windows...
I have managed to stay away from Unicode so far, so this might not make sense, but I mention it just in case I'm right and you forgot it.
Zoran:
--- Quote from: Leledumbo on February 28, 2012, 05:21:24 am ---The CSVDocument wiki says that the library can use UTF-8 encoding. However, when I load UTF-8 (Urdu) text, I get broken text instead.
--- End quote ---
Here it works perfectly with the code I copied from your post.
I tested on Windows 7 with FPC 2.6.0, Lazarus 0.9.31 from trunk (updated this morning), and CSVDocument from trunk (updated this morning).
I'm attaching the file I tested with to this post. It is UTF-8 encoded and contains Cyrillic letters, East European and West European Latin letters, and the (Arabic?) word I copied from your post. Does it work for you with this file?
Leledumbo:
--- Quote ---Does it work for you with this file?
--- End quote ---
Hmm... yes, your file works. However, I found something else:
this (1): â€Ø®Ø¨Ø±ÛŒÚº
is how this (2): خبریں
gets displayed using the code page property (I don't know mine, but I guess it is cp1252). So it looks like (2) was read as (1) in that code page during reading, and then (1) was written out as UTF-8 during writing, so the Urdu word is lost.
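To make the suspected mechanism concrete, here is a minimal sketch that reproduces this kind of double encoding by hand. It assumes FPC 3.x string types (RawByteString, UTF8String, UnicodeString) and assumes the code page involved really is cp1252, which is only my guess. The names MisreadAsCp1252, ToHex and Cp1252High are made up for this illustration and are not part of CSVDocument or the RTL.

```pascal
program MojibakeSketch;
{$mode objfpc}{$H+}
uses
  SysUtils;

const
  // cp1252 code points for bytes $80..$9F. The five unassigned bytes are
  // mapped to the C1 control character with the same value.
  Cp1252High: array[$80..$9F] of Word = (
    $20AC, $0081, $201A, $0192, $201E, $2026, $2020, $2021,
    $02C6, $2030, $0160, $2039, $0152, $008D, $017D, $008F,
    $0090, $2018, $2019, $201C, $201D, $2022, $2013, $2014,
    $02DC, $2122, $0161, $203A, $0153, $009D, $017E, $0178);

// Hypothetical helper: treats every byte of a UTF-8 string as one cp1252
// character, then encodes the resulting characters as UTF-8 again.
function MisreadAsCp1252(const Utf8Bytes: RawByteString): UTF8String;
var
  W: UnicodeString;
  I: Integer;
  B: Byte;
begin
  W := '';
  SetLength(W, Length(Utf8Bytes));
  for I := 1 to Length(Utf8Bytes) do
  begin
    B := Ord(Utf8Bytes[I]);
    if (B >= $80) and (B <= $9F) then
      W[I] := WideChar(Cp1252High[B])
    else
      W[I] := WideChar(B);  // cp1252 matches Latin-1 outside $80..$9F
  end;
  Result := UTF8Encode(W);
end;

// Hypothetical helper: renders the bytes of a string as hex digits.
function ToHex(const S: RawByteString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
    Result := Result + IntToHex(Ord(S[I]), 2);
end;

var
  Original: RawByteString;
begin
  // UTF-8 bytes of the first two letters of the Urdu word (U+062E, U+0628).
  Original := #$D8#$AE#$D8#$A8;
  WriteLn('original: ', ToHex(Original));
  WriteLn('garbled:  ', ToHex(MisreadAsCp1252(Original)));
end.
```

The four original bytes D8 AE D8 A8 come out as the eight bytes C3 98 C2 AE C3 98 C2 A8, which is the UTF-8 encoding of the characters Ø ® Ø ¨. That matches the start of the broken text in (1).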
Any idea how to solve this?
Leledumbo:
OK, I think I found the problem in my program. Attached is the self-contained source code. It looks like XMLWrite is causing the problem: when I write each cell into another TStringList and finish with SaveToFile, the encoding is still not UTF-8, so the original value is preserved, whereas WriteXMLFile changes the encoding and loses the original value.
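For anyone hitting the same thing, here is a minimal sketch of one possible workaround. It is not taken from the attached source. It assumes FPC 3.x with the DOM and XMLWrite units, and it assumes that each cell string holds raw UTF-8 bytes. As far as I understand, the DOM API takes wide strings, so a plain AnsiString passed to it is converted implicitly using the system code page. Decoding explicitly with UTF8Decode sidesteps that conversion. The procedure name WriteCellAsXml, the element name 'cell' and the file name cell.xml are made up for this illustration.

```pascal
program XmlUtf8Sketch;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils, DOM, XMLWrite;

// Hypothetical helper: writes one cell value, given as UTF-8 bytes, into a
// small XML file with a single <cell> element.
procedure WriteCellAsXml(const CellUtf8: RawByteString; const FileName: string);
var
  Doc: TXMLDocument;
  Root: TDOMElement;
begin
  Doc := TXMLDocument.Create;
  try
    Root := Doc.CreateElement('cell');
    Doc.AppendChild(Root);
    // Decode the UTF-8 bytes explicitly instead of relying on the implicit
    // AnsiString-to-wide-string conversion, which uses the system code page.
    Root.AppendChild(Doc.CreateTextNode(UTF8Decode(CellUtf8)));
    WriteXMLFile(Doc, FileName);
  finally
    Doc.Free;
  end;
end;

begin
  // UTF-8 bytes of the five letters of the Urdu word, written as byte
  // escapes so the result does not depend on the source file encoding.
  WriteCellAsXml(#$D8#$AE#$D8#$A8#$D8#$B1#$DB#$8C#$DA#$BA, 'cell.xml');
end.
```

Comparing the bytes inside the element in cell.xml with the bytes of the original cell should show whether the value survives the trip through XMLWrite.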