Author Topic: CSVDocument and UTF-8  (Read 8128 times)

Leledumbo

  • Hero Member
  • *****
  • Posts: 8836
  • Programming + Glam Metal + Tae Kwon Do = Me
CSVDocument and UTF-8
« on: February 28, 2012, 05:21:24 am »
The CSVDocument wiki says that the library can handle UTF-8 encoding. However, when I load UTF-8 (Urdu) text, I get broken text instead.

for instance: ‏خبریں
got loaded as: ‏خبریں

Example program:
Code:
program TestCSVDocument;
 
{$mode objfpc}{$H+}
 
uses
  {$IFDEF UNIX}{$IFDEF UseCThreads}
  cthreads,
  {$ENDIF}{$ENDIF}
  Classes, SysUtils, CsvDocument
  { you can add units after this };
 
{$R *.res}
 
begin
  with TCSVDocument.Create do
    try
      Delimiter := ';';
      LoadFromFile(ParamStr(1));
      SaveToFile(ChangeFileExt(ParamStr(1),'.out.csv'));
    finally
      Free;
    end;
end.
The resulting .out.csv file is not identical to the given .csv.

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: CSVDocument and UTF-8
« Reply #1 on: February 28, 2012, 11:34:42 am »
Shot in the dark, but shouldn't you add a widestring manager (cwstring) to your uses clause to enable Unicode support?
I seem to remember the requirements for that between Linux and Windows differed...

I have managed to stay away from Unicode up to now, so it might not make sense, but just in case I'm right & you forgot it.
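
For reference, a minimal sketch of what that suggestion looks like in a program's uses clause. It assumes a Unix-like target, where the cwstring unit is available; the program name is made up for illustration, and (as the rest of the thread shows) this was not the cause of the problem discussed here.

```pascal
program CwStringUses;  // hypothetical program name

{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}
  cwstring,  // installs a widestring manager; list it early in the main program
  {$ENDIF}
  SysUtils;

begin
  // From here on, ansistring <-> widestring conversions go through the
  // widestring manager installed above (on Unix-like systems).
  WriteLn('Program started');
end.
```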

Zoran

  • Hero Member
  • *****
  • Posts: 1988
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: CSVDocument and UTF-8
« Reply #2 on: February 28, 2012, 01:23:47 pm »
Here it works perfectly with the code I copied from your post.
I tested on Windows 7, FPC 2.6.0, Lazarus 0.9.31 from trunk (this morning update), CSVDocument from trunk (this morning update).

I'm attaching the file I tested with to this post. It is UTF-8 encoded and contains Cyrillic letters, East European Latin letters, West European letters, and the (Arabic?) word I copied from your post. Does it work for you with this file?

Leledumbo

Re: CSVDocument and UTF-8
« Reply #3 on: February 29, 2012, 05:00:49 am »
Quote
Does it work for you with this file?
Hmm... yes, your file works. However, I found something else:

this (1): ‏خبریں
is how this (2): ‏خبریں

gets displayed in my system code page (I don't know which one it is, but I guess cp1252). So it looks like (2) was read as (1) in that code page, but on writing, (1) was written out as UTF-8, so the Urdu word is lost.

Any idea how to solve this?

Leledumbo

Re: CSVDocument and UTF-8
« Reply #4 on: March 01, 2012, 08:15:12 am »
OK, I think I found the problem in my program. Attached is the self-contained source code. It looks like XMLWrite is causing the problem: when I write each cell into another TStringList and finish with SaveToFile, the strings are not re-encoded as UTF-8, so the original value is preserved, whereas WriteXMLFile changes the encoding and loses the original value.

Leledumbo

Re: CSVDocument and UTF-8
« Reply #5 on: March 01, 2012, 06:25:49 pm »
Finally found the solution: simply UTF8Decode() all the strings before assigning them to the XML tree node Attributes or TextContent.
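
A minimal sketch of that fix, assuming the FPC DOM and XMLWrite units. The element names, the output file name and the sample bytes are made up for illustration; in the real program the value would come from a TCSVDocument cell.

```pascal
program Utf8ToXml;  // hypothetical program name

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, XMLWrite;

var
  Doc: TXMLDocument;
  Root, Cell: TDOMElement;
  Utf8Value: string;
begin
  // Stand-in for a cell read from the CSV file: the raw UTF-8 bytes
  // D8 AE, which encode the Arabic letter U+062E.
  Utf8Value := #$D8#$AE;

  Doc := TXMLDocument.Create;
  try
    Root := Doc.CreateElement('row');    // 'row' and 'cell' are illustrative names
    Doc.AppendChild(Root);

    Cell := Doc.CreateElement('cell');
    // The DOM stores wide strings, so decode the UTF-8 bytes explicitly
    // instead of relying on an implicit code page conversion.
    Cell.SetAttribute('value', UTF8Decode(Utf8Value));
    Cell.TextContent := UTF8Decode(Utf8Value);
    Root.AppendChild(Cell);

    WriteXMLFile(Doc, 'cells.xml');      // illustrative output file name
  finally
    Doc.Free;
  end;
end.
```

Without the explicit UTF8Decode, the string is converted to a wide string as if it were in the system code page, which matches the symptom described earlier in the thread.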

Zoran

Re: CSVDocument and UTF-8
« Reply #6 on: March 01, 2012, 10:32:24 pm »
It seems that the XML units work with wide (UCS-2) strings; see: http://www.lazarus.freepascal.org/index.php/topic,16087.msg87093.html#msg87093
So, the problem has nothing to do with CSVDocument.

Leledumbo

Re: CSVDocument and UTF-8
« Reply #7 on: March 02, 2012, 03:01:18 am »
Yes, as I said in my last two replies, the error is related to XMLWrite, not CsvDocument.