
Author Topic: GuessEncoding and ConvertEncoding, with SDFDataset, on Linux

tatamata

« on: November 06, 2016, 09:20:41 am »
I have a procedure which reads CSV files (loaded by SDFDataset) and imports them as PostgreSQL tables.
One part of the procedure guesses the encoding of the CSV file and tries to convert it to UTF-8 (the PostgreSQL database is UTF-8).
Code: Pascal
  // Transform the string cell value: ensure UTF-8 encoding, then enclose the value in single quotes
  vFieldValueStr := ConvertEncoding(SdfDataset1.FieldByName(vFieldSourceName).AsString,
                    GuessEncoding(SdfDataset1.FieldByName(vFieldSourceName).AsString), EncodingUTF8);
  vFieldValueStr := UTF8QuotedStr(vFieldValueStr, ''''); // single quotes around the string
This procedure works fine on Windows and the strings are imported correctly into the database. I have tested with both WIN1250 and UTF-8 encoded CSV files and the imported tables are always OK.
But on a Linux system I get hieroglyphs instead of the South Slavic letters č, ć, š, đ, ž when importing from WIN1250 encoded CSV files.
What am I doing wrong here?
« Last Edit: November 06, 2016, 09:44:19 am by tatamata »

Thaddy

« Reply #1 on: November 06, 2016, 09:50:52 am »
What does
echo $LANG
say on your Linux system? Is it really UTF-8?
And what does
locale
say?
« Last Edit: November 06, 2016, 09:55:40 am by Thaddy »
objects are fine constructs. You can even initialize them with constructors.

tatamata

« Reply #2 on: November 06, 2016, 11:02:04 am »
echo $LANG says:
Code:
hr_HR.UTF-8

and locale says:
Code:
LANG=hr_HR.UTF-8
LANGUAGE=
LC_CTYPE="hr_HR.UTF-8"
LC_NUMERIC="hr_HR.UTF-8"
LC_TIME="hr_HR.UTF-8"
LC_COLLATE="hr_HR.UTF-8"
LC_MONETARY="hr_HR.UTF-8"
LC_MESSAGES="hr_HR.UTF-8"
LC_PAPER="hr_HR.UTF-8"
LC_NAME="hr_HR.UTF-8"
LC_ADDRESS="hr_HR.UTF-8"
LC_TELEPHONE="hr_HR.UTF-8"
LC_MEASUREMENT="hr_HR.UTF-8"
LC_IDENTIFICATION="hr_HR.UTF-8"
LC_ALL=


tatamata

« Reply #3 on: November 06, 2016, 07:40:58 pm »
Should I replace TSdfDataset with TCsvDocument (http://wiki.freepascal.org/CsvDocument)? Is TCsvDocument still actively developed?

wp

« Reply #4 on: November 06, 2016, 08:04:45 pm »
TCsvDocument is included in FPC now (units csvreadwrite and csvdocument in fpc/.../packages/fcl-base/src). It won't help you, I fear. The problem is that you are expecting too much of GuessEncoding. A single-byte code page contains only 256 characters, and without any additional information it is impossible to tell whether the byte $D0 refers to the character 'Ð' (ISO8859-1) or 'Π' (CP1253), etc. Maybe some kind of semantic analysis would help. The purpose of GuessEncoding is mostly to distinguish UTF8 with/without BOM from UTF16 with/without BOM from ANSI. It returns CP_ISO_8859_1 as a fallback result.
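
The distinction described above can be sketched in plain Free Pascal. This is only an illustration and not the LazUtils implementation of GuessEncoding: the function name SniffEncodingOf and the labels it returns are made up for the example, the UTF-8 check is simplified (it does not reject overlong forms or surrogates), and UTF-16 without a BOM is not handled. It shows that a BOM or well-formed UTF-8 can be recognised from the bytes alone, while a stray high byte only proves that some single-byte code page is in use, not which one.

```pascal
program SniffEncoding;
{$mode objfpc}{$H+}

// Hypothetical helper, NOT the LazUtils GuessEncoding implementation.
// Returns a coarse label based only on what the bytes themselves can prove.
function SniffEncodingOf(const S: RawByteString): string;
const
  Unknown = 'single-byte, code page unknown';
var
  i, n, Need: Integer;
  b: Byte;
  HasHighByte: Boolean;
begin
  n := Length(S);
  // Byte order marks are unambiguous signatures.
  if (n >= 3) and (S[1] = #$EF) and (S[2] = #$BB) and (S[3] = #$BF) then
    Exit('utf8bom');
  if (n >= 2) and (S[1] = #$FF) and (S[2] = #$FE) then
    Exit('utf16le');
  if (n >= 2) and (S[1] = #$FE) and (S[2] = #$FF) then
    Exit('utf16be');

  // No BOM: check whether every byte fits a UTF-8 lead/continuation pattern.
  HasHighByte := False;
  i := 1;
  while i <= n do
  begin
    b := Ord(S[i]);
    if b < $80 then
      Need := 0
    else if (b >= $C2) and (b <= $DF) then
      Need := 1
    else if (b >= $E0) and (b <= $EF) then
      Need := 2
    else if (b >= $F0) and (b <= $F4) then
      Need := 3
    else
      Exit(Unknown);            // byte cannot start a UTF-8 sequence
    if b >= $80 then
      HasHighByte := True;
    Inc(i);
    while Need > 0 do
    begin
      // Each continuation byte must look like 10xxxxxx.
      if (i > n) or ((Ord(S[i]) and $C0) <> $80) then
        Exit(Unknown);
      Inc(i);
      Dec(Need);
    end;
  end;

  if HasHighByte then
    Result := 'utf8'
  else
    Result := 'ascii';
end;

begin
  WriteLn(SniffEncodingOf(#$EF#$BB#$BF'abc'));  // utf8bom
  WriteLn(SniffEncodingOf('plain'));            // ascii
  WriteLn(SniffEncodingOf(#$C4#$8D));           // utf8 (the two bytes of U+010D)
  WriteLn(SniffEncodingOf(#$E8));               // single-byte, code page unknown
end.
```

The last call is the situation from the first post: $E8 is the letter č in WIN1250, but nothing in the byte itself says so.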

tatamata

« Reply #5 on: November 06, 2016, 08:55:39 pm »
What do you mean by "semantic analysis"?
The problem is that the application should support importing any CSV file and I don't know which encoding it will be...

balazsszekely

« Reply #6 on: November 06, 2016, 09:54:05 pm »
Quote
@tatamata
The problem is that the application should support importing any CSV file and I don't know which encoding it will be

The thing is, you cannot reliably guess the encoding.

wp

« Reply #7 on: November 06, 2016, 10:10:34 pm »
Quote
What do you mean by "semantic analysis"?

I mean: analyze the most important words, or the frequencies of characters or character combinations, in the languages used by a codepage. It's just an idea. I don't know if Notepad++ does it this way, but it is able to determine the code page of a file.
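
A very reduced sketch of the frequency idea, assuming the text is known to be Croatian and only CP1250 and ISO-8859-2 are candidates. The names CountLetterHits and GuessCroatianCodePage are hypothetical, and the scoring is far cruder than what a real detector would need: it only counts how many bytes fall on the positions where each code page stores Š, Ž, š, ž, Ć, Č, Đ, ć, č, đ.

```pascal
program ScoreCodePages;
{$mode objfpc}{$H+}

type
  TByteSet = set of Byte;

const
  // Byte values of the letters Š Ž š ž Ć Č Đ ć č đ in each code page.
  CroatianCP1250: TByteSet = [$8A, $8E, $9A, $9E, $C6, $C8, $D0, $E6, $E8, $F0];
  CroatianLatin2: TByteSet = [$A9, $AE, $B9, $BE, $C6, $C8, $D0, $E6, $E8, $F0];

// Counts the bytes of S that land on an expected letter position.
function CountLetterHits(const S: RawByteString; const Expected: TByteSet): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if Ord(S[i]) in Expected then
      Inc(Result);
end;

// Hypothetical helper: picks the candidate whose letter positions match more bytes.
function GuessCroatianCodePage(const S: RawByteString): string;
var
  Hits1250, HitsLatin2: Integer;
begin
  Hits1250 := CountLetterHits(S, CroatianCP1250);
  HitsLatin2 := CountLetterHits(S, CroatianLatin2);
  if Hits1250 > HitsLatin2 then
    Result := 'cp1250'
  else if HitsLatin2 > Hits1250 then
    Result := 'iso-8859-2'
  else
    Result := 'undecided';   // the bytes fit both candidates equally well
end;

begin
  // "kuća, šuma, život" stored as CP1250 bytes
  WriteLn(GuessCroatianCodePage('ku'#$E6'a, '#$9A'uma, '#$9E'ivot'));  // cp1250
  // the same words stored as ISO-8859-2 bytes
  WriteLn(GuessCroatianCodePage('ku'#$E6'a, '#$B9'uma, '#$BE'ivot'));  // iso-8859-2
  // "kuća" alone: ć sits at $E6 in both code pages
  WriteLn(GuessCroatianCodePage('ku'#$E6'a'));                         // undecided
end.
```

The third call shows the limit of the approach: short fields, or fields that only use letters shared by both code pages, give no basis for a decision.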

tatamata

« Reply #8 on: November 06, 2016, 11:34:43 pm »
OK, instead of guessing the encoding, which might fail, is there a way to convert the CSV file to UTF-8 with BOM that will always succeed, no matter what the initial encoding was?

Bart

« Reply #9 on: November 06, 2016, 11:55:58 pm »
Quote
Ok, instead of guessing the encoding, which might fail, is there a way to convert the csv file to UTF with BOM in a way that will always succeed, no matter what the initial encoding was?

How is that going to work if you do not know the original encoding in the first place?
The conversion routine needs to convert from a given encoding to another encoding.
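
A small sketch of why the source encoding has to be given: the same input byte produces different UTF-8 output depending on which code page is assumed. DecodeD0 is a hypothetical, deliberately tiny lookup covering only the byte $D0 in three code pages; a real converter carries a full table per code page.

```pascal
program SameByteDifferentText;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Encodes one Unicode code point below U+10000 as UTF-8 bytes.
function CodePointToUTF8(CP: Cardinal): RawByteString;
begin
  Result := '';
  if CP < $80 then
  begin
    SetLength(Result, 1);
    Result[1] := AnsiChar(Byte(CP));
  end
  else if CP < $800 then
  begin
    SetLength(Result, 2);
    Result[1] := AnsiChar(Byte($C0 or (CP shr 6)));
    Result[2] := AnsiChar(Byte($80 or (CP and $3F)));
  end
  else
  begin
    SetLength(Result, 3);
    Result[1] := AnsiChar(Byte($E0 or (CP shr 12)));
    Result[2] := AnsiChar(Byte($80 or ((CP shr 6) and $3F)));
    Result[3] := AnsiChar(Byte($80 or (CP and $3F)));
  end;
end;

// Hypothetical lookup: the Unicode code point that byte $D0 stands for
// under the assumed source encoding.
function DecodeD0(const SourceEncoding: string): Cardinal;
begin
  if SourceEncoding = 'iso-8859-1' then
    Result := $00D0        // Ð
  else if SourceEncoding = 'cp1253' then
    Result := $03A0        // Π
  else if SourceEncoding = 'cp1250' then
    Result := $0110        // Đ
  else
    Result := $FFFD;       // replacement character for an unknown source
end;

// Formats the bytes of S as space-separated hex pairs.
function BytesToHex(const S: RawByteString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Integer(Ord(S[i])), 2);
  end;
end;

begin
  WriteLn('iso-8859-1: ', BytesToHex(CodePointToUTF8(DecodeD0('iso-8859-1'))));  // C3 90
  WriteLn('cp1253:     ', BytesToHex(CodePointToUTF8(DecodeD0('cp1253'))));      // CE A0
  WriteLn('cp1250:     ', BytesToHex(CodePointToUTF8(DecodeD0('cp1250'))));      // C4 90
end.
```

All three results are valid UTF-8, so a conversion started from the wrong assumption does not fail; it silently produces the wrong letter.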

Bart

tatamata

« Reply #10 on: November 08, 2016, 07:33:22 am »
Yes, Bart, I get your point, but there are text editors that apparently can change the encoding in one step, without the initial encoding being specified. See CudaText, for example. Though I'm not sure whether CudaText does it in Pascal or in Python (?).
This is of utmost importance to me, so I will continue the pursuit.
In the meantime, I have found UTF8Tools (http://wiki.lazarus.freepascal.org/UTF8_Tools). It seemed promising, so I tried converting the CSV file before loading it with SdfDataset:
Code: Pascal
procedure ConvertFileToUTF8(pFilePathName: string);
var
  sl: TStringList;
  f: TCharEncStream;
begin
  // Initialize to nil so the finally block is safe even if a constructor raises
  sl := nil;
  f := nil;
  try
    sl := TStringList.Create;
    f := TCharEncStream.Create;
    sl.LoadFromFile(pFilePathName);
    f.LoadFromFile(pFilePathName);
    // Keep a backup copy of the original file
    sl.SaveToFile(pFilePathName + '.bcp');
    // Replace the content with the UTF-8 text produced by TCharEncStream
    sl.Text := f.UTF8Text;
    sl.SaveToFile(pFilePathName);
  finally
    sl.Free;
    f.Free;
  end;
end;
Unfortunately, it didn't help either.

balazsszekely

« Reply #11 on: November 08, 2016, 08:04:12 am »
Quote
@tatamata
but there are text editors that apparently can change encoding in one step, without specifying initial encoding.

Yes, the editors can guess the initial encoding. Notepad++ does a great job, though it sometimes fails. The following wiki page explains why charset detection is so unreliable: https://en.wikipedia.org/wiki/Charset_detection

Perhaps you can port the following Delphi library to Lazarus: http://chsdet.sourceforge.net/
Others (non-Pascal):
  Mozilla:  https://dxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/
  Python: https://pypi.python.org/pypi/chardet

It would be nice to have similar feature in Lazarus.

Thaddy

« Reply #12 on: November 08, 2016, 09:43:13 am »
Quote
Perhaps you can port the following delphi library to Lazarus: http://chsdet.sourceforge.net/

@Getmem
That's not difficult. It took 3 minutes (the second time; the first effort took 30 minutes).
Steps:
Unpack the zip file in some directory.
Go to that directory.
Open the chsd_dll_intf.pas file. Change stdcall to {$ifdef windows}stdcall{$else}cdecl{$endif} (search and replace).
Change to the src directory.
Open the chsdet.dpr file in src. Comment out the *.res.
Open a terminal window (I compiled for Linux) in the src directory.
Compile from the command line: fpc -Mdelphi -Fu./mbclass:./sbseq  chsdet.dpr

That gave me libchsdet.so ;)

Job done ;)  No dependencies on Windows.

Note this has to be done from the command line with -Mdelphi.
If you want to compile from Lazarus you have to add {$ifdef fpc}{$mode delphi}{$endif} to every single unit, but that is not necessary to build the library.

If it is tested by y'all and useful, maybe we can include it as a package. It is cross-platform.

Next job for me: look at a possible  ICU-c58 interface. That's more or less the standard.

[EDIT]
I forgot that chsdIntf also needs the conversion from stdcall to {$ifdef windows}stdcall{$else}cdecl{$endif}
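
For readers unfamiliar with the construct, this is what the conditional calling convention looks like on a declaration. AddOne is a hypothetical stand-in and not a routine from chsdet; the sketch only shows that the directive can sit in the place where stdcall used to be.

```pascal
program CallConvDemo;
{$mode objfpc}{$H+}

// Hypothetical function standing in for an exported library routine.
// stdcall is used on Windows, cdecl everywhere else.
function AddOne(Value: LongInt): LongInt; {$ifdef windows}stdcall{$else}cdecl{$endif};
begin
  Result := Value + 1;
end;

begin
  WriteLn(AddOne(41));  // 42
end.
```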
« Last Edit: November 08, 2016, 12:28:27 pm by Thaddy »

balazsszekely

« Reply #13 on: November 08, 2016, 11:41:07 am »
@Thaddy
I know the conversion is easy; the question is whether the encoding detection works. Did you run a few tests?

Thaddy

« Reply #14 on: November 08, 2016, 12:14:06 pm »
Quote
@Thaddy
I know the conversion is easy, the question is the encoding detection works? Did you run a few test?

I am playing with it ;) Just like with my next task ;) Play a little with it yourself. You know how to do that...
I'll report back. Dunno how good it is. I am testing against Dutch/English/Lithuanian.

btw: I wrote:
Quote
If it is tested by y'all and useful, maybe we can include it as a package. It is cross-platform.

So I expect some effort by others....

I forgot that chsdIntf also needs the conversion from stdcall to {$ifdef windows}stdcall{$else}cdecl{$endif}

As always: first make it compile, then test if it works, then test if it is any good. Work in progress..  >:D
« Last Edit: November 08, 2016, 12:39:07 pm by Thaddy »