
Topic: GuessEncoding and CovertEncoding, with SDFDataset, on Linux

tatamata

  • Hero Member
  • *****
  • Posts: 804
    • ZMSQL - SQL enhanced in-memory database
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #45 on: November 11, 2016, 02:40:46 pm »
Any simple example of how to use ICU4PAS for charset detection & conversion of a file?

Roland57

  • Hero Member
  • *****
  • Posts: 609
    • msegui.net
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #46 on: November 11, 2016, 04:05:50 pm »
Quote
@RolandC
If you switch to release mode, the DLL size will decrease from 415 KB to 183 KB.

Yes, 183 KB exactly. Thank you for the tip.  :)

Quote
Any simple example of how to use ICU4PAS for charset detection & conversion of a file?

I am only just discovering that library. After downloading the ICU binaries from here:

http://www.icu-project.org/download/3.6.html

I was able to compile and run the csdet example successfully (under Windows 10).

Quote
C:\Atelier\Pascal\icu\samples\csdet>set path=..\..\bin;%path%

C:\Atelier\Pascal\icu\samples\csdet>example

C:\Atelier\Pascal\icu\samples\csdet>csdet 10.txt 134.txt 2001.txt
10.txt:
UTF-8 (**) 100
windows-1252 (fr) 98
ISO-8859-5 (ru) 3
windows-1254 (tr) 3
windows-1252 (en) 2
windows-1252 (nl) 2
windows-1252 (pt) 2
windows-1252 (sv) 2
windows-1252 (da) 1
windows-1252 (es) 1
windows-1252 (it) 1
windows-1252 (no) 1
windows-1250 (ro) 1

134.txt:
UTF-8 (**) 100
GB18030 (zh) 31
windows-1252 (fr) 24
Big5 (zh) 20
windows-1252 (nl) 18
windows-1252 (de) 17
windows-1252 (sv) 16
windows-1252 (en) 15
windows-1252 (da) 15
windows-1252 (pt) 15
windows-1254 (tr) 15
windows-1252 (es) 14
windows-1250 (hu) 14
windows-1252 (no) 13
windows-1252 (it) 12
windows-1250 (ro) 11
windows-1250 (pl) 10
windows-1250 (cs) 8
windows-1253 (el) 6
KOI8-R (ru) 2

2001.txt:
UTF-16LE (**) 100
ISO-8859-5 (ru) 98
Shift_JIS (ja) 10
ISO-8859-1 (pt) 4
ISO-8859-1 (it) 3
ISO-8859-1 (en) 2
ISO-8859-1 (es) 2
ISO-8859-2 (cs) 2
ISO-8859-2 (hu) 2
ISO-8859-2 (ro) 2
ISO-8859-1 (da) 1
ISO-8859-1 (fr) 1
ISO-8859-1 (no) 1
ISO-8859-2 (pl) 1

C:\Atelier\Pascal\icu\samples\csdet>
« Last Edit: November 11, 2016, 04:07:43 pm by RolandC »
My projects are on Codeberg.
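
For readers who want to act on a result list like the one above, here is a minimal sketch of picking the candidate with the highest confidence. It does not call ICU at all: the record type `TCharsetGuess` and the function `BestGuess` are hypothetical names introduced for illustration, and the sample values are copied from the first three lines of the 10.txt output.

```pascal
program BestMatch;
{$mode objfpc}{$H+}

type
  // Hypothetical record mirroring one line of the csdet output
  TCharsetGuess = record
    Name: string;
    Lang: string;
    Confidence: Integer;
  end;

// Returns the index of the guess with the highest confidence,
// or -1 when the list is empty.
function BestGuess(const Guesses: array of TCharsetGuess): Integer;
var
  i: Integer;
begin
  Result := -1;
  for i := 0 to High(Guesses) do
    if (Result < 0) or (Guesses[i].Confidence > Guesses[Result].Confidence) then
      Result := i;
end;

const
  // First three candidates reported for 10.txt
  Sample: array[0..2] of TCharsetGuess = (
    (Name: 'UTF-8'; Lang: '**'; Confidence: 100),
    (Name: 'windows-1252'; Lang: 'fr'; Confidence: 98),
    (Name: 'ISO-8859-5'; Lang: 'ru'; Confidence: 3)
  );

var
  Idx: Integer;
begin
  Idx := BestGuess(Sample);
  if Idx >= 0 then
    WriteLn(Sample[Idx].Name, ' (', Sample[Idx].Lang, ') ', Sample[Idx].Confidence);
end.
```

The lists above already appear sorted by confidence, so taking the first entry would probably do as well; the loop simply avoids relying on that ordering.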

tatamata

Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #47 on: November 11, 2016, 08:31:06 pm »
@RolandC, did you try to load and use the ICU4PAS library in a GUI app instead of a console program?

tatamata

Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #48 on: November 11, 2016, 09:44:43 pm »
I put both .so files (libicu4pas.so and libicu4pasd.so) in the app folder.
In FormCreate I have
Code: Pascal
vUIC4PasLibPath:=extractfiledir(application.exename) + DirectorySeparator + {$ifdef windows}'icu4pas36.'{$else}'libicu4pas.'{$endif} + SharedSuffix;
vUIC4PasLibPath1:=extractfiledir(application.exename) + DirectorySeparator + {$ifdef windows}'icu4pas36d.'{$else}'libicu4pasd.'{$endif} + SharedSuffix;
LoadLibrary(vUIC4PasLibPath);
LoadLibrary(vUIC4PasLibPath1);
Then I have an encoding detection function:
Code: Pascal
function DetermineCSVFileEncoding(pFilePathName: string): string;
const
  BUFFER_SIZE = 8192;
var
...
  csd: UCharsetDetector_ptr;
  csm: UCharsetMatch_ptr_ptr;
  status: UErrorCode = U_ZERO_ERROR;
  name, lang: PChar;
  inputLength, match, confidence: int32_t;
  matchCount: int32_t = 0;
  file_: file;
  filename: string;
  buffer: array[0..BUFFER_SIZE - 1] of char;
...
begin
  if FileExistsUTF8(pFilePathName) then begin
....
    // Open CSV file
    filename := pFilePathName;
    Assign(file_, filename);
    Reset(file_, 1);
    inputLength := fread(file_, @buffer[0], BUFFER_SIZE);
    Close(file_);

    // Detect charset
    csd := ucsdet_open(@status);
    ucsdet_setText(csd, @buffer[0], inputLength, @status);
    csm := ucsdet_detectAll(csd, @matchCount, @status);
    match := 0;
    while match < matchCount do begin
      name := ucsdet_getName(UCharsetMatch_ptr_ptr(ptrcomp(csm) + match * sizeof(UCharsetMatch_ptr))^, @status);
      lang := ucsdet_getLanguage(UCharsetMatch_ptr_ptr(ptrcomp(csm) + match * sizeof(UCharsetMatch_ptr))^, @status);
      confidence := ucsdet_getConfidence(UCharsetMatch_ptr_ptr(ptrcomp(csm) + match * sizeof(UCharsetMatch_ptr))^, @status);
      if (lang = nil) or (StrLen(lang) = 0) then
        lang := PChar('**');

      ShowMessage(name + ' (' + lang + ') ' + IntToStr(confidence));

      Inc(match);
    end;
    ucsdet_close(csd);
...
I also added the following paths to Other Unit Files (-Fu):
icu4pas-3_6-rm1/source
icu4pas-3_6-rm1/source/layout
icu4pas-3_6-rm1/source/unicode
and added this one to Include Files (-Fi):
icu4pas-3_6-rm1/source

I get a linking error during compilation.
What am I doing wrong?
« Last Edit: November 12, 2016, 09:15:18 am by tatamata »
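
The FormCreate snippet above discards the result of LoadLibrary, so a failed load goes unnoticed. Below is a minimal sketch of checking the returned handle with the FPC `dynlibs` unit. `CanLoadLibrary` is a hypothetical helper name, and the library file names are simply the ones from the post. Note that this is purely a run-time check: loading a library dynamically this way is separate from what the linker resolves when the program is built.

```pascal
program CheckLibLoad;
{$mode objfpc}{$H+}

uses
  SysUtils, dynlibs;

// Hypothetical helper: returns True when the shared library at LibPath
// can be loaded at run time. The library is unloaded again immediately.
function CanLoadLibrary(const LibPath: string): Boolean;
var
  Handle: TLibHandle;
begin
  Handle := LoadLibrary(LibPath);
  Result := Handle <> NilHandle;
  if Result then
    UnloadLibrary(Handle);
end;

var
  LibPath: string;
begin
  // Same naming scheme as in the post: library next to the executable
  LibPath := ExtractFilePath(ParamStr(0)) +
    {$ifdef windows}'icu4pas36.'{$else}'libicu4pas.'{$endif} + SharedSuffix;
  if CanLoadLibrary(LibPath) then
    WriteLn('Loaded: ', LibPath)
  else
    WriteLn('Could not load: ', LibPath);
end.
```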

Roland57

Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #49 on: November 14, 2016, 02:47:52 pm »
Quote
I get a linking error during compilation.
What am I doing wrong?

Sorry for the late answer, but I have just read something interesting here:

Quote
Note that, in Linux, you will have to compile and install icu4pas shared library anyway, because linking process on Linux requires existing shared objects at compile time. On Windows, having the icu4pas dynamic link library on path at compile time is optional.
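
To illustrate what "requires existing shared objects at compile time" means, here is a minimal sketch of a statically declared import. It deliberately uses `strlen` from the C library instead of an ICU function, so it is only an analogy and assumes a Unix-like system. With an `external` declaration like this, the linker must be able to find the named shared object when the program is built. In Free Pascal, extra directories for that search are added with the `-Fl` option; `LD_LIBRARY_PATH` is normally read by the dynamic loader when the program starts.

```pascal
program LinkTimeImport;
{$mode objfpc}{$H+}
{$ifndef unix}{$fatal This sketch assumes a Unix-like system with libc}{$endif}

uses
  ctypes;

// Static import: the linker has to resolve 'strlen' against libc
// at build time, so the shared object must exist on the build machine.
function strlen(s: PChar): csize_t; cdecl; external 'c' name 'strlen';

begin
  // Prints 7, the length of the literal
  WriteLn(strlen('icu4pas'));
end.
```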


Thaddy

  • Hero Member
  • *****
  • Posts: 19268
  • Glad to be alive.
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #50 on: November 14, 2016, 03:03:54 pm »
Oh well. icu4pas lags behind. I am busy with a cross-platform solution (as I wrote last week). It is a lot of work, though; I started from scratch.
And that should also work with dynlibs once the .so files are installed, just like the DLL under Windows needs to be installed. No difference there.

Note that both libraries discussed here need a big enough sample text (~1-5 KB) to detect the encoding. So don't get your hopes up if you want it to detect a single string.
On a larger text, though, it is possible to detect the encoding fairly accurately and convert the text.
Both are unusable for detecting and converting shorter strings.
objects are fine constructs. You can even initialize them with constructors.
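
One case where sample size matters less is a file that starts with a byte-order mark, since the mark alone identifies the encoding. The following minimal sketch shows a BOM check that could run before any statistical detection. `EncodingFromBom` is a hypothetical helper and not part of either library discussed here. It covers only the UTF-8 and UTF-16 marks.

```pascal
program BomCheck;
{$mode objfpc}{$H+}

// Hypothetical helper: returns the encoding implied by a byte-order mark
// at the start of Data, or an empty string when none is found.
// Caveat: FF FE is also the start of a UTF-32LE mark (FF FE 00 00),
// which this sketch does not distinguish.
function EncodingFromBom(const Data: array of Byte): string;
begin
  Result := '';
  if (Length(Data) >= 3) and (Data[0] = $EF) and (Data[1] = $BB) and (Data[2] = $BF) then
    Result := 'UTF-8'
  else if (Length(Data) >= 2) and (Data[0] = $FF) and (Data[1] = $FE) then
    Result := 'UTF-16LE'
  else if (Length(Data) >= 2) and (Data[0] = $FE) and (Data[1] = $FF) then
    Result := 'UTF-16BE';
end;

begin
  WriteLn(EncodingFromBom([$EF, $BB, $BF, $41]));    // UTF-8 mark followed by 'A'
  WriteLn(EncodingFromBom([$FF, $FE, $41, $00]));    // UTF-16LE mark followed by 'A'
  WriteLn('[', EncodingFromBom([$41, $42]), ']');    // no mark: empty result
end.
```

When no mark is present, which is common for CSV files, the result is empty and a detector working on a larger sample is still needed.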

tatamata

Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #51 on: November 14, 2016, 04:00:53 pm »
In my case, it is about huge CSV files, so it should work.
But I still cannot compile: I still get a linking error, even if I put the .so files in /usr/local/lib/ and export the path
Code: Bash
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
as mentioned in the Readme file of icu4pas...
I don't get it. What is wrong?

Roland57

Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #52 on: November 14, 2016, 06:36:55 pm »
Quote
But I still cannot compile: I still get a linking error, even if I put the .so files in /usr/local/lib/ and export the path
Code: Bash
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
as mentioned in the Readme file of icu4pas...
I don't get it. What is wrong?

I can't help you because I don't use Linux. Just one question: did you download the "good" version (i.e. 3.6) of the ICU binaries?
« Last Edit: November 14, 2016, 06:38:59 pm by RolandC »

tatamata

Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #53 on: November 15, 2016, 12:25:42 pm »
Quote
But I still cannot compile: I still get a linking error, even if I put the .so files in /usr/local/lib/ and export the path
Code: Bash
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
as mentioned in the Readme file of icu4pas...
I don't get it. What is wrong?

I can't help you because I don't use Linux. Just one question: did you download the "good" version (i.e. 3.6) of the ICU binaries?

Hi! Yes, the ICU version is 3.6...

tatamata

Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #54 on: November 17, 2016, 07:46:30 pm »
Quote
I am working on the ICU alternative, which is the standard detection library. But I also disagree to some extent that a lot of work needs to be done on this particular code.
Point me to what you want? Because this is pure Pascal, so from a purist point of view I actually like it.

BTW: 1252 vs 1250 is an issue: it should favor 1250 over 1252, but that is easy to fix.

@Thaddy, could you please do the fix? :-X I have no clue how to do it...

 
