
Author Topic: GuessEncoding and CovertEncoding, with SDFDataset, on Linux  (Read 27128 times)

balazsszekely

  • Guest
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #15 on: November 08, 2016, 01:15:13 pm »
Quote
@Thaddy
I am playing with it ;) Just like with my next task ;) I'll report back
Ok, thanks. I'm interested in this subject, but at the moment I'm busy with other things. I can help later though.

tatamata

  • Hero Member
  • *****
  • Posts: 804
    • ZMSQL - SQL enhanced in-memory database
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #16 on: November 08, 2016, 04:23:39 pm »
Quote
Perhaps you can port the following delphi library to Lazarus: http://chsdet.sourceforge.net/
@Getmem
That's not difficult. It took 3 minutes (the second time; the first attempt took 30 minutes).
Steps:
Unpack the zip file in some directory.
Go to that directory.
Open the chsd_dll_intf.pas file. Change stdcall to {$ifdef windows}stdcall{$else}cdecl{$endif} (search and replace).
Change to the src directory.
Open the chsdet.dpr file in src. Comment out the *.res directive.
Open a terminal window in the src directory (I compiled for Linux).
Compile from the command line: fpc -Mdelphi -Fu./mbclass:./sbseq chsdet.dpr

That gave me libchsdet.so ;)

Job done ;)  No dependencies on windows.

Note this has to be done from the command line with -Mdelphi.
If you want to compile from Lazarus you have to add {$ifdef fpc}{$mode delphi}{$endif} to every single unit, but that is not necessary to build the library.

If it is tested by y'all and useful, maybe we can include it as a package. It is cross-platform.

Next job for me: look at a possible  ICU-c58 interface. That's more or less the standard.

[EDIT]
I forgot that chsdIntf also needs the conversion from stdcall to {$ifdef windows}stdcall{$else}cdecl{$endif}
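
The calling-convention switch described in the steps above can be sketched in isolation. This is only an illustration: DoubleIt is a hypothetical stand-in, not a routine from chsdet. It shows that the conditional directive can sit directly where a plain stdcall used to be.

```pascal
program callconvdemo;
{$ifdef fpc}{$mode delphi}{$endif}

// Hypothetical stand-in for a library routine; not part of chsdet.
// stdcall is used on Windows, cdecl on every other target.
function DoubleIt(Value: Integer): Integer; {$ifdef windows}stdcall{$else}cdecl{$endif};
begin
  Result := Value * 2;
end;

begin
  writeln(DoubleIt(21)); // prints 42
end.
```

The same directive has to appear on both sides of the library boundary (the exported routine and the import declaration), otherwise caller and callee disagree about the convention.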
I tried to follow your recipe, but I get an error:
Code:
zlatko@zlatko-HP-ProBook-6570b ~/Preuzimanja/chsdet_026_src/src $ fpc -Mdelphi -Fu./mbclass:./sbseq  chsdet.dpr
Free Pascal Compiler version 3.0.0+dfsg-2 [2016/01/28] for x86_64
Copyright (c) 1993-2015 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling chsdet.dpr
Linking libchsdet.so
/usr/bin/ld.bfd: warning: link.res contains output sections; did you forget -T?
/usr/bin/ld.bfd: chsdIntf.o: relocation R_X86_64_32S against `TC_$CHSDINTF_$$_DETECTOR' can not be used when making a shared object; recompile with -fPIC
chsdIntf.o: error adding symbols: Bad value
chsdet.dpr(39) Error: Error while linking
chsdet.dpr(39) Fatal: There were 1 errors compiling module, stopping
Fatal: Compilation aborted
Error: /usr/bin/ppcx64 returned an error exitcode
When I convert the Delphi project and units with CodeTyphon, I can compile and get libchsdet.so, but then I get a linking error when I try to use the .so library in my project...

Thaddy

  • Hero Member
  • *****
  • Posts: 18786
  • To Europe: simply sell USA bonds: dollar collapses
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #17 on: November 09, 2016, 05:10:50 am »
Recompile with -fPIC, because it is a library. You may also need -Xc but that should not be necessary. My mistake.
Windows works. Linux links but has some issues I am figuring out.
If Europe sells their USA bonds the USD will collapse. Europe can afford that given average state debts. The USA can't afford that. Just some advice...

Thaddy

  • Hero Member
  • *****
  • Posts: 18786
  • To Europe: simply sell USA bonds: dollar collapses
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #18 on: November 09, 2016, 06:10:31 am »
I've got it working, also under arm-linux. I will put up the sources in a few hours and post a link.
Code: Pascal
program testcharset;
{$ifdef fpc}{$mode delphi}{$endif}
uses chsdIntf,nsCore,classes;
var
  About:rAboutHolder;
  Info:rCharsetInfo;
  L:TStringlist;
  S:String;
begin
  // print the library's about text
  csd_GetAbout(About);
  writeln(About.About);
  L:=Tstringlist.Create;
  try
    L.LoadFromFile('../ReadMe.txt');// or whatever text file
    S :=L.Text;
    // feed the whole text to the detector in one go
    csd_Reset;
    csd_HandleData(PChar(S), Length(S));
    if not csd_Done then csd_DataEnd;
    Info := csd_GetDetectedCharset();
    writeln(Info.Name);
  finally
    L.Free;
  end;
end.

Outputs:
Code:
Charset Detector Library. Copyright (C) 2006 - 2008, Nick Yakowlew. http://chsdet.sourceforge.net
windows-1252
« Last Edit: November 09, 2016, 06:13:26 am by Thaddy »

balazsszekely

  • Guest
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #19 on: November 09, 2016, 06:40:37 am »
Ok. I converted the whole project to lazarus:
   1. Reformatted all units with the JEDI code formatter, because the code was a mess
   2. Added dynlibs to the uses clause and changed
Code: Pascal
CharsetDetectorLibrary = 'chsdet.dll';
   to
Code: Pascal
CharsetDetectorLibrary = 'chsdet.' + SharedSuffix;
      so the library extension is valid on all platforms.
   3. Fixed paths/includes in the project options
   4. Fixed hints, warnings

@tatamata
Open "chsdet.lpi" from the "src" dir and build it. The library is created in the same folder. Under Linux the library is named "libchsdet.so" instead of "chsdet.so"; please rename it. Copy "chsd_dll_intf.pas" and "chsdet.so" to your project folder. Your job is to test it.  :)
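
To see what the SharedSuffix change produces on a given platform, a minimal sketch like the following can be compiled and run. It assumes only FPC's dynlibs unit; the constant name is taken from the post above, and the program name is arbitrary.

```pascal
program libnamedemo;
{$ifdef fpc}{$mode delphi}{$endif}
uses dynlibs;

const
  // SharedSuffix comes from dynlibs and holds the platform's
  // shared-library extension (without the dot).
  CharsetDetectorLibrary = 'chsdet.' + SharedSuffix;

begin
  // Shows the file name the interface unit would try to load.
  writeln(CharsetDetectorLibrary);
end.
```

Note that the constant carries no "lib" prefix, which is why the file built on Linux has to be renamed or the constant adjusted.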

« Last Edit: November 09, 2016, 11:42:30 am by GetMem »

Thaddy

  • Hero Member
  • *****
  • Posts: 18786
  • To Europe: simply sell USA bonds: dollar collapses
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #20 on: November 09, 2016, 07:12:16 am »
@Getmem
Does my example work with your code too? It is statically linked here against the units.
It is verified against several codepages and texts of sufficient length.

Note that there are no warnings to fix, only hints, and these are harmless because of inheritance (except the ones about local vars).

This is what is left in my translation:
Code:
fpc -B -Fu./sbseq -Fu/home/pi/Downloads/charset/src/sbseq -Fu/usr/lib/gcc/arm-linux-gnueabihf/4.9.2 -glh -O- -Fu/home/pi/Development/FreePascal/lazarus/lcl/units/arm-linux/* -Fi/home/pi/fpctrunk/packages/fv/src  -Fu/usr/local/lib/fpc/3.1.1/units/arm-linux/* -Fu/home/pi/synapse -vwhe -CfVFPV4 "testcharset.pas" (in directory: /home/pi/Downloads/charset2/src)
CustomDetector.pas(21,25) Hint: Parameter "aBuf" not used  // is used in derivatives
CustomDetector.pas(21,39) Hint: Parameter "aLen" not used // same
nsEscCharsetProber.pas(49,2) Hint: Local const "NUM_OF_ESC_CHARSETS" is not used  // these can be used when configured as per docs
MBUnicodeMultiProber.pas(52,2) Hint: Local const "NUM_OF_PROBERS" is not used        // same
testcharset.pas(11,21) Hint: Variable "About" does not seem to be initialized  //it's a record for **** sake
Compilation finished successfully.
« Last Edit: November 09, 2016, 07:22:09 am by Thaddy »

balazsszekely

  • Guest
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #21 on: November 09, 2016, 07:27:02 am »
If I run your code with the newly created dll, every file is detected as ASCII. I downloaded the original dll from the author's page: same issue.
I set the file encodings with Notepad++ and double-checked with other text editors.

PS. AFAIK, from FPC 3.0.0 on, when you assign something to a string it will be converted to UTF-8. In your example:
Code: Pascal
S := L.Text;
In my opinion you should read the content of the file directly into a PChar buffer.

Thaddy

  • Hero Member
  • *****
  • Posts: 18786
  • To Europe: simply sell USA bonds: dollar collapses
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #22 on: November 09, 2016, 08:30:33 am »
I wonder why my Baltic file works then... But try this:
Code: Pascal
program testcharset;
{$ifdef fpc}{$mode delphi}{$endif}
uses chsdIntf,nsCore,classes;
var
  About:rAboutHolder;
  Info:rCharsetInfo;
  L:TFilestream;
  S:PChar;
begin
  csd_GetAbout(About);
  writeln(About.About);
  S := nil; // keeps FreeMem in the finally block safe if AllocMem fails
  L:=TFilestream.Create('../ReadMe.txt',fmOpenRead);
  try
    // read the raw bytes, so no string conversion can take place
    S := AllocMem(L.Size);
    L.ReadBuffer(S^,L.Size);
    csd_Reset;
    csd_HandleData(S, L.Size);
    if not csd_Done then csd_DataEnd;
    Info := csd_GetDetectedCharset();
    writeln(Info.Name);
  finally
    FreeMem(S);
    L.Free;
  end;
end.


Note I did not test Lazarus at all.
[edit]
I edit my files in Geany, but used files created on Windows (except the readme ;) ).
Detection seems fine. We need a proper detection set of files though. Any suggestions?
« Last Edit: November 09, 2016, 08:48:19 am by Thaddy »

balazsszekely

  • Guest
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #23 on: November 09, 2016, 08:55:28 am »
Now it's better.  :D
The attached text file (UTF-8) is correctly guessed. I will let @tatamata test other files with different encodings.
Anyway this is a good start, the code can be improved later.

Thaddy

  • Hero Member
  • *****
  • Posts: 18786
  • To Europe: simply sell USA bonds: dollar collapses
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #24 on: November 09, 2016, 09:08:36 am »
We should contact the author. Maybe he's reading this. 2008 is a long time ago.
He might be interested to know his code is cross-platform.

Roland57

  • Hero Member
  • *****
  • Posts: 587
    • msegui.net
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #25 on: November 09, 2016, 09:16:15 am »
Hello! Some time ago I made a little program that I called File Encoding Expert. It compares three ways of detecting the encoding of a file: using the Charset Detector library, using a Lazarus function, and using a Delphi function (previously exported in a DLL). It was compiled and tested under Windows only. I post it here: maybe it could interest someone, or maybe someone could help me to improve it.  :)

http://www.eschecs.fr/fichiers/fileencodingexpert.zip
My projects are on Codeberg.

Thaddy

  • Hero Member
  • *****
  • Posts: 18786
  • To Europe: simply sell USA bonds: dollar collapses
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #26 on: November 09, 2016, 09:36:52 am »
Quote
Hello! Sometime ago I made a little program that I called File Encoding Expert. It compares three ways of detecting the encoding of a file: using the Charset Detector library,  using a Lazarus function, using a Delphi function (previously exported in a DLL). It was compiled and tested under Windows only. I post it here: maybe it could interest someone, or maybe someone could help me to improve it.  :)

http://www.eschecs.fr/fichiers/fileencodingexpert.zip

Of course we will have a look!
For today, though, I am booked getting ICU-c58 to work ;)

I expect that to be the fastest and most accurate, but I am often mistaken.
We also need a solution that basically works "on the fly".

The nice thing about this code, however, and a huge compliment to the author, Nikolaj Yakowlew, is that it is 100% Object Pascal and 100% portable.
« Last Edit: November 09, 2016, 10:13:54 am by Thaddy »

balazsszekely

  • Guest
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #27 on: November 09, 2016, 11:34:54 am »
From now on the library is loaded dynamically. No more linking errors. It works on Linux, Windows and Mac. I'm still not sure about the detection success rate though.
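
For readers who have not used run-time loading before, here is a minimal sketch of the idea using FPC's dynlibs unit. It is not the code from the attached project: the exported name 'csd_Reset' and its parameterless signature are assumptions based on the calls shown earlier in the thread, and the library is assumed to sit next to the executable under the name "chsdet" plus the platform suffix. The sketch only resolves the symbol and reports the result; it does not call into the library.

```pascal
program dynloaddemo;
{$mode objfpc}{$H+}
uses sysutils, dynlibs;

type
  // Assumed signature of the exported routine (hypothetical here).
  TResetProc = procedure; {$ifdef windows}stdcall{$else}cdecl{$endif};

var
  Lib: TLibHandle;
  ResetProc: TResetProc;
  LibPath: string;
begin
  // Assumption: the library lies in the same folder as the executable.
  LibPath := ExtractFilePath(ParamStr(0)) + 'chsdet.' + SharedSuffix;
  Lib := LoadLibrary(LibPath);
  if Lib = NilHandle then
  begin
    writeln('Could not load ', LibPath);
    Halt(1);
  end;
  try
    // Resolve the symbol by name; the result is nil if it is not exported.
    Pointer(ResetProc) := GetProcAddress(Lib, 'csd_Reset');
    if Assigned(ResetProc) then
      writeln('csd_Reset resolved')
    else
      writeln('csd_Reset not found in ', LibPath);
  finally
    UnloadLibrary(Lib);
  end;
end.
```

Because the library is only looked up when the program runs, a missing or misnamed file shows up as a run-time message instead of a linker error.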

balazsszekely

  • Guest
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #28 on: November 09, 2016, 11:47:53 am »
I forgot to attach the zip file. Steps:

   1. Open "chsdet.lpi" from the "src" dir and build it
   2. Copy the library (chsdet.dll/libchsdet.so/libchsdet.dylib) to the demo folder
   3. Under Linux/OS X rename it to "chsdet.so" or "chsdet.dylib" respectively, or change the source code accordingly
   4. Build and run the demo project

Please note: Under OS X you have to move the library into the bundle.


Thaddy

  • Hero Member
  • *****
  • Posts: 18786
  • To Europe: simply sell USA bonds: dollar collapses
Re: GuessEncoding and CovertEncoding, with SDFDataset, on Linux
« Reply #29 on: November 09, 2016, 12:50:24 pm »
Quote
From now on the library is loaded dynamically. No more linking error. It works both on Linux, Windows, Mac. I'm still not sure about the detection success though.

I am, with my translation. It is pretty good. But we need a set of files to check against...

As documented, it needs a decent sample (1K)
« Last Edit: November 09, 2016, 01:20:21 pm by Thaddy »

 
