
Author Topic: [SOLVED] Review request: Convert UTF8 file to ANSI/system codepage  (Read 6296 times)

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
[SOLVED] Review request: Convert UTF8 file to ANSI/system codepage
« on: September 22, 2013, 03:57:40 pm »
Argh. Unicode is catching up even with me ;)

For a file import program I currently accept ANSI files. However, I should also support UTF16 with BOM and, for completeness, UTF8 with BOM.

I'm going to cheat and convert everything to ANSI right now... and look at implementing the required Unicode support for intermediate storage in TDBF and export to CSV, Excel, Access, Firebird, SQLite, TeX, RTF... later.

For UTF8 with BOM, with FPC trunk, I've got this conversion code:
Code: [Select]
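const
  UTF8BOM: string = #$EF#$BB#$BF;
...
var
  Count:integer;
  FileBegin: string[3];
  InStream: TFileStream;
  OutStream: TMemoryStream;
  TempString: RawByteString;
  UTF8Content: UTF8String;
...
// after detecting the UTF8 BOM in InStream
      SetLength(UTF8Content,InStream.Size-3); //Input file minus BOM
      InStream.Seek(3, soBeginning);
      if Length(UTF8Content)>0 then //UTF8Content[1] does not exist for a BOM-only file
      begin
        // ReadBuffer raises an exception on a short read; Read would silently return fewer bytes
        InStream.ReadBuffer(UTF8Content[1], Length(UTF8Content));
        TempString:=UTF8ToAnsi(UTF8Content); //or use lazutf8.UTF8ToSys
        // WriteBuffer writes only the converted bytes; WriteAnsiString would put a length field in front of them
        OutStream.WriteBuffer(TempString[1], Length(TempString));
      end;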

Is this about the right code? Am I on the right track to implement similar code for UTF16BE with BOM and UTF16LE with BOM or are there easier ways?
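For the UTF16 variants, a rough sketch of how the BOM check and the conversion might look, using only the system unit. The names TBomKind, DetectBom and Utf16BytesToUtf8 are made up for this example and do not come from any library; the sketch converts to UTF8 via UTF8Encode rather than to ANSI, and it assumes the whole file has already been read into a RawByteString.

```pascal
program BomSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical enum for this sketch only
  TBomKind = (bkNone, bkUTF8, bkUTF16LE, bkUTF16BE);

// Looks at the first bytes of Data and reports which BOM, if any, is present.
function DetectBom(const Data: RawByteString): TBomKind;
begin
  Result := bkNone;
  if (Length(Data) >= 3) and (Data[1] = #$EF) and (Data[2] = #$BB) and (Data[3] = #$BF) then
    Result := bkUTF8
  else if (Length(Data) >= 2) and (Data[1] = #$FF) and (Data[2] = #$FE) then
    Result := bkUTF16LE
  else if (Length(Data) >= 2) and (Data[1] = #$FE) and (Data[2] = #$FF) then
    Result := bkUTF16BE;
end;

// Converts UTF-16 bytes (BOM already stripped) to UTF-8.
// A trailing odd byte is ignored.
function Utf16BytesToUtf8(const Bytes: RawByteString; BigEndian: Boolean): UTF8String;
var
  Wide: UnicodeString;
  i: Integer;
  B1, B2: Byte;
begin
  SetLength(Wide, Length(Bytes) div 2);
  for i := 1 to Length(Wide) do
  begin
    B1 := Ord(Bytes[2 * i - 1]); // first byte of the code unit
    B2 := Ord(Bytes[2 * i]);     // second byte of the code unit
    if BigEndian then
      Wide[i] := WideChar((B1 shl 8) or B2)
    else
      Wide[i] := WideChar((B2 shl 8) or B1);
  end;
  Result := UTF8Encode(Wide);
end;

var
  Sample: RawByteString;
begin
  // 'A' followed by U+00E9, encoded as UTF-16LE with BOM
  Sample := #$FF#$FE#$41#$00#$E9#$00;
  if DetectBom(Sample) = bkUTF16LE then
    WriteLn('UTF-8 byte length: ',
      Length(Utf16BytesToUtf8(Copy(Sample, 3, MaxInt), False)));
end.
```

One caveat with this approach: the UTF32LE BOM also starts with FF FE, so a file in that encoding would be misdetected as UTF16LE here.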

Thanks a lot.
« Last Edit: September 22, 2013, 06:00:10 pm by BigChimp »
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

BigChimp

Re: [SOLVED] Review request: Convert UTF8 file to ANSI/system codepage
« Reply #1 on: September 22, 2013, 06:02:08 pm »
I gave in and went Lazarus ;)... and just converted to UTF8. I'll just have to deal with the rest of the chain later...

Code: [Select]
uses lconvencoding;
...
procedure TScriptDumpReader.ConvertInputFile;
// Guesses encoding and converts to UTF8. Caches results
//
// Calling this procedure in multiple places will lead to duplicate conversions
// While this could be fixed, it's perhaps better to separate out converting/
// loading the file into a stream and letting the import code work on the stream
const
  GuessLimit=1000;
var
  InStream: TFileStream;
  GuessedSourceEncoding: string;
  OutStream: TMemoryStream;
  TempString: RawByteString;
begin
  if FConvertedFile='' then
  begin
    // The share mode must be OR'ed into the open mode; a third argument is taken as file rights
    InStream:=TFileStream.Create(FInputFile,fmOpenRead or fmShareDenyWrite);
    OutStream:=TMemoryStream.Create;
    try
      SetLength(TempString,InStream.Size);
      InStream.Seek(0,soBeginning);
      // Guard against empty files: TempString[1] does not exist then
      if Length(TempString)>0 then
        InStream.ReadBuffer(TempString[1],Length(TempString));
      // Copy returns the whole string when it is shorter than GuessLimit
      GuessedSourceEncoding:=GuessEncoding(Copy(TempString,1,GuessLimit));
      {$IFDEF DEBUG}
      FLog.Add('Input file ' + FInputFile + ' - guessed encoding: '+GuessedSourceEncoding);
      {$ENDIF}
      TempString:=ConvertEncoding(TempString,GuessedSourceEncoding,'UTF-8');
      // WriteBuffer writes just the converted bytes; WriteAnsiString would prepend a length field
      if Length(TempString)>0 then
        OutStream.WriteBuffer(TempString[1],Length(TempString));
      FConvertedFile:=Sysutils.GetTempFileName;
      OutStream.SaveToFile(FConvertedFile);
    finally
      InStream.Free;
      OutStream.Free;
    end;
  end;
end;

Any suggestions for improving this are of course welcome.
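One thing worth checking when writing converted text to a stream: TStream.WriteAnsiString stores the string length (a 4-byte integer) in front of the characters, while WriteBuffer stores only the bytes it is handed. A small sketch to compare the two on a throwaway string; the stream names Prefixed and Raw are just for this example.

```pascal
program LengthPrefixSketch;
{$mode objfpc}{$H+}

uses
  Classes;

var
  Prefixed, Raw: TMemoryStream;
  Data: AnsiString;
begin
  Data := 'abc';
  Prefixed := TMemoryStream.Create;
  Raw := TMemoryStream.Create;
  try
    // WriteAnsiString stores the length first, then the characters
    Prefixed.WriteAnsiString(Data);
    // WriteBuffer stores only the bytes it is given
    Raw.WriteBuffer(Data[1], Length(Data));
    WriteLn('WriteAnsiString: ', Prefixed.Size, ' bytes');
    WriteLn('WriteBuffer:     ', Raw.Size, ' bytes');
  finally
    Prefixed.Free;
    Raw.Free;
  end;
end.
```

If the file is later read back with ReadAnsiString the prefix is what you want; if it is read as plain text, the extra bytes end up at the start of the content.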

BTW: complete source:
https://bitbucket.org/reiniero/db2securityscript/src
directory OutputParser

 
