Author Topic: Detect text files via magic number or...  (Read 6517 times)

edvard

  • Full Member
  • ***
  • Posts: 172
Detect text files via magic number or...
« on: January 14, 2015, 07:08:26 am »
Is there a simple way in FreePascal to detect whether a file is plain text or not? (Bonus for detecting other file types.) I've googled just about every variant of "pascal detect file type", read the Wikipedia article on 'Magic Numbers', and read the documentation for FreePascal file handling. So far, I have come up empty.

Any takers?
All children left unattended will be given a mocha and a puppy.

Arch (though I may go back to Debian)| FreePascal 3.2.2 + Lazarus 2.2.4, GTK2+ and Qt.  Mostly Qt...

CM630

  • Hero Member
  • *****
  • Posts: 1579
  • I'm not sure I understand you.
    • http://sourceforge.net/u/cm630/profile/
Re: Detect text files via magic number or...
« Reply #1 on: January 14, 2015, 07:35:20 am »
1. First of all, check the file signature (the first four bytes of the file, e.g. a FourCC code). It might also tell you the encoding of the file if it is a text file, for example through a byte-order mark.
2. You could search the file for non-printable characters, i.e. those whose ASCII code is less than 32, other than tab, line feed and carriage return. If the file contains too many of them, it is probably not a text file.
UTF-8 multi-byte sequences only use bytes of 128 and above, so they do not trip this check, but UTF-16 text contains zero bytes, so it might get a little trickier.
3. You could search the file for specific tags of the kind found in HTML, RTF, DOC, etc.

Also, you should check UTF-8 Tools by Theo.
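
A rough sketch of points 1 and 2, in case it helps. The names LooksLikeText, SampleSize and MaxControlRatio are made up for this example, and the 4096-byte sample and 5% threshold are arbitrary assumptions, not established values. It only recognises the UTF-8 byte-order mark, and it will report UTF-16 files as binary because of their zero bytes.

```pascal
program TextGuess;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  SampleSize = 4096;       // assumption: number of leading bytes to inspect
  MaxControlRatio = 0.05;  // assumption: tolerated share of control bytes

// Returns an educated guess: True when the sampled bytes look like text.
function LooksLikeText(const FileName: string): Boolean;
var
  Stream: TFileStream;
  Buf: array[0..SampleSize - 1] of Byte;
  Count, i, Suspicious: Integer;
begin
  // Raises an exception if the file cannot be opened.
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
  try
    Count := Stream.Read(Buf, SizeOf(Buf));
  finally
    Stream.Free;
  end;

  // An empty file has no binary content, so treat it as text.
  if Count = 0 then
    Exit(True);

  // Point 1: a UTF-8 byte-order mark is a strong hint for text.
  if (Count >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF) then
    Exit(True);

  // Point 2: count bytes below 32, ignoring tab, line feed and carriage return.
  Suspicious := 0;
  for i := 0 to Count - 1 do
  begin
    if Buf[i] = 0 then
      Exit(False);  // a zero byte is very unusual in 8-bit text
    if (Buf[i] < 32) and not (Buf[i] in [9, 10, 13]) then
      Inc(Suspicious);
  end;
  Result := Suspicious <= Count * MaxControlRatio;
end;

begin
  if ParamCount < 1 then
  begin
    WriteLn('Usage: textguess <file>');
    Halt(1);
  end;
  if LooksLikeText(ParamStr(1)) then
    WriteLn('probably text')
  else
    WriteLn('probably binary');
end.
```

Only the start of the file is sampled, so a file that begins with a text header and continues with binary data will still fool it.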
« Last Edit: January 14, 2015, 07:39:58 am by CM630 »
Lazarus 4.4 32 bit (sometimes 64 bit); FPC 3.2.2

derek.john.evans

  • Guest
Re: Detect text files via magic number or...
« Reply #2 on: January 14, 2015, 08:13:17 am »
The unit FileUtil has:

function FileIsText(const AFilename: string): boolean;
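
A minimal usage sketch, assuming the project depends on the Lazarus package that provides the unit. The uses clause follows the unit name given above; depending on the Lazarus version, FileIsText may be declared in LazFileUtils instead, so treat that line as an assumption to check against your installation.

```pascal
program CheckText;

{$mode objfpc}{$H+}

uses
  FileUtil;  // assumption: some Lazarus versions declare FileIsText in LazFileUtils

begin
  if ParamCount < 1 then
  begin
    WriteLn('Usage: checktext <file>');
    Halt(1);
  end;
  // FileIsText returns a guess based on the file content, not a guarantee.
  if FileIsText(ParamStr(1)) then
    WriteLn('looks like text')
  else
    WriteLn('does not look like text');
end.
```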

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: Detect text files via magic number or...
« Reply #3 on: January 14, 2015, 09:48:48 am »
I expect you realise that all plain text detection algorithms produce educated guesses, rather than 100% certainties.

edvard

  • Full Member
  • ***
  • Posts: 172
Re: Detect text files via magic number or...
« Reply #4 on: January 15, 2015, 03:08:07 am »
Quote from: derek.john.evans
The unit FileUtil has:

function FileIsText(const AFilename: string): boolean;

Damn!  Missed that one.  Thanks!  :D

Quote from: howardpc
I expect you realise that all plain text detection algorithms produce educated guesses, rather than 100% certainties.

Exactly.  That's why downstream gets a good ol' Try..Except.   8-)