Author Topic: [Closed] How to find out the file content type?  (Read 2045 times)

loaded

  • Hero Member
  • *****
  • Posts: 825
[Closed] How to find out the file content type?
« on: April 17, 2023, 07:37:14 am »
Hi All
Is it possible to tell if a file is binary or just a text file without opening it?
I would be glad if anyone can share his experience on the subject. Respects.
« Last Edit: April 17, 2023, 12:43:44 pm by loaded »
Check out  loaded on Strava
https://www.strava.com/athletes/109391137

dbannon

  • Hero Member
  • *****
  • Posts: 2802
    • tomboy-ng, a rewrite of the classic Tomboy
Re: How to find out the file content type?
« Reply #1 on: April 17, 2023, 07:41:32 am »
Under Linux/Unix, the OS command 'file' gives you a pretty accurate answer for most files. Some time ago I looked for an equivalent Windows command, without success.

Davo
Lazarus 3, Linux (and reluctantly Win10/11, OSX Monterey)
My Project - https://github.com/tomboy-notes/tomboy-ng and my github - https://github.com/davidbannon

Scoops

  • Full Member
  • ***
  • Posts: 100
Re: How to find out the file content type?
« Reply #2 on: April 17, 2023, 07:53:35 am »
Maybe the file type headers (file signatures) can point you somewhere. Look for newer links; I didn't check.


https://en.wikipedia.org/wiki/List_of_file_signatures

KodeZwerg

  • Hero Member
  • *****
  • Posts: 2079
  • Fifty shades of code.
    • Delphi & FreePascal
Re: How to find out the file content type?
« Reply #3 on: April 17, 2023, 08:25:38 am »
Hi All
Is it possible to tell if a file is binary or just a text file without opening it?
I would be glad if anyone can share his experience on the subject. Respects.
Without opening? (...that would include no signature scanning...)
So the only possible solution is to compare the file's extension to the List of filename extensions.
(.txt, .rtf, .doc, .pdf, .asc, .json, .xml, .csv, .ini and many more I forgot to name here = might be text file, and the rest might be binary)
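
A minimal sketch of the extension approach, written in Python purely for illustration. The set `TEXT_EXTENSIONS` and the function name are hypothetical, and the set is only a small subset of the extensions listed above. The file is never opened, so the result is a guess about the name, not about the content.

```python
from pathlib import Path

# Hypothetical, deliberately incomplete set of extensions that usually
# indicate plain text. This is an assumption, not an authoritative list.
TEXT_EXTENSIONS = {".txt", ".asc", ".json", ".xml", ".csv", ".ini"}


def guess_is_text_by_extension(filename: str) -> bool:
    """Guess from the file name alone; the file itself is never opened."""
    return Path(filename).suffix.lower() in TEXT_EXTENSIONS
```

For example, `guess_is_text_by_extension("notes.TXT")` is true, while a file without an extension always falls into the "might be binary" group.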
« Last Edit: April 17, 2023, 09:22:53 am by KodeZwerg »

TRon

  • Hero Member
  • *****
  • Posts: 2538
Re: How to find out the file content type?
« Reply #4 on: April 17, 2023, 08:42:42 am »
Is it possible to tell if a file is binary or just a text file without opening it?
Nope.

Relying on an extension as suggested by KodeZwerg is flawed, not to mention impractical. In reality it does not tell you anything about the contents of the file itself.
« Last Edit: April 17, 2023, 08:47:12 am by TRon »

KodeZwerg

  • Hero Member
  • *****
  • Posts: 2079
  • Fifty shades of code.
    • Delphi & FreePascal
Re: How to find out the file content type?
« Reply #5 on: April 17, 2023, 09:22:26 am »
Relying on an extension as suggested by KodeZwerg is flawed, not to mention impractical. In reality it does not tell you anything about the contents of the file itself.
I totally agree with you: without opening and scanning the file, everything is just an assumption.

TRon

  • Hero Member
  • *****
  • Posts: 2538
Re: How to find out the file content type?
« Reply #6 on: April 17, 2023, 09:46:30 am »
Sorry if that came across as down-voting your answer KodeZwerg as that was not my intention.

It is just that legacy stuff like filename endings causes so many headaches that it ain't no fun, especially when you have an OS whose desktop interface hides filename extensions by default.

Because of ill intentions, for example, you can't even be 100% sure after a proper signature scan that a matching file actually contains valid content for the detected format.

Filename extensions are indeed only an indication/assumption and as such cause for many f*ck-ups.
« Last Edit: April 17, 2023, 09:50:35 am by TRon »

loaded

  • Hero Member
  • *****
  • Posts: 825
Re: How to find out the file content type?
« Reply #7 on: April 17, 2023, 10:28:04 am »
dbannon, Scoops, KodeZwerg and TRon thank you very much for your answers.
I guess I'm left with no choice but to look at the contents of the file.

440bx

  • Hero Member
  • *****
  • Posts: 4063
Re: How to find out the file content type?
« Reply #8 on: April 17, 2023, 10:49:11 am »
I guess I'm left with no choice but to look at the contents of the file.
I just wanted to mention that even inspecting the file does not always provide a guarantee of making the determination accurately.

For instance, it is possible, just by coincidence (isolated case), to have a file that is treated as a binary file by some program that is entirely made of bytes in the range 32..127 which would normally be thought of as text but is interpreted by the app in a "binary" way.

Basically, it is possible to determine that a file is _not_ text, but what looks like a text file may actually be a binary file (interpretation-wise).

Depending on the reason for determining the file type, keeping that possibility in mind may be important.
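
The asymmetry described above can be sketched in Python as a one-way test. The function names, the 4096-byte sample size, and the set of tolerated control bytes are assumptions made for illustration. A `True` result means the sample is very unlikely to be plain 8-bit text; a `False` result proves nothing. Note that UTF-16 text contains NUL bytes and would be reported as "not text" by this sketch.

```python
# Control bytes commonly found in plain text: tab, line feed, form feed,
# carriage return. Which ones to tolerate is an assumption of this sketch.
TOLERATED_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D}


def sample_is_not_text(sample: bytes) -> bool:
    """Return True if the sample contains bytes that plain text rarely has."""
    return any(b < 0x20 and b not in TOLERATED_CONTROLS for b in sample)


def file_is_not_text(path: str, sample_size: int = 4096) -> bool:
    """Read only the first sample_size bytes of the file and test them."""
    with open(path, "rb") as f:
        return sample_is_not_text(f.read(sample_size))
```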

(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: How to find out the file content type?
« Reply #9 on: April 17, 2023, 11:02:40 am »
Linux looks at multiple things to determine the file type. Some of the things it looks for don't work in Windows, but you can check the magic number. There are libraries to help you with that.
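
A minimal sketch of a magic number check in Python, without any library. The table holds only a handful of well-known signatures (see the Wikipedia list linked earlier in the thread for many more), and the names `SIGNATURES`, `identify_header` and `identify_file` are hypothetical.

```python
# A few well-known file signatures; far from complete.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"\xff\xd8\xff": "JPEG image",
}


def identify_header(header: bytes):
    """Return a format name if the header starts with a known signature."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None  # unknown signature; the file may still be binary


def identify_file(path: str, header_size: int = 16):
    """Open the file and inspect only its first header_size bytes."""
    with open(path, "rb") as f:
        return identify_header(f.read(header_size))
```

A match only says the file starts like that format; as noted earlier in the thread, it does not guarantee that the rest of the content is valid.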

KodeZwerg

  • Hero Member
  • *****
  • Posts: 2079
  • Fifty shades of code.
    • Delphi & FreePascal
Re: How to find out the file content type?
« Reply #10 on: April 17, 2023, 11:59:47 am »
For plain written ASCII files without Escape Codes you can take this as your byte ($00-$FF / 0-255) ruleset to compare against.

Those should be included:
$0D and $0A (dec 13 [Carriage Return] and dec 10 [Line Feed])
$20 to $7E (dec 32 to dec 126) - basic ASCII chars

And we continue with the extended ASCII...
Extended ASCII codes can differ a lot depending on the code page in use (very often it is Windows-1252, aka CP-1252), but again, that's not a rule.
Windows-1252 printable characters would be
$80 (dec 128)
$82 to $8C (dec 130 to dec 140)
$8E (dec 142)
$91 to $9C (dec 145 to dec 156)
$9E to $9F (dec 158 to dec 159)
$A1 to $AC (dec 161 to dec 172)
$AE to $FF (dec 174 to dec 255)

So as you can see, even for plain ASCII it isn't that easy to find a generic working solution.
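
The ruleset above could be sketched in Python like this. The byte ranges are copied from the lists in this post; the helper and function names are hypothetical. Tab ($09) is not in the lists, so this sketch rejects it; whether to allow it is left open.

```python
def _ranges(*pairs):
    """Build a set of byte values from inclusive (low, high) pairs."""
    allowed = set()
    for low, high in pairs:
        allowed.update(range(low, high + 1))
    return frozenset(allowed)


# CR, LF and the basic printable ASCII characters.
ASCII_TEXT = _ranges((0x0A, 0x0A), (0x0D, 0x0D), (0x20, 0x7E))

# Windows-1252 printable characters, as listed in the post.
CP1252_EXTRA = _ranges(
    (0x80, 0x80), (0x82, 0x8C), (0x8E, 0x8E), (0x91, 0x9C),
    (0x9E, 0x9F), (0xA1, 0xAC), (0xAE, 0xFF),
)


def matches_ruleset(data: bytes, allow_cp1252: bool = False) -> bool:
    """Return True if every byte of data is inside the allowed set."""
    allowed = (ASCII_TEXT | CP1252_EXTRA) if allow_cp1252 else ASCII_TEXT
    return all(b in allowed for b in data)
```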

And since you did not mention what you mean by "Text-Files", it becomes harder and harder ... (many "Text-Writing" programs use a binary format to save their documents)

loaded

  • Hero Member
  • *****
  • Posts: 825
Re: How to find out the file content type?
« Reply #11 on: April 17, 2023, 12:43:29 pm »
440bx and SymbolicFrank, thank you very much for your answers.

And since you did not mention what you mean by "Text-Files", it becomes harder and harder ...
By text files, I mean files that are read line by line, such as files in DXF format.
(many "Text-Writing" programs use a binary format to save their documents)
In accordance with the general opinion, I will implement the option to open files and read their headers.
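
For the DXF case, reading the header could look roughly like the Python sketch below. It assumes that a binary DXF file starts with the 22-byte sentinel `AutoCAD Binary DXF` followed by CR, LF, SUB and NUL, and that an ASCII DXF file starts with a numeric group code on its first line. The function name and the returned labels are hypothetical, and the ASCII test is a weak heuristic, not a validation of the format.

```python
# Sentinel assumed to open a binary DXF file (22 bytes in total).
BINARY_DXF_SENTINEL = b"AutoCAD Binary DXF\r\n\x1a\x00"


def classify_dxf_header(header: bytes) -> str:
    """Classify the first bytes of a file as binary DXF, ASCII DXF or unknown."""
    if header.startswith(BINARY_DXF_SENTINEL):
        return "binary DXF"
    lines = header.splitlines()
    # ASCII DXF is read line by line: a numeric group code, then its value.
    if len(lines) >= 2 and lines[0].strip().isdigit():
        return "probably ASCII DXF"
    return "unknown"
```

Passing the first few dozen bytes of the file, read in binary mode, is enough for this check.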

Thank you to everyone who took the time to reply. Respects.

 
