The Linux file command does a very good job of recognising file types and gives very fine-grained results. The diff tool is also able to tell binary files from text files.
A quick-and-dirty way is to look for a NUL character (a zero byte) in the first kilobyte or two of the file. As long as you're not worried about UTF-16 or UTF-32, no text file should ever contain a NUL.
Update: According to the diff manual, this is exactly what diff does.
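The NUL check described above can be sketched in shell. This is only a rough illustration: the function name looks_like_text and the 2048-byte window are assumptions, and head -c is not strictly POSIX, although GNU and BSD head both support it.

```sh
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when no NUL byte appears
# in the first 2048 bytes of the file given as $1.
looks_like_text() {
    # Unreadable or missing files are reported with a distinct status.
    [ -r "$1" ] || return 2
    # Delete every byte except NUL from the leading chunk, then count what is left.
    nul_count=$(head -c 2048 -- "$1" | tr -dc '\000' | wc -c)
    # Arithmetic expansion strips the padding some wc implementations add.
    [ "$((nul_count))" -eq 0 ]
}
```

Note that an empty file contains no NUL and is therefore reported as text by this sketch, and that UTF-16 or UTF-32 files will be reported as binary.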
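You could try running the strings yourfile command and comparing the size of its output with the file size. I'm not totally sure, but if they are the same, the file is most likely a text file. Keep in mind that strings only prints runs of at least four printable characters by default, so shorter lines are dropped and the sizes can differ slightly even for plain text.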
This approach uses the same criteria as grep in determining whether a file is binary or text:
is_text_file() {
    grep -qI '.' "$1"
}
-q   Quiet; exit immediately with zero status if any match is found.
-I   Process a binary file as if it did not contain matching data.
'.'  Match any single character. All files (except an empty file) will match this pattern.

Commands like less and grep detect it quite easily (and fast). You can have a look at their source.
file is still the command you want. Any file that is text (according to its heuristics) will include the word "text" in the output of file; anything that is binary will not include the word "text".
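As a minimal sketch of that rule, the output of file can be matched against the word "text". The function name is_text_per_file is an assumption for illustration; -b is the brief mode of file, which leaves the filename out of the output.

```sh
#!/bin/sh
# Hypothetical helper: succeed when file(1) describes $1 with the word "text".
is_text_per_file() {
    # -b (brief) keeps the name of the file itself out of the matched output.
    case "$(file -b -- "$1")" in
        *text*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

For example, is_text_per_file notes.txt && echo "text" prints "text" only when file classifies notes.txt (a placeholder name) as some kind of text. An empty file is described as "empty" by file and so counts as non-text here.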
If you don't agree with the heuristics that file uses to determine text vs. not-text, then the question needs to be better specified, since text vs. non-text is an inherently vague question. For example, file does not identify a PGP public key block in ASCII as "text", but you might (since it is composed only of printable characters, even though it is not human-readable).
These days the term "text file" is ambiguous, because a text file can be encoded in ASCII, ISO-8859-*, UTF-8, UTF-16, UTF-32 and so on.
See here for how Subversion does it.