Determining binary/text file type in Java?

心在旅途 2020-12-02 16:46

Namely, how would you tell an archive (jar/rar/etc.) file from a textual (xml/txt, encoding-independent) one?

10 Answers
  • 2020-12-02 17:34

    See http://en.wikipedia.org/wiki/Magic_number_(programming)
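    As a rough illustration of the magic-number idea, here is a minimal sketch that compares the first bytes of a file against two well-known signatures: `PK\x03\x04` for ZIP (and therefore JAR) and `Rar!\x1A\x07` for RAR. The class and method names are made up for this example, and the signature table is deliberately tiny. It assumes Java 11+ for `InputStream.readNBytes`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of any library.
class MagicNumberSniffer {
    // ZIP local file header, also used by JAR files: "PK\3\4".
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    // Leading bytes shared by RAR 4 and RAR 5 archives: "Rar!\x1A\x07".
    private static final byte[] RAR = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07};

    // True if data begins with the given signature.
    static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    // Classifies a buffer holding the first bytes of a file.
    static String sniff(byte[] head) {
        if (startsWith(head, ZIP)) return "zip/jar";
        if (startsWith(head, RAR)) return "rar";
        return "unknown";
    }

    // Reads up to 8 bytes from the file and classifies them.
    static String sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(8));
        }
    }
}
```

    Note that "unknown" here does not mean "text"; it only means the file matched none of the listed signatures.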

  • 2020-12-02 17:38

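    You could try DROID (Digital Record Object Identification), a Java-based file format identification tool from The National Archives (UK).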

  • 2020-12-02 17:40

    If the file consists only of the bytes 0x09 (tab), 0x0A (line feed), 0x0C (form feed), 0x0D (carriage return), or 0x20 through 0x7E, then it's probably ASCII text.

    If the file contains any other ASCII control character (0x00 through 0x1F, excluding the four listed above), then it's probably binary data.
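    The two byte-range rules above could be sketched roughly as follows. The class and method names are hypothetical, and the code assumes you have already read a sample of the file (for example the first few kilobytes) into a byte array.

```java
// Hypothetical helper for illustration; operates on a sample of the file's bytes.
class AsciiTextHeuristic {
    // True if the byte value is tab, line feed, form feed, or carriage return.
    private static boolean isTextWhitespace(int v) {
        return v == 0x09 || v == 0x0A || v == 0x0C || v == 0x0D;
    }

    // True if every byte is text whitespace or printable ASCII (0x20..0x7E).
    static boolean looksLikeAsciiText(byte[] sample) {
        for (byte b : sample) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (!isTextWhitespace(v) && !(v >= 0x20 && v <= 0x7E)) {
                return false;
            }
        }
        return true;
    }

    // True if the sample contains a control byte (0x00..0x1F) other than the four whitespace bytes.
    static boolean hasBinaryControlByte(byte[] sample) {
        for (byte b : sample) {
            int v = b & 0xFF;
            if (v < 0x20 && !isTextWhitespace(v)) {
                return true;
            }
        }
        return false;
    }
}
```

    A sample for which both methods return false (bytes above 0x7E but no stray control bytes) is the undecided case: it may be text in a non-ASCII encoding.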

    UTF-8 text follows a very specific pattern for any bytes with the high-order bit set, but fixed-width encodings like ISO-8859-1 do not. UTF-16 text frequently contains the null byte (0x00), but typically only in every other position.
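    One way to test the UTF-8 pattern in Java, rather than hand-checking lead and continuation bytes, is to run the sample through a strict `CharsetDecoder` that reports malformed input instead of replacing it. This is only a sketch; `Utf8Check` and `isValidUtf8` are names invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration.
class Utf8Check {
    // True if the whole sample decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] sample) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(sample));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for invalid or truncated byte sequences.
            return false;
        }
    }
}
```

    Two caveats: a sample cut off in the middle of a multi-byte character is reported as invalid, and pure-ASCII binary headers (including control bytes such as 0x00) are still well-formed UTF-8, so this check only makes sense in combination with the control-character test.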

    You'd need a weaker heuristic for anything else.

  • 2020-12-02 17:42

    There's no guaranteed way, but here are a couple of possibilities:

    1. Look for a header on the file. Unfortunately, headers are format-specific, so while you might be able to find out that it's a RAR file, you won't get the more generic answer of whether it's text or binary.

    2. Count the proportion of text-like bytes versus other bytes. Text files consist mostly of printable characters, while binary files, especially compressed ones such as rar and zip, tend to have all byte values more evenly represented.

    3. Look for a regularly repeating pattern of newlines.
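    The counting approach in point 2 might look something like the sketch below. It measures the fraction of bytes in a sample that are tab, line feed, carriage return, or printable ASCII. The names are hypothetical, and any threshold you pick (say 0.95) is an assumption to tune, not an established constant.

```java
// Hypothetical helper for illustration; operates on a sample of the file's bytes.
class TextRatioHeuristic {
    // Fraction of bytes that are tab, LF, CR, or printable ASCII; 1.0 for an empty sample.
    static double textRatio(byte[] sample) {
        if (sample.length == 0) {
            return 1.0;
        }
        int textLike = 0;
        for (byte b : sample) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v == 0x09 || v == 0x0A || v == 0x0D || (v >= 0x20 && v <= 0x7E)) {
                textLike++;
            }
        }
        return (double) textLike / sample.length;
    }

    // True if at least the given fraction of the sample is text-like.
    static boolean probablyText(byte[] sample, double threshold) {
        return textRatio(sample) >= threshold;
    }
}
```

    Because it only counts ASCII, this will misjudge text that is mostly non-ASCII (UTF-16, or CJK text in UTF-8), so it does not fully meet the "encoding-independent" requirement on its own.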
