How to efficiently identify a binary file

醉梦人生 2021-02-10 19:36

What's the most efficient way to identify a binary file? I would like to extract some kind of signature from a binary file and use it to compare that file with others.

The bru

4 Answers
  • 2021-02-10 19:57

    An approach I found effective for this sort of thing was to calculate two SHA-1 hashes: one for the first block of the file (I arbitrarily picked 512 bytes as the block size) and one for the whole file. I stored both hashes along with the file size. To identify a file, I first compared the file length. If the lengths matched, I compared the hash of the first block, and only if that also matched did I compare the hash of the entire file. The first two tests quickly weeded out a lot of non-matching files.
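
The tiered check described in this answer could be sketched roughly as follows. This is only an illustration under assumptions: the class `FileSignature`, its fields, and its methods are hypothetical names, and the 512-byte block size is the arbitrary value the answer mentions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper: class, field, and method names are illustrative only.
final class FileSignature {
    static final int BLOCK_SIZE = 512; // the answer's arbitrarily chosen block size

    final long size;
    final byte[] firstBlockHash;
    final byte[] fullHash;

    private FileSignature(long size, byte[] firstBlockHash, byte[] fullHash) {
        this.size = size;
        this.firstBlockHash = firstBlockHash;
        this.fullHash = fullHash;
    }

    // SHA-1 over at most `limit` bytes from the start of the file.
    private static byte[] sha1(Path file, long limit)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        long remaining = limit;
        try (InputStream in = Files.newInputStream(file)) {
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n == -1) break; // end of file reached before the limit
                md.update(buf, 0, n);
                remaining -= n;
            }
        }
        return md.digest();
    }

    // Build the stored record: size, first-block hash, whole-file hash.
    // For simplicity this sketch reads the file twice.
    static FileSignature of(Path file) throws IOException, NoSuchAlgorithmException {
        return new FileSignature(Files.size(file),
                sha1(file, BLOCK_SIZE), sha1(file, Long.MAX_VALUE));
    }

    // Cheapest test first; the whole-file hash is computed only if needed.
    boolean matches(Path candidate) throws IOException, NoSuchAlgorithmException {
        if (Files.size(candidate) != size) return false;
        if (!Arrays.equals(sha1(candidate, BLOCK_SIZE), firstBlockHash)) return false;
        return Arrays.equals(sha1(candidate, Long.MAX_VALUE), fullHash);
    }
}
```

With a stored record `sig = FileSignature.of(known)`, a call such as `sig.matches(candidate)` reads the whole candidate file only when both its length and its first block already agree.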

  • 2021-02-10 20:05

    Have you considered header identification? If you control the format of your files, reserving a type tag at the start of each file makes identification fast and reliable. A single byte can distinguish up to 256 file types ;)
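
A one-byte header of this kind might look like the sketch below. It assumes you control the file format; the class `HeaderTag` and the two type codes are made up for illustration and do not come from the answer.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical one-byte header scheme; the type codes below are invented.
final class HeaderTag {
    static final int TYPE_CONFIG = 0x01;
    static final int TYPE_SNAPSHOT = 0x02;

    // Write the type tag as the first byte, followed by the payload.
    static void write(Path file, int typeTag, byte[] payload) throws IOException {
        if (typeTag < 0 || typeTag > 0xFF) {
            throw new IllegalArgumentException("tag must fit in one byte");
        }
        try (OutputStream out = Files.newOutputStream(file)) {
            out.write(typeTag);
            out.write(payload);
        }
    }

    // Read only the first byte of the file; the result is in the range 0..255.
    static int readTag(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            int b = in.read();
            if (b == -1) throw new EOFException("empty file has no header");
            return b;
        }
    }
}
```

Identification then costs a single one-byte read per file, but it only tells you the declared type, not whether two files have the same content.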

  • 2021-02-10 20:10

    That's what hashing is for. See MessageDigest.

    Note that if your file is too big to be read into memory, that's OK, because you can feed the file to the hash function in chunks. MD5 and SHA-1, for example, process their input in 512-bit blocks.

    Also, two files with the same hash aren't necessarily identical (although that is very rare), but two identical files necessarily have the same hash.
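
Feeding a file to `MessageDigest` in chunks could look roughly like this. The class `StreamHasher`, the method name, and the 64 KiB buffer size are assumptions made for the sketch; `MessageDigest.update` accepts chunks of any length and handles the internal blocking itself.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; the name and buffer size are illustrative choices.
final class StreamHasher {
    // Hash a stream of any length while holding only one chunk in memory.
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] chunk = new byte[64 * 1024];
        int n;
        while ((n = in.read(chunk)) != -1) {
            md.update(chunk, 0, n); // feed only the bytes actually read
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

For a file on disk, the stream would typically come from `Files.newInputStream(path)` inside a try-with-resources block.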

  • 2021-02-10 20:10

    The usual answer is to use MD5, but I'd like to suggest that MD5 collisions are too easy to construct for it to be used in modern applications: http://www.mscs.dal.ca/~selinger/md5collision/

    SHA-1 replaced MD5 over a decade ago.

    NIST recommended in 2005 that SHA-2 be used in place of SHA-1 by 2010, because of work that had demonstrated collisions in reduced variants of SHA-1. (That was pretty good foresight: finding a SHA-1 collision is now reported to take about 2^51 work, where an ideal 160-bit hash would require 2^80.)

    So please, based on what you're trying to do, and which other programs you may need to interoperate with, select among MD5 (please no), SHA-1 (I'd understand, but we can do better), and SHA-2 (pick me! pick me!).
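
In Java, the choice between these algorithms comes down to the name passed to `MessageDigest.getInstance`, as the small sketch below suggests. The class `DigestChoice` is a hypothetical name, and "SHA-256" is used here as one member of the SHA-2 family.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper showing that the algorithm is selected by name only.
final class DigestChoice {
    static byte[] digest(String algorithm, byte[] data) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance(algorithm).digest(data);
    }

    // Digest size in bytes for the given standard algorithm name.
    static int digestLength(String algorithm) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance(algorithm).getDigestLength();
    }
}
```

SHA-1 yields 20 bytes and SHA-256 yields 32 bytes, so whatever stores the signatures has to allow for the longer value when moving to SHA-2.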
