How do I uniquely identify the content of a media file in Python, not the metadata?

Question

I have a collection of media files, mostly music, most of them imported from CD many years ago. This collection has been transferred between different media players, filesystems, computers, etc., many times. In that process, some tracks have been accidentally duplicated. I'm also constantly trying to curate the metadata on these and get everything properly tagged, since when much of it was originally imported, I did not have fancy media playback software and did not even realize that the ID3 tags indicated that everything was just "Track %d" on the classic album "Album".

This creates a situation where I have some files with up-to-date metadata, but "duplicates" of the same media file that I'd like to delete, whose metadata has not been properly updated. Since the metadata is present within the file, the contents of these files now differ and tools like liten2 don't work.

My question is: is there a library I can use that will conveniently extract a uniquely identifying fingerprint (probably a SHA-1 hash, but that's not a hard requirement) of only the media content of the file, ignoring the metadata? If so, how do I use it?

Answer 1:

Echoprint is one free way to fingerprint audio by its content - i.e. it doesn't depend on metadata, nor on byte-exact data matches. Their FAQ has an entry "I want to deduplicate a big collection".

I think the core of it is not itself Python but a web API, though they provide the pyechonest library.

Answer 2:

You will probably need to dive a bit into the file format specifications of your audio files (mp3, avi, mpg, ogg, etc.). For MP3, this means discarding all ID3v2 metadata chunks: identify the chunks inside the file that actually encode audio, then hash only those chunks for comparison. Bear in mind that two files of the same track in different formats will not be recognized as the same file. Likewise, the same track stored twice in the same format but with, e.g., different bitrates will not match either.
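For MP3 specifically, the idea can be sketched roughly as follows. This is only a minimal illustration, not a complete tag parser: it assumes a single ID3v2 tag at the start of the file and, as an addition not mentioned above, also drops a 128-byte ID3v1 tag at the end if one is present. Other tag formats (APE, for instance) are not handled. The names `audio_payload` and `content_sha1` are made up for this example.

```python
import hashlib


def audio_payload(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag or a trailing ID3v1 tag."""
    start, end = 0, len(data)

    # An ID3v2 header is 10 bytes: b"ID3", 2 version bytes, 1 flags byte,
    # then a 4-byte "syncsafe" size (7 useful bits per byte).
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        start = 10 + size
        if data[5] & 0x10:  # footer flag: 10 more bytes follow the tag
            start += 10

    # An ID3v1 tag is the last 128 bytes of the file, starting with b"TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128

    return data[start:end]


def content_sha1(path: str) -> str:
    """SHA-1 of the file at path, ignoring ID3 tags (hypothetical helper)."""
    with open(path, "rb") as f:
        return hashlib.sha1(audio_payload(f.read())).hexdigest()
```

Two copies of the same encoded file that differ only in their ID3 tags should then yield the same `content_sha1` value, which can be used as a dictionary key to group duplicates.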

Answer 3:

How about (temporarily) converting the files to WAV format and comparing their hashes? The ID3 tags should be stripped off in the process. There are plenty of tools to do that, and embedding this procedure into a script should not be too difficult.
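One way to script this, as a rough sketch: it assumes the `ffmpeg` command-line tool is installed and on the PATH, and it decodes to raw 16-bit PCM on stdout instead of writing a temporary WAV file, so that no container header ends up in the hash. The function names are made up for this example.

```python
import hashlib
import subprocess


def ffmpeg_pcm_command(path):
    """Build an ffmpeg command that decodes the first audio stream of
    path to raw signed 16-bit little-endian PCM on stdout."""
    return ["ffmpeg", "-v", "error", "-i", path, "-map", "0:a:0",
            "-f", "s16le", "-acodec", "pcm_s16le", "-"]


def sha1_of_stream(stream, chunk_size=1 << 16):
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def decoded_sha1(path):
    """SHA-1 of the decoded audio samples of path (hypothetical helper)."""
    proc = subprocess.Popen(ffmpeg_pcm_command(path), stdout=subprocess.PIPE)
    try:
        result = sha1_of_stream(proc.stdout)
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError("ffmpeg failed on %r" % path)
    return result
```

Because the hash covers decoded samples, it only matches files produced from the same encoded audio; a different decoder or decoder version may also give different output, so all hashes should be computed with the same tool.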

Source: https://stackoverflow.com/questions/13784993/how-do-i-uniquely-identify-the-content-of-a-media-file-in-python-not-the-metada

Tags

python

algorithm

audio

mp3

flac