How to Correlate Two Audio Events (Detect if they are Similar) in Python

Submitted by ∥☆過路亽.° on 2020-06-16 19:05:20

Question


For my project I have to detect whether two audio files are similar and whether the first audio file is contained in the second. My problem is that I tried to use librosa and numpy.correlate, but I don't know if I'm doing it the right way. How can I detect whether one audio file is contained in another?

import librosa
import numpy
long_audio_series, long_audio_rate = librosa.load("C:\\Users\\Jerry\\Desktop\\long_file.mp3")
short_audio_series, short_audio_rate = librosa.load("C:\\Users\\Jerry\\Desktop\\short_file.mka")

for long_stream_id, long_stream in enumerate(long_audio_series):
    for short_stream_id, short_stream in enumerate(short_audio_series):
        print(numpy.correlate(long_stream, short_stream))

Answer 1:


Simply comparing the raw signals long_audio_series and short_audio_series probably won't work. What I'd recommend instead is audio fingerprinting: essentially a poor man's version of what Shazam does. There are of course the patent and the paper, but you might want to start with this very readable description. Its central concept is the constellation map (CM), a 2D map of spectral peaks over time.

If you don't want to scale to very many songs, you can skip the whole hashing part and concentrate on peak finding.

So what you need to do is:

  1. Create a power spectrogram (easy with librosa.core.stft).
  2. Find local peaks in all your files (can be done with scipy.ndimage.maximum_filter) to create CMs, i.e., 2D images containing only the peaks. The resulting CM is typically binary: 1 where there is a peak, 0 elsewhere.
  3. Slide your query CM (based on short_audio_series) over each of your database CMs (based on long_audio_series). For each time step, count how many "stars" (i.e., 1s) align, and store the count along with the slide offset (essentially the position of the short audio within the long audio).
  4. Pick the max count and return the corresponding short audio and position in the long audio. You will have to convert frame numbers back to seconds.
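The first two steps can be sketched as follows. This is an untested illustration, not the article's exact procedure: scipy.signal.stft stands in for librosa.core.stft so the sketch is self-contained, and the neighborhood size and median threshold are arbitrary choices you would have to tune.

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import stft

def constellation_map(y, sr, n_fft=2048, hop=512, neighborhood=(15, 15)):
    # Step 1: power spectrogram (shape: frequency bins x time frames)
    _, _, Z = stft(y, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    S = np.abs(Z) ** 2
    # Step 2: a bin is a peak if it equals the maximum of its local
    # neighborhood; the median threshold suppresses spurious "peaks"
    # in near-silent regions
    peaks = (S == maximum_filter(S, size=neighborhood)) & (S > np.median(S))
    # transpose so dim 0 is the time frame and dim 1 the frequency bin,
    # matching the sliding code below
    return peaks.T.astype(np.uint8)
```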

Example for the "slide" (untested sample code):

import numpy as np

scores = {}
cm_short = ...  # 2d constellation map for the short audio
cm_long = ...   # 2d constellation map for the long audio
# we assume that dim 0 is the time frame
# and dim 1 is the frequency bin
# both CMs contain only 0s and 1s
frames_short = cm_short.shape[0]
frames_long = cm_long.shape[0]
for offset in range(frames_long - frames_short + 1):  # +1 so the last full overlap is also checked
    cm_long_excerpt = cm_long[offset:offset+frames_short]
    score = np.sum(np.multiply(cm_long_excerpt, cm_short))
    scores[offset] = score
# TODO: find the highest score in "scores" and
# convert its offset back to seconds
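The final TODO can be filled in as sketched below, under the assumption that the CM frames come from an STFT with the given sample rate and hop length (the defaults here match librosa's, sr=22050 and hop_length=512). librosa.frames_to_time performs the same conversion.

```python
def best_match(scores, sr=22050, hop_length=512):
    # pick the slide offset with the highest number of aligned "stars"
    best_offset = max(scores, key=scores.get)
    # convert the frame offset back to seconds: frames * hop / sample rate
    return best_offset, best_offset * hop_length / sr
```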

Now, if your database is large, this will lead to way too many comparisons, and you will also have to implement the hashing scheme described in the article linked above.

Note that the described procedure only matches identical recordings, though it tolerates noise and slight distortion. If that is not what you want, define "similarity" more precisely, because it could mean all kinds of things (drum patterns, chord sequences, instrumentation, ...).

A classic, DSP-based way to find similarities for such features is the following: extract the appropriate feature for short frames (e.g., 256 samples) and then compute frame-to-frame similarity. For example, if harmonic content interests you, extract chroma vectors and calculate a distance between them, e.g., cosine distance. When you compute the similarity of each frame in your database signal against every frame in your query signal, you end up with something similar to a self-similarity matrix (SSM) or recurrence matrix (RM). Diagonal lines in the SSM/RM usually indicate similar sections.
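As a sketch of that last idea: given frame-wise feature matrices (e.g., chroma vectors, which librosa.feature.chroma_stft can produce), the cosine similarity of every database frame against every query frame yields the matrix described above. The function name and the eps guard against zero-norm frames are my own choices.

```python
import numpy as np

def similarity_matrix(feats_db, feats_query, eps=1e-8):
    # feats_db: (n_db_frames, d), feats_query: (n_query_frames, d)
    # normalize each frame's feature vector to unit length
    A = feats_db / (np.linalg.norm(feats_db, axis=1, keepdims=True) + eps)
    B = feats_query / (np.linalg.norm(feats_query, axis=1, keepdims=True) + eps)
    # entry (i, j) is the cosine similarity of db frame i and query frame j;
    # diagonal stripes of high values indicate matching sections
    return A @ B.T
```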



Source: https://stackoverflow.com/questions/57317733/how-to-correlate-two-audio-events-detect-if-they-are-similar-in-python
