Find sound effect inside an audio file

攒了一身酷 2021-01-15 03:12

I have a load of 3-hour MP3 files, and roughly every 15 minutes a distinct 1-second sound effect is played, which signals the beginning of a new chapter.

Is it possible to

4 Answers
  • 2021-01-15 03:46

    This is an Audio Event Detection problem. If the sound is always the same and there are no other sounds at the same time, it can probably be solved with a Template Matching approach, at least as long as there are no other similar-sounding sounds that carry a different meaning.

    The simplest kind of template matching is to compute the cross-correlation between your input signal and the template.

    1. Cut out an example of the sound to detect (using Audacity). Take as much as possible, but avoid the start and end. Store this as a .wav file.
    2. Load the .wav template using librosa.load().
    3. Chop up the input file into a series of overlapping frames. The frame length should be the same as your template. This can be done with librosa.util.frame.
    4. Iterate over the frames, and compute the cross-correlation between each frame and the template using numpy.correlate.
    5. High values of cross-correlation indicate a good match. A threshold can be applied to decide what counts as an event, and the frame number can be used to calculate the time of the event.
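
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tested detector: it uses synthetic NumPy arrays in place of audio loaded with librosa.load(), the helper names (`normalized_xcorr`, `detect_events`) and the 0.7 threshold are made up for the example, and numpy.correlate with mode="valid" stands in for explicitly framing the input with a hop of one sample.

```python
import numpy as np

def normalized_xcorr(signal, template):
    """Cross-correlate template against every window of signal, scaled to roughly [-1, 1]."""
    signal = np.asarray(signal, dtype=float)
    template = np.asarray(template, dtype=float)
    template = template - template.mean()
    n = len(template)
    # Raw cross-correlation for every position where the template fits entirely.
    raw = np.correlate(signal, template, mode="valid")
    # Energy of each signal window (via a cumulative sum), used to scale the result.
    csum = np.concatenate(([0.0], np.cumsum(signal ** 2)))
    window_energy = csum[n:] - csum[:-n]
    return raw / (np.sqrt(window_energy) * np.linalg.norm(template) + 1e-12)

def detect_events(signal, template, sr, threshold=0.7):
    """Return event start times (seconds) where the score reaches the threshold."""
    score = normalized_xcorr(signal, template)
    above = np.flatnonzero(score >= threshold)
    events = []
    if above.size:
        # Indices further apart than one template length belong to different events.
        breaks = np.flatnonzero(np.diff(above) > len(template)) + 1
        for run in np.split(above, breaks):
            best = run[np.argmax(score[run])]  # strongest position within this event
            events.append(float(best) / sr)
    return events

# Synthetic stand-ins (assumptions): a short chirp as the template, noise as the recording.
sr = 8000
rng = np.random.default_rng(0)
t = np.arange(int(0.25 * sr)) / sr
template = np.sin(2 * np.pi * (440 * t + 2000 * t ** 2)) * np.hanning(len(t))
recording = 0.05 * rng.standard_normal(5 * sr)
for start_s in (1.0, 3.5):
    start = int(start_s * sr)
    recording[start:start + len(template)] += template

events = detect_events(recording, template, sr)
print(events)
```

Direct correlation like this costs time proportional to signal length times template length, so for a 3-hour file an FFT-based correlation or a coarser hop would probably be needed.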

    You should probably prepare some shorter test files which have both some examples of the sound to detect as well as other typical sounds.

    If the volume of the recordings is inconsistent you'll want to normalize that before running detection.
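
One possible way to do that normalization is to scale each recording to a common RMS level, sketched below. The function name `rms_normalize` and the target level of 0.1 are assumptions for the example; if the volume drifts within a single file, the same idea would have to be applied per window rather than to the whole file.

```python
import numpy as np

def rms_normalize(samples, target_rms=0.1):
    """Scale samples so that their root-mean-square level equals target_rms."""
    samples = np.asarray(samples, dtype=float)
    rms = np.sqrt(np.mean(samples ** 2))
    if rms == 0:
        return samples  # silent input: nothing to scale
    return samples * (target_rms / rms)

# The same waveform at two different volumes ends up at the same level.
wave = np.sin(np.linspace(0, 100, 8000))
quiet_n = rms_normalize(0.01 * wave)
loud_n = rms_normalize(0.8 * wave)
print(np.sqrt(np.mean(quiet_n ** 2)), np.sqrt(np.mean(loud_n ** 2)))
```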

    If cross-correlation in the time-domain does not work, you can compute the melspectrogram or MFCC features and cross-correlate that. If this does not yield OK results either, a machine learning model can be trained using supervised learning, but this requires labeling a bunch of data as event/not-event.

  • 2021-01-15 03:49

    Trying to directly match waveform samples in the time domain is not a good idea. The MP3 signal will preserve the perceptual properties, but it is quite likely that the phases of the frequency components will be shifted, so the sample values will not match.

    You could try matching the volume envelopes of your effect and your sample. This is less likely to be affected by the MP3 process.

    First, normalise your sample so the embedded effects are at the same level as your reference effect. Then construct new waveforms from the effect and the sample by taking the average of the peak values over time frames that are just short enough to capture the relevant features. Better still, use overlapping frames. Then use cross-correlation in the time domain on these envelopes.
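
A rough sketch of this envelope idea, under some assumptions: the envelope here is simply the peak absolute value of each overlapping frame (one reading of "peak values over time frames"), the frame and hop sizes are arbitrary, and the test signal is synthetic, with a phase-shifted carrier standing in for what MP3 encoding might do. The helper names are made up for the example.

```python
import numpy as np

def peak_envelope(samples, frame_len=256, hop=128):
    """Peak absolute amplitude of each overlapping frame."""
    mags = np.abs(np.asarray(samples, dtype=float))
    n_frames = 1 + (len(mags) - frame_len) // hop
    return np.array([mags[i * hop:i * hop + frame_len].max() for i in range(n_frames)])

def sliding_correlation(x, t):
    """Pearson correlation between t and every same-length window of x."""
    t = t - t.mean()
    n = len(t)
    num = np.correlate(x, t, mode="valid")
    c1 = np.concatenate(([0.0], np.cumsum(x)))
    c2 = np.concatenate(([0.0], np.cumsum(x ** 2)))
    win_sum = c1[n:] - c1[:-n]
    win_sq = c2[n:] - c2[:-n]
    # Sum of squared deviations from the mean, for each window.
    win_dev = np.maximum(win_sq - win_sum ** 2 / n, 0.0)
    return num / (np.sqrt(win_dev) * np.linalg.norm(t) + 1e-12)

sr, hop = 8000, 128
rng = np.random.default_rng(1)
t = np.arange(sr // 2) / sr
shape = np.exp(-((t - 0.10) / 0.03) ** 2) + 0.6 * np.exp(-((t - 0.35) / 0.05) ** 2)
effect = shape * np.sin(2 * np.pi * 600 * t)
# Same envelope, different carrier phase: the raw sample values no longer match.
in_recording = shape * np.sin(2 * np.pi * 600 * t + 1.3)
recording = 0.02 * rng.standard_normal(5 * sr)
start = 2 * sr
recording[start:start + len(effect)] += in_recording

scores = sliding_correlation(peak_envelope(recording, hop=hop),
                             peak_envelope(effect, hop=hop))
best_frame = int(np.argmax(scores))
found_time = best_frame * hop / sr
print(round(found_time, 3))
```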

    If this does not work, you could analyze each frame using an FFT, which gives you a feature vector for each frame. You then try to find matches of the effect's sequence of feature vectors within the sample. This is similar to jonnor's suggestion (https://stackoverflow.com/users/1967571/jonnor). MFCC is used in speech recognition, but since you are not detecting speech, FFT is probably OK.
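
The FFT variant might look something like the sketch below. It is only an illustration: the spectra come from numpy.fft.rfft on Hann-windowed frames, the distance between feature sequences is a plain mean absolute difference (an assumption, other distances would work too), and the audio is synthetic. The names `spectrogram` and `best_match` are made up for the example.

```python
import numpy as np

def spectrogram(samples, n_fft=512, hop=256):
    """Magnitude spectrum of each overlapping frame, shape (n_frames, n_fft // 2 + 1)."""
    samples = np.asarray(samples, dtype=float)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(samples) - n_fft) // hop
    frames = np.stack([samples[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def best_match(source_spec, effect_spec):
    """Slide the effect's feature sequence over the source; return the best frame and all distances."""
    n = len(effect_spec)
    dists = np.array([np.mean(np.abs(source_spec[i:i + n] - effect_spec))
                      for i in range(len(source_spec) - n + 1)])
    return int(np.argmin(dists)), dists

sr, hop = 8000, 256
rng = np.random.default_rng(2)
t = np.arange(sr // 4) / sr
# Effect: 0.25 s at 500 Hz followed by 0.25 s at 1200 Hz.
effect = np.concatenate((np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 1200 * t)))
# The copy in the recording has shifted phases, which magnitude spectra largely ignore.
shifted = np.concatenate((np.sin(2 * np.pi * 500 * t + 0.9), np.sin(2 * np.pi * 1200 * t + 2.1)))
recording = 0.02 * rng.standard_normal(5 * sr)
recording[sr:sr + len(t)] += np.sin(2 * np.pi * 800 * t)  # an unrelated sound at 1.0 s
start = 100 * hop                                          # the effect at 3.2 s
recording[start:start + len(shifted)] += shifted

frame, dists = best_match(spectrogram(recording, hop=hop), spectrogram(effect, hop=hop))
found_time = frame * hop / sr
print(frame, found_time)
```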

    I am assuming the effect plays by itself (no background noise) and is added to the recording electronically (as opposed to being recorded via a microphone). If this is not the case, the problem becomes more difficult.

  • 2021-01-15 03:52

    To follow up on the answers by @jonnor and @paul-john-leonard: they are both correct. By using frames (FFT) I was able to do Audio Event Detection.

    I've written up the full source code at:

    https://github.com/craigfrancis/audio-detect

    Some notes though:

    • To create the templates, I used ffmpeg:

      ffmpeg -ss 13.15 -i source.mp4 -t 0.8 -acodec copy -y templates/01.mp4;

    • I decided to use librosa.core.stft, but I needed to make my own implementation of this stft function for the 3 hour file I'm analysing, as it's far too big to keep in memory.

    • When using stft I tried using a hop_length of 64 at first, rather than the default (512), as I assumed that would give me more data to work with... the theory might be true, but 64 was far too detailed, and caused it to fail most of the time.

    • I still have no idea how to get cross-correlation between frame and template to work (via numpy.correlate)... instead I took the results per frame (the 1025 buckets, not 1024, which I believe relate to the Hz frequencies found) and did a very simple average difference check, then ensured that average was above a certain value (my test case worked at 0.15, the main files I'm using this on required 0.55 - presumably because the main files had been compressed quite a bit more):

      hz_score = abs(source[0:1025,x] - template[2][0:1025,y])
      hz_score = sum(hz_score)/float(len(hz_score))

    • When checking these scores, it's really useful to show them on a graph. I often used something like the following:

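      import matplotlib.pyplot as plt

      debug = []  # one score per source frame
      plt.figure(figsize=(30, 5))
      plt.axhline(y=hz_match_required_start, color='y')  # threshold line

      while x < source_length:
          # ... compute hz_score for source frame x, as above ...
          debug.append(hz_score)
          if x == mark_frame:  # mark a frame of interest in red
              plt.axvline(x=len(debug), ymin=0.1, ymax=1, color='r')
          x += 1

      plt.plot(debug)
      plt.show()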

    • When you create the template, you need to trim off any leading silence (to avoid bad matching), and an extra ~5 frames (it seems that the compression / re-encoding process alters this)... likewise, remove the last 2 frames (I think the frames include a bit of data from their surroundings, where the last one in particular can be a bit off).

    • When you start finding a match, you might find it's ok for the first few frames, then it fails... you will probably need to try again a frame or two later. I found it easier having a process that supported multiple templates (slight variations on the sound), and would check their first testable (e.g. 6th) frame and if that matched, put them in a list of potential matches. Then, as it progressed on to the next frames of the source, it could compare it to the next frames of the template, until all frames in the template had been matched (or failed).
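
The multi-template bookkeeping described in the last note could be sketched like this. It is not the code from the repository linked above: the function names, the data layout (each template as a 2-D array of per-frame spectra) and the rule that a lower average difference means a better match are assumptions made for the example, and it leaves out skipping the leading frames and retrying a frame or two later.

```python
import numpy as np

def frame_score(a, b):
    """Average absolute difference between two spectra (lower means more alike)."""
    return float(np.mean(np.abs(a - b)))

def find_matches(source_frames, templates, max_score):
    """Track partial matches of several templates across a stream of source frames.

    source_frames: iterable of 1-D spectra; templates: dict of name -> (n_frames, n_bins) array.
    Returns a list of (template name, source frame index where the match started).
    """
    candidates = []  # (template name, index of next template frame to check, start index)
    found = []
    for x, frame in enumerate(source_frames):
        still_alive = []
        # Advance every potential match by one template frame; drop those that fail.
        for name, next_idx, start in candidates:
            tpl = templates[name]
            if frame_score(frame, tpl[next_idx]) <= max_score:
                if next_idx + 1 == len(tpl):
                    found.append((name, start))  # every frame of the template matched
                else:
                    still_alive.append((name, next_idx + 1, start))
        candidates = still_alive
        # Start a new potential match wherever a template's first frame fits.
        for name, tpl in templates.items():
            if frame_score(frame, tpl[0]) <= max_score:
                if len(tpl) == 1:
                    found.append((name, x))
                else:
                    candidates.append((name, 1, x))
    return found

# Random stand-in data: two templates hidden in a stream of random spectra.
rng = np.random.default_rng(3)
templates = {"A": rng.random((5, 64)), "B": rng.random((4, 64))}
source = rng.random((60, 64))
source[10:15] = templates["A"] + 0.01 * rng.standard_normal((5, 64))
source[40:44] = templates["B"] + 0.01 * rng.standard_normal((4, 64))

found = find_matches(source, templates, max_score=0.1)
print(found)
```

Because the source is consumed one frame at a time, this kind of loop only needs the templates in memory, not the whole 3-hour spectrogram.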

  • 2021-01-15 04:01

    This might not be an answer; it's just where I got to before I started researching the answers by @jonnor and @paul-john-leonard.

    I was looking at the Spectrograms you can get by using librosa stft and amplitude_to_db, and thinking that if I take the data that goes in to the graphs, with a bit of rounding, I could potentially find the 1 sound effect being played:

    https://librosa.github.io/librosa/generated/librosa.display.specshow.html

    The code I've written below kind of works; although it:

    1. Returns quite a few false positives, which might be fixed by tweaking the parameters of what is considered a match.

    2. Needs the librosa functions replaced with something that can parse, round, and do the match checks in one pass, as a 3 hour audio file causes python to run out of memory on a computer with 16GB of RAM after ~30 minutes, before it even gets to the rounding bit.


    import sys
    import numpy
    import librosa
    
    #--------------------------------------------------
    
    if len(sys.argv) == 3:
        source_path = sys.argv[1]
        sample_path = sys.argv[2]
    else:
        print('Missing source and sample files as arguments')
        sys.exit(1)  # non-zero exit status signals the error
    
    #--------------------------------------------------
    
    print('Load files')
    
    source_series, source_rate = librosa.load(source_path) # The 3 hour file
    sample_series, sample_rate = librosa.load(sample_path) # The 1 second file
    
    source_time_total = float(len(source_series) / source_rate)
    
    #--------------------------------------------------
    
    print('Parse Data')
    
    source_data_raw = librosa.amplitude_to_db(abs(librosa.stft(source_series, hop_length=64)))
    sample_data_raw = librosa.amplitude_to_db(abs(librosa.stft(sample_series, hop_length=64)))
    
    sample_height = sample_data_raw.shape[0]
    
    #--------------------------------------------------
    
    print('Round Data') # Also switches X and Y indexes, so X becomes time.
    
    def round_data(raw, height):
    
        length = raw.shape[1]
    
        data = []
    
        range_length = range(1, (length - 1))
        range_height = range(1, (height - 1))
    
        for x in range_length:
    
            x_data = []
    
            for y in range_height:
    
                # neighbours = []
                # for a in [(x - 1), x, (x + 1)]:
                #     for b in [(y - 1), y, (y + 1)]:
                #         neighbours.append(raw[b][a])
                #
                # neighbours = (sum(neighbours) / len(neighbours));
                #
                # x_data.append(round(((raw[y][x] + raw[y][x] + neighbours) / 3), 2))
    
                x_data.append(round(raw[y][x], 2))
    
            data.append(x_data)
    
        return data
    
    source_data = round_data(source_data_raw, sample_height)
    sample_data = round_data(sample_data_raw, sample_height)
    
    #--------------------------------------------------
    
    sample_data = sample_data[50:268] # Temp: Crop the sample_data (318 to 218)
    
    #--------------------------------------------------
    
    source_length = len(source_data)
    sample_length = len(sample_data)
    sample_height -= 2  # round_data() drops the first and last frequency bins
    
    source_timing = float(source_time_total / source_length)
    
    #--------------------------------------------------
    
    print('Process series')
    
    hz_diff_match = 18 # For every comparison, how much of a difference is still considered a match - With the Source, using Sample 2, the maximum diff was 66.06, with an average of ~9.9
    
    hz_match_required_switch = 30 # After matching "start" for X, drop to the lower "end" requirement
    hz_match_required_start = 850 # Out of a maximum match value of 1023
    hz_match_required_end = 650
    hz_match_required = hz_match_required_start
    
    source_start = 0
    sample_matched = 0
    
    x = 0
    while x < source_length:
    
        hz_matched = 0
        for y in range(0, sample_height):
            diff = abs(source_data[x][y] - sample_data[sample_matched][y])
            if diff < hz_diff_match:
                hz_matched += 1
    
        # print('  {} Matches - {} @ {}'.format(sample_matched, hz_matched, (x * source_timing)))
    
        if hz_matched >= hz_match_required:
    
            sample_matched += 1
    
            if sample_matched >= sample_length:
    
                print('      Found @ {}'.format(source_start * source_timing))
    
                sample_matched = 0 # Prep for next match
    
                hz_match_required = hz_match_required_start
    
            elif sample_matched == 1: # First match, record where we started
    
                source_start = x
    
            if sample_matched > hz_match_required_switch:
    
                hz_match_required = hz_match_required_end # Go to a weaker match requirement
    
        elif sample_matched > 0:
    
            # print('  Reset {} / {} @ {}'.format(sample_matched, hz_matched, (source_start * source_timing)))
    
            x = source_start # Matched something, so try again with x+1
    
            sample_matched = 0 # Prep for next match
    
            hz_match_required = hz_match_required_start
    
        x += 1
    
    #--------------------------------------------------
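
On the memory problem: one way to avoid holding the whole spectrogram at once is to compute the spectra frame by frame from chunks of samples, so each frame can be checked and then discarded. The generator below is only a sketch of that idea. `stream_spectra` is a made-up name, the chunks are assumed to come from some block-wise audio reader, and the output will not be numerically identical to librosa.stft (no centre padding, and numpy.hanning is a slightly different window).

```python
import numpy as np

def stream_spectra(chunks, n_fft=2048, hop=512):
    """Yield one magnitude spectrum (n_fft // 2 + 1 bins) per frame from an iterable of sample chunks."""
    window = np.hanning(n_fft)
    buffer = np.zeros(0)
    for chunk in chunks:
        buffer = np.concatenate((buffer, np.asarray(chunk, dtype=float)))
        # Emit every complete frame that is currently available.
        n_frames = 0 if len(buffer) < n_fft else 1 + (len(buffer) - n_fft) // hop
        for i in range(n_frames):
            yield np.abs(np.fft.rfft(buffer[i * hop:i * hop + n_fft] * window))
        # Keep only the samples that the next frame still needs.
        buffer = buffer[n_frames * hop:]

# Stand-in for a long file: a 4 second tone fed in 1000-sample chunks.
sr = 8000
signal = np.sin(2 * np.pi * 440 * np.arange(4 * sr) / sr)
chunks = (signal[i:i + 1000] for i in range(0, len(signal), 1000))
streamed = list(stream_spectra(chunks))  # a real run would process each frame as it arrives
print(len(streamed), streamed[0].shape)
```

With n_fft=2048 each frame has 1025 bins, and only one chunk plus a short tail of samples is held in memory at any time.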
    