I want to write a program that automatically syncs out-of-sync subtitles. One of the solutions I thought of is to somehow algorithmically detect human speech and adjust the subtitles to it.
You could run a window across your audio file and, for each window, measure what fraction of the signal's total power falls in the human vocal range (the fundamental frequencies of speech lie roughly between 50 and 300 Hz). The following is meant to give intuition and is untested on real audio.
import numpy as np

def hasHumanVoice(X, threshold, F_sample, Low_cutoff=50, High_cutoff=300):
    """Detect the presence of vocal-band frequencies in a real signal using the FFT.

    Inputs
    =======
    X: 1-D numpy array, the real time-domain audio signal (single-channel time series)
    threshold: float, fraction of total spectral power that must fall in the
        vocal band for the window to count as speech; calibrate it once on
        audio known to contain (and not contain) speech
    F_sample: float, the sampling frequency of the signal (in Hz)
    Low_cutoff: float, lower edge of the vocal band (in Hz)
    High_cutoff: float, upper edge of the vocal band (in Hz)
    """
    M = X.size  # length of the time series
    Spectrum = np.fft.rfft(X, n=M)
    Power = np.abs(Spectrum) ** 2  # power per frequency bin, not raw complex coefficients

    # Convert the cutoff frequencies into bin indices on the spectrum;
    # bin k of an M-point FFT corresponds to frequency k * F_sample / M
    Low_point = int(Low_cutoff / F_sample * M)
    High_point = int(High_cutoff / F_sample * M)

    totalPower = np.sum(Power)
    # Fraction of the power that lies in the vocal band
    fractionPowerInSignal = np.sum(Power[Low_point:High_point]) / totalPower
    return fractionPowerInSignal > threshold
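hasHumanVoice expects a single-channel array plus its sampling rate. One way to get those (a sketch of mine, assuming the audio track has already been extracted from the video to a hypothetical audio.wav) is scipy.io.wavfile:

from scipy.io import wavfile

# Hypothetical file name; extract the track first, e.g. with
# ffmpeg -i movie.mkv -ac 1 -ar 16000 audio.wav
samplingRate, fullAudio = wavfile.read("audio.wav")
if fullAudio.ndim > 1:                    # collapse stereo to mono
    fullAudio = fullAudio.mean(axis=1)
fullAudio = fullAudio.astype(np.float64)  # floats avoid integer overflow in the FFT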
voiceVector = []
windowLength = int(0.5 * samplingRate)  # run a window of appropriate length across the audio file, e.g. half a second
threshold = 0.3  # example value; must be calibrated once on known speech/non-speech audio
for start in range(0, fullAudio.size - windowLength + 1, windowLength):
    window = fullAudio[start : start + windowLength]
    voiceVector.append(hasHumanVoice(window, threshold, samplingRate))
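To actually resync the subtitles, one simple follow-up (a sketch, not tested; subtitleStarts and subtitleEnds are hypothetical arrays of cue times in seconds parsed from the .srt file) is to build the same kind of 0/1 activity vector from the subtitle timings and cross-correlate it with voiceVector; the lag with the highest correlation is your estimated offset:

windowSeconds = windowLength / samplingRate
subVector = np.zeros(len(voiceVector))
for start, end in zip(subtitleStarts, subtitleEnds):
    subVector[int(start / windowSeconds) : int(end / windowSeconds) + 1] = 1

voice = np.asarray(voiceVector, dtype=float)
# Cross-correlate the mean-subtracted activity vectors; the peak gives
# the audio/subtitle offset in whole windows
corr = np.correlate(voice - voice.mean(), subVector - subVector.mean(), mode="full")
lagWindows = np.argmax(corr) - (len(subVector) - 1)
offsetSeconds = lagWindows * windowSeconds  # shift every cue by this amount

Note this only recovers a constant offset; if the subtitles drift over time (e.g. a frame-rate mismatch), you would also need to fit a scale factor.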