Is there a fast way to find (not necessarily recognize) human speech in an audio file?

感动是毒 2021-02-01 08:59

I want to write a program that automatically syncs unsynced subtitles. One of the solutions I thought of is to somehow algorithmically find human speech and adjust the subtitles accordingly.

3 answers
  •  清歌不尽
    2021-02-01 09:36

    You could slide a window across the audio file and, for each window, compute what fraction of the total signal power falls in the human vocal range (fundamental frequencies lie roughly between 50 and 300 Hz). The following is meant to give intuition and is untested on real audio.

    import numpy as np

    def hasHumanVoice(X, threshold, F_sample, Low_cutoff=50, High_cutoff=300):
        """Check for power in the vocal frequency band of a real signal using the FFT.

        Inputs
        ======
        X: 1-D numpy array, the real time-domain audio signal (single-channel time series)
        threshold: float between 0 and 1, the fraction of total power that must lie in the
            vocal band; it has to be calibrated once against real vocal signals
        F_sample: float, the sampling frequency of the signal (physical frequency in Hz)
        Low_cutoff: float, frequency components below this are ignored (physical frequency in Hz)
        High_cutoff: float, frequency components above this are ignored (physical frequency in Hz)
        """
        M = X.size  # let M be the length of the time series
        # np.fft.rfft returns one complex value per frequency bin (M//2 + 1 bins);
        # scipy.fftpack.rfft packs real and imaginary parts into one real array,
        # which would make the bin indices below point at the wrong frequencies.
        Spectrum = np.fft.rfft(X, n=M)
        Power = np.abs(Spectrum) ** 2  # power per bin; summing the raw spectrum is not a power

        # Convert cutoff frequencies into integer bin indices on the spectrum
        Low_point, High_point = (int(round(float(F) / float(F_sample) * M))
                                 for F in (Low_cutoff, High_cutoff))

        totalPower = np.sum(Power)
        if totalPower == 0:  # an all-zero (silent) window contains no voice
            return 0
        # Fraction of the total power that lies in the vocal band
        fractionPowerInSignal = np.sum(Power[Low_point:High_point]) / totalPower

        return 1 if fractionPowerInSignal > threshold else 0

    # fullAudio (an iterable of equal-length windows), threshold and samplingRate
    # must be defined by the caller.
    voiceVector = []
    for window in fullAudio:  # Run a window of appropriate length across the audio file
        voiceVector.append(hasHumanVoice(window, threshold, samplingRate))
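
    The loop above assumes the audio has already been cut into windows. A minimal, self-contained sketch of that step, plus turning the per-window flags into time intervals that subtitle timestamps could be compared against, might look like the following. The helper names (`split_into_windows`, `band_power_fraction`, `voiced_intervals`), the 0.5 s window length and the 0.5 threshold are all assumptions chosen for illustration, not part of the answer, and like the code above this has only been tried on synthetic tones, not real speech.

```python
import numpy as np

def split_into_windows(signal, sample_rate, window_seconds=0.5):
    """Cut a 1-D signal into consecutive, non-overlapping windows (hypothetical helper)."""
    window_len = int(window_seconds * sample_rate)
    n_windows = len(signal) // window_len  # a trailing partial window is dropped
    return signal[: n_windows * window_len].reshape(n_windows, window_len)

def band_power_fraction(window, sample_rate, low=50.0, high=300.0):
    """Fraction of the window's spectral power that lies between low and high (Hz)."""
    freqs = np.fft.rfftfreq(window.size, d=1.0 / sample_rate)  # frequency of each bin
    power = np.abs(np.fft.rfft(window)) ** 2
    total = power.sum()
    if total == 0:  # silent window
        return 0.0
    band = (freqs >= low) & (freqs <= high)
    return power[band].sum() / total

def voiced_intervals(flags, window_seconds):
    """Merge runs of truthy flags into (start_second, end_second) intervals."""
    intervals = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i  # a voiced run begins
        elif not flag and start is not None:
            intervals.append((start * window_seconds, i * window_seconds))
            start = None
    if start is not None:  # the run extends to the end of the audio
        intervals.append((start * window_seconds, len(flags) * window_seconds))
    return intervals

# Synthetic check: one second of a 150 Hz tone, then one second of a 3000 Hz tone.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = np.concatenate([np.sin(2 * np.pi * 150 * t), np.sin(2 * np.pi * 3000 * t)])

windows = split_into_windows(signal, sample_rate, window_seconds=0.5)
flags = [int(band_power_fraction(w, sample_rate) > 0.5) for w in windows]
intervals = voiced_intervals(flags, window_seconds=0.5)
```

    Non-overlapping windows keep the sketch short; overlapping windows would give finer time resolution at the cost of more FFTs. Note that music and low-frequency noise also put power in the 50 to 300 Hz band, so the threshold would need tuning on real recordings.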
    
