I want to write a program that automatically syncs unsynced subtitles. One of the solutions I thought of is to somehow algorithmically find human speech and adjust the subtitles accordingly.
You could run a window across your audio file and estimate what fraction of the signal's total power lies in the range of human vocal fundamental frequencies (roughly 50 to 300 Hz). The following is meant to give intuition and is untested on real audio.
import numpy as np

def hasHumanVoice(X, threshold, F_sample, Low_cutoff=50, High_cutoff=300):
    """Search for the presence of vocal-band frequencies in a real signal using an FFT.

    Inputs
    =======
    X: 1-D numpy array, the real time domain audio signal (single channel time series)
    Low_cutoff: float, frequency components below this frequency are not counted (physical frequency in Hz)
    High_cutoff: float, frequency components above this frequency are not counted (physical frequency in Hz)
    F_sample: float, the sampling frequency of the signal (physical frequency in Hz)
    threshold: float between 0 and 1, the fraction of total power that must lie in the
        vocal band. Has to be calibrated once on real vocal signals.

    Returns 1 if the window looks voiced, 0 otherwise.
    """
    M = X.size  # let M be the length of the time series
    Spectrum = np.fft.rfft(X, n=M)  # one-sided complex spectrum; bin k sits at k * F_sample / M Hz
    Power = np.abs(Spectrum) ** 2  # power in each frequency bin
    # Convert cutoff frequencies into integer bin indices on the spectrum
    Low_point, High_point = [int(round(float(F) / F_sample * M)) for F in (Low_cutoff, High_cutoff)]
    totalPower = np.sum(Power)
    if totalPower == 0:  # an all-zero window has no voice and would divide by zero
        return 0
    # Fraction of the total power that falls inside the vocal band
    fractionPowerInSignal = np.sum(Power[Low_point:High_point]) / totalPower
    return 1 if fractionPowerInSignal > threshold else 0

voiceVector = []
# fullAudio is assumed to be an iterable of fixed-length windows cut from the audio;
# threshold and samplingRate must be defined beforehand
for window in fullAudio:  # Run a window of appropriate length across the audio file
    voiceVector.append(hasHumanVoice(window, threshold, samplingRate))
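The loop above takes for granted that the audio has already been cut into windows. A minimal sketch of that step follows, assuming the samples are available as a 1-D sequence (a list or a numpy array). The names `sliding_windows`, `window_start_times`, `window_size`, and `hop_size` are hypothetical helpers introduced here for illustration, not part of the answer's code.

```python
def sliding_windows(samples, window_size, hop_size=None):
    """Yield consecutive windows of `window_size` samples from a 1-D sequence.

    hop_size is the step between window starts; it defaults to window_size
    (no overlap). A trailing partial window is dropped.
    """
    if hop_size is None:
        hop_size = window_size
    if window_size <= 0 or hop_size <= 0:
        raise ValueError("window_size and hop_size must be positive")
    for start in range(0, len(samples) - window_size + 1, hop_size):
        yield samples[start:start + window_size]


def window_start_times(num_windows, hop_size, sampling_rate):
    """Start time in seconds of each window, to map results back onto the timeline."""
    return [i * hop_size / float(sampling_rate) for i in range(num_windows)]
```

With these, `fullAudio` could be something like `sliding_windows(samples, int(0.05 * samplingRate))` for 50 ms windows. The window length is a tuning choice: it has to be long enough for the FFT to resolve the 50 to 300 Hz band, since adjacent bins are `samplingRate / window_size` Hz apart.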
The technical term for what you are trying to do is Voice Activity Detection (VAD). There is a Python library called SPEAR that does it (among other things).
webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection (VAD) implementation--it does the best job of any VAD I've used as far as correctly classifying human speech, even with noisy audio.
To use it for your purpose, you would do something like this:
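import webrtcvad
vad = webrtcvad.Vad()
# chunk: 10, 20, or 30 ms of 16-bit mono PCM audio as bytes; returns True if it contains speech
vad.is_speech(chunk, sample_rate)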
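To get per-chunk classifications for a whole file, the audio first has to be sliced into fixed-length chunks. The sketch below assumes the audio is already decoded to 16-bit mono PCM bytes at a sample rate the VAD accepts (8000, 16000, 32000, or 48000 Hz). `frame_generator` and `classify_frames` are hypothetical helper names, and the classifier is passed in as a callable so the slicing logic can be tried without the library installed.

```python
def frame_generator(pcm, sample_rate, frame_ms=30):
    """Yield (start_seconds, frame_bytes) for 16-bit mono PCM audio.

    A trailing partial frame is dropped, since the VAD needs full frames.
    """
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    offset = 0
    frame_index = 0
    while offset + bytes_per_frame <= len(pcm):
        yield frame_index * frame_ms / 1000.0, pcm[offset:offset + bytes_per_frame]
        offset += bytes_per_frame
        frame_index += 1


def classify_frames(pcm, sample_rate, is_speech, frame_ms=30):
    """Return a list of (start_seconds, bool), one entry per frame.

    is_speech is any callable taking (frame_bytes, sample_rate).
    """
    return [(start, bool(is_speech(frame, sample_rate)))
            for start, frame in frame_generator(pcm, sample_rate, frame_ms)]


# Usage with webrtcvad (requires the webrtcvad package to be installed):
#   import webrtcvad
#   vad = webrtcvad.Vad(2)  # aggressiveness mode, 0 (least) to 3 (most)
#   flags = classify_frames(pcm, 16000, vad.is_speech)
```

The start time attached to each frame is what later lets you turn runs of speech frames into timestamps for the subtitles.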
The VAD output may be "noisy": if it classifies a single 30-millisecond chunk of audio as speech, you don't really want to output a time for that. You probably want to look over the past 0.3 seconds (or so) of audio and see if the majority of 30-millisecond chunks in that period are classified as speech. If they are, then you output the start time of that 0.3-second period as the beginning of speech. Then you do something similar to detect when the speech ends: wait for a 0.3-second period of audio where the majority of 30-millisecond chunks are not classified as speech by the VAD. When that happens, output the end time as the end of speech.
You may have to tweak the timing a little bit to get good results for your purposes--maybe you decide that you need 0.2 seconds of audio where more than 30% of chunks are classified as speech by the VAD before you trigger, and 1.0 seconds of audio with more than 50% of chunks classified as non-speech before you de-trigger.
A ring buffer (collections.deque in Python) is a helpful data structure for keeping track of the last N chunks of audio and their classification.
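The trigger and de-trigger logic described above can be sketched roughly as follows. This is one possible reading of the description, not a reference implementation: `speech_segments` and all of its parameter names are hypothetical, and it assumes you already have one boolean per consecutive chunk from the VAD. Here the end of speech is reported as the end of the window that caused the de-trigger, which is an assumption about what "the end time" means.

```python
from collections import deque


def speech_segments(flags, frame_ms=30, start_window_ms=300, end_window_ms=300,
                    start_ratio=0.5, end_ratio=0.5):
    """Turn per-chunk speech flags into (start_seconds, end_seconds) segments.

    flags: iterable of booleans, one per consecutive chunk of frame_ms milliseconds.
    """
    start_len = max(1, start_window_ms // frame_ms)  # chunks in the trigger window
    end_len = max(1, end_window_ms // frame_ms)      # chunks in the de-trigger window
    ring = deque(maxlen=start_len)  # ring buffer of (chunk_index, is_speech)
    triggered = False
    segments = []
    seg_start = None
    index = -1
    for index, flag in enumerate(flags):
        ring.append((index, bool(flag)))
        if len(ring) < ring.maxlen:
            continue  # not enough history yet to decide
        if not triggered:
            voiced = sum(1 for _, f in ring if f)
            if voiced > start_ratio * ring.maxlen:
                triggered = True
                # Speech starts at the beginning of the window that triggered
                seg_start = ring[0][0] * frame_ms / 1000.0
                ring = deque(maxlen=end_len)
        else:
            unvoiced = sum(1 for _, f in ring if not f)
            if unvoiced > end_ratio * ring.maxlen:
                triggered = False
                # Assumption: speech ends at the end of the de-trigger window
                segments.append((seg_start, (index + 1) * frame_ms / 1000.0))
                ring = deque(maxlen=start_len)
    if triggered:  # audio ended while still in speech
        segments.append((seg_start, (index + 1) * frame_ms / 1000.0))
    return segments
```

The asymmetric tuning mentioned above would then look like `speech_segments(flags, start_window_ms=200, start_ratio=0.3, end_window_ms=1000, end_ratio=0.5)`. Using a fresh deque after each state change means the de-trigger decision only looks at chunks that arrived after speech began, and vice versa.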