Question
Basically I have trained a few models using Keras to do isolated word recognition. Currently I can record audio with sounddevice's rec function for a fixed duration and save it as a wav file, and I have implemented silence detection to trim out unwanted samples. But all of this runs only after the whole recording is complete. I would like to get the trimmed audio segments immediately, while recording is still in progress, so that I can do speech recognition in real time. I'm using Python 2 and TensorFlow 1.14.0. Below is a snippet of what I currently have:
import sounddevice as sd
import matplotlib.pyplot as plt
import time
#import tensorflow.keras.backend as K
import numpy as np
from scipy.io.wavfile import write
from scipy.io.wavfile import read
from scipy.io import wavfile
from pydub import AudioSegment
import cv2
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
tf.compat.v1.enable_v2_behavior()
from contextlib import closing
import multiprocessing
models=['model1.h5','model2.h5','model3.h5','model4.h5','model5.h5']
loaded_models=[]
for model in models:
    loaded_models.append(tf.keras.models.load_model(model))

def prediction(model_ip):
    model,t=model_ip
    ret_val=model.predict(t).tolist()[0]
    return ret_val
print("recording in 5sec")
time.sleep(5)
fs = 44100 # Sample rate
seconds = 10 # Duration of recording
print('recording')
time.sleep(0.5)
myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
sd.wait()
thresh=0.025
gaplimit=9000
wav_file='/home/nick/Desktop/Endpoint/aud.wav'
write(wav_file, fs, myrecording)
fs,myrecording = read(wav_file)
#The silence removal function trims the recording and saves only the useful
#samples as a wav file; each trimmed segment contains one full word which can
#be recognized.
trimmed_audio=end_points(wav_file,thresh,50)
final_ans=''
#The loop below pairs each loaded model (I'm using multiple models) with the
#input in a tuple
for trimmed_aud in trimmed_audio:
    ...
    ... #The trimmed audio is processed further into the input t
    ... #which the models can predict on
    modelon=[]
    for md in loaded_models:
        modelon.append((md,t))
    start_time=time.time()
    with closing(multiprocessing.Pool()) as p:
        predops=p.map(prediction,modelon)
    print('Total time taken: {}'.format(time.time() - start_time))
    actops=[]
    for predop in predops:
        actops.append(predop.index(max(predop)))
    print(actops)
    #Majority vote: keep the class predicted most often across the five models
    max_freqq=max(set(actops), key=actops.count)
    final_ans+=str(max_freqq)
print("Output: {}".format(final_ans))
Note that the above code only includes what is relevant to the question and will not run as-is; I wanted to give an overview of what I have so far. What I'm after: if multiple words are spoken within the 10-second recording duration (the seconds variable in the code), then whenever the energy of the samples over a 50 ms window drops below a certain threshold, I want to cut the audio at those two points, trim that segment, and use it for prediction, all while recording continues. Recording and prediction of the trimmed segments must happen simultaneously, so that each output word can be displayed immediately after its utterance within the 10 seconds of recording. A sketch of the direction I'm considering is below; I would really appreciate any suggestions on how I can go about this.
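For concreteness, here is a minimal sketch of that direction, replacing sd.rec with sounddevice's callback-based InputStream. The RMS endpointing and names like segment_queue are my placeholders, not part of my working code:

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2, which I'm on

import numpy as np
import sounddevice as sd

fs = 44100
win = int(0.05 * fs)           # 50 ms analysis window
thresh = 0.025                 # same energy threshold as above
segment_queue = queue.Queue()  # finished word segments for the recognizer

buf = []          # 50 ms windows of the word currently being spoken
in_word = False

def callback(indata, frames, time_info, status):
    global in_word
    if status:
        print(status)
    w = indata[:, 0].copy()            # one 50 ms block per callback (blocksize=win)
    rms = np.sqrt(np.mean(w ** 2))
    if rms >= thresh:
        in_word = True
        buf.append(w)
    elif in_word:
        # energy dropped below the threshold: cut here and emit the segment
        segment_queue.put(np.concatenate(buf))
        del buf[:]
        in_word = False

with sd.InputStream(samplerate=fs, channels=1, blocksize=win,
                    callback=callback):
    sd.sleep(10 * 1000)                # record for 10 s while segments stream out

Each word segment is emitted as soon as its trailing below-threshold window arrives, so it can be recognized while the rest of the 10 seconds is still being recorded.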
Answer 1:
Hard to say what your model architecture is, but there are models specifically designed for streaming recognition, like Facebook's streaming convnets. You won't be able to implement them in Keras easily, though.
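If you keep your existing non-streaming Keras models, you can still get per-word output by overlapping capture and inference: a worker thread drains the segment queue filled by the recording callback and classifies each segment as soon as it ends. This is only a sketch; preprocess() stands in for whatever feature extraction produces your model input t, and with TF 1.14 you may need to call model._make_predict_function() in the main thread first to avoid graph/session issues:

import threading
import numpy as np

def recognizer(segment_queue, loaded_models):
    while True:
        segment = segment_queue.get()     # blocks until a word segment arrives
        if segment is None:               # sentinel: recording is finished
            break
        t = preprocess(segment)           # placeholder: segment -> model input
        votes = []
        for model in loaded_models:
            probs = model.predict(t)[0]
            votes.append(int(np.argmax(probs)))
        # majority vote across the ensemble, as in your code
        print("Output: {}".format(max(set(votes), key=votes.count)))

worker = threading.Thread(target=recognizer,
                          args=(segment_queue, loaded_models))
worker.start()
# ... record as in the question's sketch ...
segment_queue.put(None)                   # tell the worker to stop
worker.join()

Using a thread rather than a second multiprocessing.Pool keeps the already-loaded models in the same process, which sidesteps pickling them into child processes.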
来源:https://stackoverflow.com/questions/60951638/how-to-simultaneously-read-audio-samples-while-recording-in-python-for-real-time