I have a wav conversation of 2 people(customer and tech support) I have 3 separate functions that extract 1 voice, cut 10 seconds and transform it to embedding.
I figured it out - the function below just works without saving, buffer etc. It receives a wav file and edits it and just sends straight to the get math embedding function:
def get_customer_voice_and_cutting_10_seconds_embedding(file):
print('getting customer voice only')
wav = read(file)
ch = wav[1].shape[1]
sr = wav[0]
c1 = wav[1][:,1]
vad = VoiceActivityDetection()
vad.process(c1)
voice_samples = vad.get_voice_samples()
audio_segment = AudioSegment(voice_samples.tobytes(), frame_rate=sr,sample_width=voice_samples.dtype.itemsize, channels=1)
audio_segment = audio_segment[0:10000]
file = str(file) + '_10seconds.wav'
return get_embedding(file)
the key is tobytes() in Audio segment, it just assembles all them together in 1 track again