Question
Using librosa, I created MFCCs for my audio file as follows:
import librosa
y, sr = librosa.load('myfile.wav')
print(y)
print(sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr)
I also have a text file that contains manual annotations [start, stop, tag] corresponding to the audio, as follows:
0.0 2.0 sound1
2.0 4.0 sound2
4.0 6.0 silence
6.0 8.0 sound1
QUESTION: How do I combine the MFCCs generated by librosa with the annotations from the text file?
My final goal is to pair each MFCC with its corresponding label and pass the pairs to a neural network, so that the network has the MFCCs and the corresponding labels as training data.
If the data were one-dimensional, I could have N columns with N values and a final column Y with the class label. But I'm confused about how to proceed, as the MFCC array has a shape like (16, X) or (20, Y), so I don't know how to combine the two.
My sample MFCCs are here: https://gist.github.com/manbharae/0a53f8dfef6055feef1d8912044e1418
Please help. Thank you.
Update: The objective is to train a neural network so that it can identify a new sound when it encounters it in the future.
I googled and found that MFCCs are very good for speech. However, my audio contains speech, but I want to identify non-speech sounds. Are there any other recommended audio features for a general-purpose audio classification/recognition task?
Answer 1:
Try the following. The explanation is included in the code.
import numpy
import librosa

# The following function returns a label index for a point in time (tp).
# This is pseudocode for you to complete.
def getLabelIndexForTime(tp):
    # Search the loaded annotations for the label that corresponds to the given time,
    # then convert the label to an index that represents its unique value in the set,
    # i.e. 'sound1' = 0, 'sound2' = 1, ...
    # print(tp)  # for debugging
    label_index = 0  # replace with the logic above
    return label_index

if __name__ == '__main__':
    # Load the waveform samples and convert them to MFCCs
    raw_samples, sample_rate = librosa.load('Front_Right.wav')
    mfcc = librosa.feature.mfcc(y=raw_samples, sr=sample_rate)
    print('Wave duration is %4.2f seconds' % (len(raw_samples) / float(sample_rate)))

    # Create the network's input training data, X.
    # mfcc is organized (feature, sample) but the net needs (sample, feature),
    # so X is mfcc reorganized to (sample, feature).
    X = numpy.moveaxis(mfcc, 1, 0)
    print('mfcc.shape:', mfcc.shape)
    print('X.shape:   ', X.shape)

    # Note that 512 samples is the default 'hop_length' used in calculating
    # the MFCCs, so each MFCC frame spans 512/sample_rate seconds.
    mfcc_samples = mfcc.shape[1]
    mfcc_span = 512 / float(sample_rate)
    print('MFCC calculated duration is %4.2f seconds' % (mfcc_span * mfcc_samples))

    # For each of the network's input samples, calculate the time point where
    # it occurs and get the appropriate label index for it.
    # Use +0.5 to get the middle of the MFCC frame's point in time.
    Y = []
    for sample_num in range(mfcc_samples):
        time_point = (sample_num + 0.5) * mfcc_span
        label_index = getLabelIndexForTime(time_point)
        Y.append(label_index)
    Y = numpy.array(Y)

    # Y now contains the network's output training values.
    # Note: for some nets you may need to convert this to one-hot format.
    print('Y.shape: ', Y.shape)
    assert Y.shape[0] == X.shape[0]  # X and Y have the same number of samples

    # Train the net with something like...
    # model.fit(X, Y, ...)  # e.g. for a Keras NN model
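The stub getLabelIndexForTime above could be filled in along the following lines. This is only a sketch: it assumes the annotations use the question's "start stop tag" format with whitespace-separated fields, and the names parse_annotations, label_index_for_time and ANNOTATIONS are made up for illustration. Times that fall outside every annotated segment get a default index of -1, which you would probably want to filter out before training.

```python
import bisect

# Hypothetical annotation text in the question's "start stop tag" format.
# In practice this would be read from the annotation file.
ANNOTATIONS = """\
0.0 2.0 sound1
2.0 4.0 sound2
4.0 6.0 silence
6.0 8.0 sound1
"""

def parse_annotations(text):
    """Return (segments, label_to_index).

    segments is a list of (start, stop, label_index) tuples sorted by start.
    label_to_index maps each tag to an integer in order of first appearance.
    """
    segments = []
    label_to_index = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        start, stop, tag = line.split()
        index = label_to_index.setdefault(tag, len(label_to_index))
        segments.append((float(start), float(stop), index))
    segments.sort()
    return segments, label_to_index

def label_index_for_time(tp, segments, default=-1):
    """Return the label index of the segment with start <= tp < stop."""
    starts = [seg[0] for seg in segments]
    pos = bisect.bisect_right(starts, tp) - 1  # last segment starting at or before tp
    if pos >= 0 and tp < segments[pos][1]:
        return segments[pos][2]
    return default  # tp is not covered by any annotation

segments, label_to_index = parse_annotations(ANNOTATIONS)
```

With this in place, the loop in the main script could call label_index_for_time(time_point, segments) instead of the stub, and len(label_to_index) gives the number of distinct labels.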
I should mention that the Y data built above is intended for a network with a softmax output that can be trained on integer label data. Keras models accept this with the sparse_categorical_crossentropy loss function (I believe the loss function converts it to one-hot encoding internally). Other frameworks require the Y training labels to be delivered already in one-hot encoded format, which is more common. There are lots of examples of how to do the conversion. For your case, you could do something like:
Yoh = numpy.zeros(shape=(Y.shape[0], num_label_types), dtype='float32')
for i, val in enumerate(Y):
    Yoh[i, val] = 1.0
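If you prefer to avoid the explicit loop, the same conversion can be sketched by indexing the rows of an identity matrix with the integer labels. The values of Y and num_label_types below are made-up examples; num_label_types is assumed to be the number of distinct labels in your annotations.

```python
import numpy

Y = numpy.array([0, 1, 2, 0])  # example integer labels, one per MFCC frame
num_label_types = 3            # assumed number of distinct labels

# Row k of the identity matrix is the one-hot vector for label k,
# so indexing with Y selects one row per sample.
Yoh = numpy.eye(num_label_types, dtype='float32')[Y]
```

This only works if every value in Y is a valid index from 0 to num_label_types - 1.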
As for MFCCs being suitable for classifying non-speech audio, I would expect them to work, but you may want to try modifying their parameters; e.g. librosa lets you pass something like n_mfcc=40 so that you get 40 features instead of just 20. For fun, you might try replacing the MFCCs with a simple FFT of the same size (512 samples) and see which works best.
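As a rough idea of what the "simple FFT" alternative might look like, here is a sketch using only numpy. It assumes non-overlapping 512-sample frames and a Hann window, both of which are my choices rather than anything librosa does for you; fft_features is a made-up helper name and the test tone stands in for real audio.

```python
import numpy

def fft_features(samples, frame_size=512):
    """Split samples into non-overlapping frames and return magnitude spectra.

    The result has shape (n_frames, frame_size // 2 + 1), i.e. already in the
    (sample, feature) layout the network expects.
    """
    n_frames = len(samples) // frame_size
    # Drop the incomplete tail so the signal divides evenly into frames.
    frames = samples[:n_frames * frame_size].reshape(n_frames, frame_size)
    window = numpy.hanning(frame_size)  # reduce spectral leakage
    return numpy.abs(numpy.fft.rfft(frames * window, axis=1))

# Synthetic 1-second 440 Hz tone at 22050 Hz (librosa.load's default rate).
sample_rate = 22050
t = numpy.arange(sample_rate) / float(sample_rate)
samples = numpy.sin(2 * numpy.pi * 440.0 * t)

X_fft = fft_features(samples)
```

Note that the number of frames produced this way will generally differ slightly from the number of MFCC frames, so the labels in Y would need to be computed against these frames' time points rather than reused as-is.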
Source: https://stackoverflow.com/questions/48388641/how-to-combine-mfcc-vector-with-labels-from-annotation-to-pass-to-a-neural-netwo