My main goal is to feed MFCC features to an ANN.
However, I am stuck at the data preprocessing step, and my question has two parts.
I have an audio file.
I have a txt file that has the annotation and time stamp like this:
0.0 2.5 Music
2.5 6.05 silence
6.05 8.34 notmusic
8.34 12.0 silence
12.0 15.5 music
I know that for a single audio file I can calculate the MFCCs using librosa like this:
import librosa
y, sr = librosa.load('abcd.wav')
mfcc=librosa.feature.mfcc(y=y, sr=sr)
I'm unable to wrap my head around two things:
Part 1: How to calculate the MFCCs for the individual segments given in the annotations.
Part 2: How best to store these MFCCs for passing them to a Keras DNN, i.e. should all MFCCs calculated per audio segment be saved to a single list/dictionary, or is it better to save them to different dictionaries so that all MFCCs belonging to one label are in one place?
I'm new to audio processing and Python, so I'm open to recommendations regarding best practices.
More than happy to provide additional details. Thanks.
Part 1: MFCC to tag conversion
It's not obvious from the librosa documentation, but the MFCCs are calculated once per hop, which is about every 23 ms with the defaults. With your code above, mfcc.shape
will return (20, x)
where 20 is the number of features (MFCC coefficients) and x is the number of frames. The default hop_length
for mfcc is 512 samples, which means consecutive frames are about 23 ms apart (512/sr, with librosa.load's default sr of 22050 Hz).
Using this you can compute which frames go with which tag in your text file. For example, the tag Music
goes from 0.0 to 2.5 seconds, so it covers MFCC frames 0 to 2.5*sr/512 ≈ 108. The segment boundaries will not fall exactly on frame boundaries, so you need to round the values.
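As a rough sketch of that time-to-frame arithmetic, the snippet below parses the annotation text from the question and converts each segment to a frame range. It assumes the librosa defaults (sr of 22050 Hz, hop length of 512); the helper names (parse_annotations, time_to_frame, segments_to_frames) are made up for illustration, and lowercasing the labels so that "Music" and "music" end up in the same class is my assumption, not something the question states.

```python
SR = 22050         # librosa.load's default sample rate (assumed unchanged)
HOP_LENGTH = 512   # default hop length of librosa.feature.mfcc

ANNOTATIONS = """\
0.0 2.5 Music
2.5 6.05 silence
6.05 8.34 notmusic
8.34 12.0 silence
12.0 15.5 music
"""

def parse_annotations(text):
    """Turn 'start end label' lines into (start_sec, end_sec, label) tuples."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        start, end, label = line.split(maxsplit=2)
        # Assumption: labels are case-insensitive ("Music" == "music").
        segments.append((float(start), float(end), label.strip().lower()))
    return segments

def time_to_frame(seconds, sr=SR, hop_length=HOP_LENGTH):
    """Round a time in seconds to the nearest MFCC frame index."""
    return int(round(seconds * sr / hop_length))

def segments_to_frames(segments, sr=SR, hop_length=HOP_LENGTH):
    """Convert (start_sec, end_sec, label) to (first_frame, end_frame, label)."""
    return [
        (time_to_frame(start, sr, hop_length),
         time_to_frame(end, sr, hop_length),
         label)
        for start, end, label in segments
    ]

frame_segments = segments_to_frames(parse_annotations(ANNOTATIONS))
```

With the mfcc array from the question, the columns for one segment would then be mfcc[:, first:end], where end is exclusive, as usual for Python slices.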
Part 2A: DNN Data Format
For the input (MFCC data) you'll need to decide what the input looks like. You'll have 20 features, but do you want to feed a single frame to your net, or are you going to submit a time series? Your MFCC data is already a NumPy array; however, it's laid out as (features, frames). You probably want to swap the axes to (frames, features) for input to Keras. Swapping axes is a transpose, so use mfcc.T or numpy.transpose
to do that; numpy.reshape would produce the right shape but scramble the values.
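A small sketch of the axis swap, using a made-up array as a stand-in for the librosa output so it runs without an audio file. One thing to watch for: swapping axes is a transpose. Reshaping to the same target shape also runs without error, but it no longer keeps each frame's coefficients together.

```python
import numpy as np

# Stand-in for librosa's output: 20 coefficients x 5 frames (values are arbitrary).
mfcc = np.arange(100, dtype=np.float32).reshape(20, 5)

# Transpose: row i now holds the 20 coefficients of frame i.
frames_first = mfcc.T               # shape (5, 20)

# Reshape to the same shape keeps memory order, so rows are NOT frames.
scrambled = mfcc.reshape(5, 20)     # shape (5, 20), but wrong contents

print(frames_first.shape, scrambled.shape)
print(np.array_equal(frames_first, scrambled))
```

If you feed single frames to a dense network, frames_first can be used as the (samples, features) input directly; a time-series model would need an extra windowing step on top of this.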
For the output, you need to assign a numeric value to each tag in your text file. Typically you would store the tag-to-integer mapping
in a dictionary. This will then be used to create your training output for the network. There should be one output integer for each input sample.
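One possible way to build that dictionary and the per-frame targets is sketched below. The frame ranges are the ones you get for the question's annotations with the default sr and hop length; build_label_map and frame_labels are hypothetical helper names, and filling unannotated frames with -1 is just an assumption so that gaps are easy to spot and drop.

```python
import numpy as np

def build_label_map(labels):
    """Map each distinct tag to an integer, in sorted order for reproducibility."""
    return {name: idx for idx, name in enumerate(sorted(set(labels)))}

def frame_labels(segments, label_map, n_frames, fill=-1):
    """Return one integer per frame; frames outside every segment get `fill`."""
    y = np.full(n_frames, fill, dtype=np.int64)
    for first, end, label in segments:
        y[first:end] = label_map[label]   # end is exclusive
    return y

# (first_frame, end_frame, tag) for the annotations in the question.
segments = [
    (0, 108, "music"),
    (108, 261, "silence"),
    (261, 359, "notmusic"),
    (359, 517, "silence"),
    (517, 668, "music"),
]

label_map = build_label_map(label for _, _, label in segments)
y = frame_labels(segments, label_map, n_frames=668)
```

In practice n_frames would be mfcc.shape[1], so that y lines up one-to-one with the transposed MFCC array.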
Part 2B: Saving the Data
The simplest way to do this is to use pickle
to save the data and then reload it later. I like to use a class to encapsulate the input, output and dictionary data, but you can choose whatever works for you.
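A minimal sketch of the save and reload round trip with pickle. It bundles everything into a plain dict rather than a class, purely to keep the example short; the key names, the function names and the file name are arbitrary choices, and the arrays are dummy data.

```python
import os
import pickle
import tempfile

import numpy as np

def save_dataset(path, X, y, label_map):
    """Pickle the inputs, targets and tag-to-integer dictionary together."""
    with open(path, "wb") as f:
        pickle.dump({"X": X, "y": y, "label_map": label_map}, f)

def load_dataset(path):
    """Load a bundle written by save_dataset."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Dummy data: 4 frames x 20 features, one integer target per frame.
X = np.zeros((4, 20), dtype=np.float32)
y = np.array([0, 0, 2, 1])
label_map = {"music": 0, "notmusic": 1, "silence": 2}

path = os.path.join(tempfile.mkdtemp(), "mfcc_dataset.pkl")
save_dataset(path, X, y, label_map)
restored = load_dataset(path)
```

Keeping the dictionary in the same file as the arrays means the integer targets can always be mapped back to their tags later.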