Question
I'm currently trying to create and train a neural network to perform simple speech classification using MFCCs.
At the moment, I'm using 26 coefficients for each sample and a total of 5 different classes; these are five different words with varying numbers of syllables.
Each sample is 2 seconds long, but I am unsure how to handle cases where the user pronounces a word either very slowly or very quickly. For example, the word 'television' spoken within one second yields different coefficients than the same word spoken within two seconds.
Any advice on how I can solve this problem would be much appreciated!
Answer 1:
> I'm currently trying to create and train a neural network to perform simple speech classification using MFCCs.
Simple feed-forward neural networks expect a fixed-size input, so they are not invariant to input length and are poorly suited to analyzing time series.
To classify a time series such as a sequence of MFCC frames, you can use a classifier with time invariance. For example, you can use neural networks combined with hidden Markov models (ANN-HMM), Gaussian mixture models with hidden Markov models (GMM-HMM), or recurrent neural networks (RNN). A Matlab implementation of an RNN is here. A Theano implementation is also available. You can find detailed descriptions of these models by searching on Google.
Speech recognition is not a simple thing to implement, so it is better to use existing software such as CMUSphinx.
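To illustrate why a recurrent network copes with utterances of different lengths, here is a minimal sketch of the forward pass of a small Elman-style RNN in plain Python. It only shows that sequences with different numbers of MFCC frames produce an output of the same size. The 26 inputs and 5 classes come from the question; the hidden size, the frame counts (100 and 200), the random input data, and the helper names `make_rnn`, `matvec`, and `classify` are assumptions made for illustration. The weights are random and untrained, so the probabilities themselves are meaningless; training (for example, backpropagation through time) is omitted.

```python
import math
import random


def make_rnn(n_in, n_hidden, n_classes, seed=0):
    """Create small random (untrained) weights for a tiny Elman-style RNN."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_xh": mat(n_hidden, n_in),      # input  -> hidden
        "W_hh": mat(n_hidden, n_hidden),  # hidden -> hidden (the recurrence)
        "W_hy": mat(n_classes, n_hidden), # hidden -> class scores
    }


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def classify(rnn, frames):
    """Consume any number of MFCC frames and return class probabilities."""
    h = [0.0] * len(rnn["W_hh"])  # hidden state starts at zero
    for x in frames:
        # The same weights are reused at every time step, so the number
        # of frames does not affect the number of parameters.
        a = matvec(rnn["W_xh"], x)
        b = matvec(rnn["W_hh"], h)
        h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
    logits = matvec(rnn["W_hy"], h)
    # Numerically stable softmax over the final hidden state's scores.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed setup: 26 MFCCs per frame, 5 word classes, 16 hidden units.
rnn = make_rnn(n_in=26, n_hidden=16, n_classes=5)

# Stand-ins for a fast and a slow utterance (random values, not real MFCCs).
data_rng = random.Random(1)
fast = [[data_rng.gauss(0, 1) for _ in range(26)] for _ in range(100)]
slow = [[data_rng.gauss(0, 1) for _ in range(26)] for _ in range(200)]

p_fast = classify(rnn, fast)
p_slow = classify(rnn, slow)
print(len(p_fast), len(p_slow))  # 5 5
```

Because the hidden state has a fixed size and is updated once per frame, only the final state needs to be passed to the output layer, whatever the length of the utterance. In practice you would use an existing framework rather than hand-written loops like these.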
Source: https://stackoverflow.com/questions/21645082/applying-neural-network-to-mfccs-for-variable-length-speech-segments