Applying neural network to MFCCs for variable-length speech segments

Question

I'm currently trying to create and train a neural network to perform simple speech classification using MFCCs.

At the moment, I'm using 26 coefficients for each sample, and a total of 5 different classes - these are five different words with varying numbers of syllables.

While each sample is 2 seconds long, I am unsure how to handle cases where the user pronounces a word either very slowly or very quickly. For example, the word 'television' spoken in one second yields different coefficients than the same word spoken in two seconds.

Any advice on how I can solve this problem would be much appreciated!

Answer 1:

"I'm currently trying to create and train a neural network to perform simple speech classification using MFCCs."

A simple feed-forward neural network takes a fixed-size input, so it is not invariant to input length and cannot, on its own, analyze a time series.

To classify a time series such as a sequence of MFCC frames, use a classifier that is invariant to timing, that is, one that tolerates the same word being spoken faster or slower. Examples include neural networks combined with hidden Markov models (ANN-HMM), Gaussian mixture models combined with hidden Markov models (GMM-HMM), and recurrent neural networks (RNNs). RNN implementations are available for Matlab and for Theano. Detailed descriptions of these models are easy to find online.
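To illustrate why a recurrent network copes with utterances of different lengths, here is a minimal sketch of an Elman-style RNN forward pass in plain Python. It is not the Matlab or Theano implementation mentioned above: the weights are random and untrained, the hidden size of 16 is arbitrary, the MFCC frames are random stand-ins, and names such as `TinyRNN` and `fake_utterance` are made up for this example. The point is only that the same weights are applied once per frame, so the number of frames can vary freely.

```python
import math
import random

N_MFCC = 26      # coefficients per frame, as in the question
N_HIDDEN = 16    # hidden-state size (arbitrary choice for this sketch)
N_CLASSES = 5    # five words, as in the question


def rand_matrix(rows, cols, rng, scale=0.1):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    peak = max(z)
    exps = [math.exp(s - peak) for s in z]
    total = sum(exps)
    return [e / total for e in exps]


class TinyRNN:
    """Untrained Elman-style RNN: one hidden state, updated once per MFCC frame."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.w_in = rand_matrix(N_HIDDEN, N_MFCC, rng)      # input -> hidden
        self.w_rec = rand_matrix(N_HIDDEN, N_HIDDEN, rng)   # hidden -> hidden
        self.w_out = rand_matrix(N_CLASSES, N_HIDDEN, rng)  # hidden -> classes

    def classify(self, frames):
        """Map a sequence of MFCC frames (any length >= 1) to class probabilities."""
        h = [0.0] * N_HIDDEN
        for frame in frames:
            a = matvec(self.w_in, frame)
            b = matvec(self.w_rec, h)
            h = [math.tanh(x + y) for x, y in zip(a, b)]
        # Only the final hidden state is used, so the output size is fixed.
        return softmax(matvec(self.w_out, h))


def fake_utterance(n_frames, rng):
    """Random stand-in for real MFCC frames; replace with actual features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(N_MFCC)] for _ in range(n_frames)]


rng = random.Random(1)
model = TinyRNN()
# Assuming a 10 ms frame step: about 1 s and about 2 s of speech.
fast = model.classify(fake_utterance(100, rng))
slow = model.classify(fake_utterance(200, rng))
print(len(fast), len(slow))  # both are 5-element probability vectors
```

A 100-frame and a 200-frame utterance both produce a 5-element probability vector from the same weights. In practice the weights would be learned with backpropagation through time on labelled recordings, and a gated variant such as an LSTM is usually easier to train on sequences this long.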

Speech recognition is not simple to implement, so it is better to use existing software such as CMUSphinx.

Source: https://stackoverflow.com/questions/21645082/applying-neural-network-to-mfccs-for-variable-length-speech-segments

Tags

matlab

neural-network

speech-recognition

mfcc
