Question
I'm currently developing a speech recognition project and I'm trying to select the most meaningful features. Most of the relevant papers suggest using zero-crossing rate, F0, and MFCC features, so I'm using those. A training sample with a duration of 00:03 has 268 features. Since this is a multi-class classification project with 50+ training samples per class, including all of the MFCC features may expose the project to the curse of dimensionality or 'reduce the importance' of the other features. So my question is: should I include all MFCC features, and if not, can you suggest an alternative?
Answer 1:
You should not use F0 and zero crossing; they are too unstable. You can simply increase your training data and use MFCCs, which have good representation capabilities. But remember to mean-normalize them.
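As a rough illustration of the mean-normalization step, here is a minimal sketch that subtracts the per-coefficient mean over all frames of one utterance (often called cepstral mean normalization). The function name `mean_normalize` and the toy values are assumptions for illustration only; real MFCCs would come from whatever feature extractor you already use.

```python
def mean_normalize(frames):
    """Subtract the per-coefficient mean computed over all frames.

    `frames` is a list of frames, each a list of coefficients.
    This is a hypothetical helper, not part of any library.
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Mean of each coefficient across the whole utterance.
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


# Toy input: 3 frames x 2 coefficients (made-up values, not real MFCCs).
mfcc = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = mean_normalize(mfcc)
```

After this step each coefficient has zero mean across the utterance, which removes a constant offset such as a fixed channel or microphone effect.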
Answer 2:
After computing the MFCC coefficients for each frame, you can represent the MFCC features as the combination of:
1) First 12 MFCCs
2) 1 energy feature
3) 12 delta MFCC features
4) 12 double-delta MFCC features
5) 1 delta energy feature
6) 1 double-delta energy feature
The concept of the delta MFCC feature is described in this link.
The resulting 39-dimensional MFCC feature vector is fed into an HMM or a recurrent neural network.
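The 39-dimensional layout above can be sketched as follows. This assumes the commonly used regression formula for deltas, d_t = sum_n n (c_{t+n} - c_{t-n}) / (2 sum_n n^2) with window N = 2 and edge frames repeated; the linked description may define it differently. The names `delta` and `stack_39` and the toy input are hypothetical.

```python
def delta(frames, N=2):
    """Regression-based delta over a window of +/- N frames.

    Edge frames are repeated at the boundaries (an assumption).
    """
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(n_coeffs):
            acc = 0.0
            for n in range(1, N + 1):
                ahead = frames[min(t + n, n_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_39(static):
    """Concatenate 13 static values with their deltas and double-deltas."""
    d1 = delta(static)   # delta features
    d2 = delta(d1)       # double-delta features
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input: 5 frames, 13 static values each (12 MFCCs + 1 energy),
# filled with a linear ramp rather than real speech features.
static = [[float(t)] * 13 for t in range(5)]
features = stack_39(static)
```

Note that this sketch orders each frame as 13 static, 13 delta, 13 double-delta values, so the energy terms sit next to their MFCC counterparts; the 39 values are the same as in the list, only grouped differently.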
Answer 3:
The point I'd like to make is that MFCCs are not required. You can use MFCCs, and you can use the energy, delta, and delta-delta features, as mentioned by @Mahendra Thapa, but it is not "required". Some researchers use 40 CCs; some drop the DCT from the MFCC calculation, making them MFSCs (spectral, not cepstral). Some add extra features. Some use fewer. Susceptibility to the curse of dimensionality depends on your classifier, doesn't it? Some have recently even claimed progress towards the "holy grail" of speech recognition: training on the raw signal using deep learning, learning the best features rather than hand-crafting them.
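To make the MFSC versus MFCC distinction concrete, here is a small sketch assuming the mel filterbank energies for one frame are already available. The log filterbank energies are the spectral features; applying a DCT (here an unnormalized DCT-II) to them gives the cepstral ones. The input values and the helper name `dct_ii` are made up for illustration.

```python
import math


def dct_ii(x):
    """Unnormalized DCT-II of a sequence (hypothetical helper)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
        for k in range(n)
    ]


# Made-up mel filterbank energies for a single frame (4 bands only).
mel_energies = [2.0, 2.0, 2.0, 2.0]

# MFSC-style features: stop at the log filterbank energies.
mfsc = [math.log(e) for e in mel_energies]

# MFCC-style features: additionally apply the DCT.
mfcc = dct_ii(mfsc)
```

With a flat spectrum like this toy input, all of the information ends up in the first cepstral coefficient, which hints at why the DCT is useful for compacting and decorrelating features, and why some systems that learn their own representations skip it.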
Answer 4:
MFCCs are widely used, and they tend to give relatively good results.
Source: https://stackoverflow.com/questions/38833661/are-mfcc-features-required-for-speech-recognition