What is the second number in the MFCCs array?

二次信任 submitted on 2020-11-29 10:18:04

Question


When I extract MFCCs from an audio file, the output has shape (13, 22). What does the second number represent? Is it the number of time frames? I use librosa.

The code I use is:

mfccs = librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13, hop_length=256)
mfccs


print(mfccs.shape)

And the output is (13, 22).


Answer 1:


Yes, the second number is the number of time frames. It depends mainly on how many samples you provide via y and which hop_length you choose.

Example

Say you have 10s of audio sampled at 44.1 kHz (CD quality). When you load it with librosa, it is resampled to 22,050 Hz (the librosa default) and downmixed to one channel (mono). When you then compute something like an STFT, a mel spectrogram, or MFCCs, so-called feature frames are computed.

The question is, how many (feature) frames do you get for your 10s of audio?

The deciding parameter for this is the hop_length. For all the mentioned functions, librosa slides a window of a certain length (typically n_fft) over the 1-D audio signal, i.e., it looks at one short segment (or frame) at a time, computes features for this segment, and moves on to the next segment. These segments usually overlap. The distance between the starts of two consecutive segments is hop_length, and it is specified in number of samples. It may be identical to n_fft, but often hop_length is half or even just a quarter of n_fft. It allows you to control the temporal resolution of your features (the spectral resolution is controlled by n_fft or n_mfcc, depending on what you are actually computing).
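To make the sliding-window idea concrete, here is a minimal sketch in plain Python. The helper frame_starts is hypothetical (it is not a librosa function) and it assumes no padding at all, so only windows that fit entirely inside the signal are counted.

```python
def frame_starts(num_samples, frame_length, hop_length):
    """Return the start index of every full frame that fits in the signal.

    Hypothetical helper for illustration only; no padding is applied.
    """
    if num_samples < frame_length:
        return []
    # Each frame starts hop_length samples after the previous one.
    return list(range(0, num_samples - frame_length + 1, hop_length))


# Toy signal: 10 samples, window of 4 samples, hop of 2 samples.
starts = frame_starts(num_samples=10, frame_length=4, hop_length=2)
print(starts)       # [0, 2, 4, 6] -> frames cover 0-3, 2-5, 4-7, 6-9
print(len(starts))  # 4
```

With a hop of half the window length, every sample in the interior of the signal is covered by two frames, which is the overlap described above.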

10s of audio at 44.1 kHz are 441000 samples. But remember, librosa by default resamples to 22050 Hz, so it's actually only 220500 samples. How many times can we move a segment of some length over these 220500 samples, if we move it by 256 samples in each step? The precise number depends on how long the segment is. But let's ignore that for a second and assume that when we hit the end, we simply zero-pad the input so that we can still compute frames for as long as there is at least some input. Then the computation becomes trivial:

number_of_samples / hop_length = number_of_frames

So for our example, this would be:

220500 / 256 = 861.3

So we get about 861 frames.
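The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the rounded-up estimate follows the zero-padding simplification used here, and the second formula assumes centered framing (center=True, which is the default in librosa's STFT-based features), where the frame count is 1 + number_of_samples // hop_length. Both land one frame above the rounded-down figure of 861.

```python
import math

sample_rate = 22050   # librosa's default sampling rate after loading
duration_s = 10       # length of the example audio in seconds
hop_length = 256      # hop used in the question

num_samples = sample_rate * duration_s

# Simplified estimate: zero-pad the end so every hop position yields a frame.
frames_estimate = math.ceil(num_samples / hop_length)

# Assumption: centered framing (center=True) gives 1 + n // hop_length frames.
frames_centered = 1 + num_samples // hop_length

print(num_samples)      # 220500
print(frames_estimate)  # 862
print(frames_centered)  # 862
```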

Note that you can make this computation even easier by computing the so-called frame_rate. That's frames per second in Hz. It's:

frame_rate = sample_rate / hop_length = 86.13

To get the number of frames for your input, simply multiply frame_rate by the length of your audio in seconds and you're set (ignoring padding).

frames = frame_rate * audio_in_seconds
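The same relation can be inverted to estimate how long the audio in the question is. The sketch below uses two hypothetical helpers (frame_rate and frames_to_seconds are not librosa functions) and assumes the question's sample_rate is librosa's default of 22050 Hz, which the question does not state.

```python
def frame_rate(sample_rate, hop_length):
    """Frames per second for a given sample rate and hop (hypothetical helper)."""
    return sample_rate / hop_length


def frames_to_seconds(num_frames, sample_rate, hop_length):
    """Approximate duration covered by num_frames frames, ignoring padding."""
    return num_frames * hop_length / sample_rate


rate = frame_rate(22050, 256)
print(round(rate, 2))   # 86.13 frames per second
print(round(rate * 10)) # 861 frames for 10 s of audio

# Assumption: the question's audio was loaded at 22050 Hz.
print(round(frames_to_seconds(22, 22050, 256), 3))  # 0.255 seconds
```

Under that assumption, the 22 frames in the (13, 22) output correspond to roughly a quarter of a second of audio.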


Source: https://stackoverflow.com/questions/62727244/what-is-the-second-number-in-the-mfccs-array
