When I extract MFCCs from an audio file, the output is (13, 22). What does the number represent? Is it time frames? I use librosa.
The code I use is:
Yes, it is time frames, and it mainly depends on how many samples you provide via y and what hop_length you choose.
Say you have 10s of audio sampled at 44.1 kHz (CD quality). When you load it with librosa, it gets resampled to 22,050 Hz (that's the librosa default) and downmixed to one channel (mono). When you then run something like a STFT, melspectrogram, or MFCC, so-called feature frames are computed.
The question is, how many (feature) frames do you get for your 10s of audio?
The deciding parameter for this is the hop_length. For all the mentioned functions, librosa slides a window of a certain length (typically n_fft) over the 1D audio signal, i.e., it looks at one shorter segment (or frame) at a time, computes features for this segment, and moves on to the next segment. These segments usually overlap. The distance between the starts of two consecutive segments is hop_length, and it is specified in number of samples. It may be identical to n_fft, but oftentimes hop_length is half or even just a quarter of n_fft. It allows you to control the temporal resolution of your features (the spectral resolution is controlled by n_fft or n_mfcc, depending on what you are actually computing).
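To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is not librosa's implementation: the helper name frame_starts is made up for illustration, and it assumes no padding at all, so only frames that fit completely inside the signal are produced.

```python
def frame_starts(num_samples, frame_length, hop_length):
    """Return the start index of every full frame (hypothetical helper, no padding)."""
    if num_samples < frame_length:
        return []
    # Step through the signal hop_length samples at a time and stop
    # once a full frame of frame_length samples no longer fits.
    return list(range(0, num_samples - frame_length + 1, hop_length))


# Toy "signal" of 10 samples, frame length 4, hop length 2 (50% overlap).
signal = list(range(10))
starts = frame_starts(len(signal), frame_length=4, hop_length=2)
frames = [signal[s:s + 4] for s in starts]

print(starts)  # [0, 2, 4, 6]
print(frames)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each frame shares two samples with its neighbour because hop_length is half of the frame length. A smaller hop_length gives more frames for the same signal, i.e., a finer temporal resolution.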
10s of audio at 44.1 kHz are 441000 samples. But remember, librosa by default resamples to 22050 Hz, so it's actually only 220500 samples. How many times can we move a segment of some length over these 220500 samples, if we move it by 256 samples in each step (that is, with a hop_length of 256)? The precise number depends on how long the segment is. But let's ignore that for a second and assume that when we hit the end, we simply zero-pad the input so that we can still compute frames for as long as there is at least some input. Then the computation becomes trivial:
number_of_samples / hop_length = number_of_frames
So for our example, this would be:
220500 / 256 = 861.3
So we get about 861 frames.
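If you want an exact integer instead of "about 861", the answer depends on the padding convention. The sketch below compares two conventions in plain Python. Both function names are made up for illustration. The second one encodes my understanding of librosa's default (center=True), where the signal is padded on both sides so that the frame count comes out as 1 + number_of_samples // hop_length; treat that as an assumption and check it against the shape you actually get.

```python
import math


def frames_zero_padded(num_samples, hop_length):
    """Simplified model from the text: a frame starts at every hop
    as long as at least some input is left (hypothetical helper)."""
    return math.ceil(num_samples / hop_length)


def frames_centered(num_samples, hop_length):
    """Assumed behaviour with center=True: 1 + n // hop (hypothetical helper)."""
    return 1 + num_samples // hop_length


num_samples = 10 * 22050  # 10 s at librosa's default 22050 Hz
hop_length = 256

print(num_samples / hop_length)                     # 861.328125
print(frames_zero_padded(num_samples, hop_length))  # 862
print(frames_centered(num_samples, hop_length))     # 862
```

Either way the result is within one frame of the simple ratio, which is why the rule of thumb is usually good enough.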
Note that you can make this computation even easier by computing the so-called frame_rate. That's frames per second in Hz. It's:
frame_rate = sample_rate / hop_length = 86.13
To get the number of frames for your input, simply multiply frame_rate by the length of your audio in seconds and you're set (ignoring padding).
frames = frame_rate * audio_in_seconds
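The same arithmetic as a small Python sketch, in both directions. The function names are made up for illustration. The last call assumes librosa's default hop_length of 512, since the code from the question is not shown; if a different hop_length was used, the estimate changes accordingly.

```python
def frame_rate(sample_rate, hop_length):
    """Frames per second (hypothetical helper)."""
    return sample_rate / hop_length


def approx_frames(sample_rate, hop_length, seconds):
    """Approximate number of frames for a given duration, ignoring padding."""
    return frame_rate(sample_rate, hop_length) * seconds


def approx_seconds(sample_rate, hop_length, num_frames):
    """Approximate duration for a given number of frames, ignoring padding."""
    return num_frames / frame_rate(sample_rate, hop_length)


print(round(frame_rate(22050, 256), 2))         # 86.13
print(round(approx_frames(22050, 256, 10), 1))  # 861.3

# Going backwards from the 22 frames in the question,
# ASSUMING the default hop_length of 512 was used:
print(round(approx_seconds(22050, 512, 22), 2))  # 0.51
```

Under that assumption, an output of shape (13, 22) would mean 13 MFCC coefficients for each of 22 time frames, covering roughly half a second of audio.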