What is the second number in the MFCCs array?

南笙 2021-01-25 08:04

When I extract MFCCs from an audio file, the output has shape (13, 22). What does the second number represent? Is it the number of time frames? I use librosa.

The code I use is:



        
1 Answer
  • 2021-01-25 08:15

    Yes, the second number is the number of time frames. It depends mainly on how many samples you provide via y and which hop_length you choose.

    Example

    Say you have 10 s of audio sampled at 44.1 kHz (CD quality). When you load it with librosa, it gets resampled to 22,050 Hz (that's the librosa default) and downmixed to one channel (mono). When you then run something like an STFT, melspectrogram, or MFCC, so-called feature frames are computed.

    The question is: how many (feature) frames do you get for your 10 s of audio?

    The deciding parameter for this is hop_length. For all the functions mentioned, librosa slides a window of a certain length (typically n_fft) over the 1-D audio signal, i.e., it looks at one shorter segment (or frame) at a time, computes features for this segment, and moves on to the next segment. These segments usually overlap. The distance between the starts of two consecutive segments is hop_length, and it is specified in number of samples. It may be identical to n_fft, but often hop_length is half or even just a quarter of n_fft. It allows you to control the temporal resolution of your features (the spectral resolution is controlled by n_fft or n_mfcc, depending on what you are actually computing).
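To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is not how librosa implements framing internally: `frame_signal` is a hypothetical helper, and the tiny 10-sample signal with `frame_length=4` and `hop_length=2` is chosen purely for illustration. No padding is applied here, so only complete windows are kept.

```python
def frame_signal(signal, frame_length, hop_length):
    """Split a 1-D sequence into overlapping frames (no padding).

    Hypothetical helper for illustration only, not a librosa function.
    """
    frames = []
    # Each frame starts hop_length samples after the previous one.
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        frames.append(signal[start:start + frame_length])
    return frames


signal = list(range(10))  # stand-in for 10 audio samples
frames = frame_signal(signal, frame_length=4, hop_length=2)

for i, frame in enumerate(frames):
    print(i, frame)
```

With hop_length equal to half of the frame length, consecutive frames overlap by 50%. A smaller hop_length yields more frames for the same signal, which is exactly the temporal-resolution trade-off described above.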

    10 s of audio at 44.1 kHz is 441000 samples. But remember, librosa by default resamples to 22050 Hz, so it's actually only 220500 samples. How many times can we move a segment of some length over these 220500 samples if we move it by, say, 256 samples in each step (i.e., hop_length = 256)? The precise number depends on how long the segment is. But let's ignore that for a second and assume that when we hit the end, we simply zero-pad the input so that we can still compute frames for as long as there is at least some input. Then the computation becomes trivial:

    number_of_samples / hop_length = number_of_frames
    

    So for our example, this would be:

    220500 / 256 = 861.3
    

    So we get about 861 frames.
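The arithmetic above can be checked with a few lines of Python. This is a sketch under stated assumptions: `estimate_num_frames` is a hypothetical helper that rounds up, matching the "zero-pad at the end" simplification. For comparison, `centered_num_frames` uses `1 + num_samples // hop_length`, which to my knowledge is the frame count librosa produces with its default `center=True`; treat that as an assumption and verify it against the shape of your own output.

```python
import math


def estimate_num_frames(num_samples, hop_length):
    """Frame count if the last partial hop is zero-padded (hypothetical helper)."""
    return math.ceil(num_samples / hop_length)


def centered_num_frames(num_samples, hop_length):
    """Frame count assumed for centered framing: 1 + n // hop (assumption)."""
    return 1 + num_samples // hop_length


sample_rate = 22050   # librosa's default sampling rate after loading
duration_s = 10       # length of the example clip in seconds
hop_length = 256      # hop used in the example above

num_samples = sample_rate * duration_s
print(num_samples)                                    # 220500
print(num_samples / hop_length)                       # 861.328125
print(estimate_num_frames(num_samples, hop_length))   # 862
print(centered_num_frames(num_samples, hop_length))   # 862
```

Applied to the question: the first number, 13, is the number of MFCC coefficients, and the second, 22, is the number of frames. If the MFCCs were computed with a hop_length of 512 (an assumption, since the original code is not shown), 22 frames correspond to roughly half a second of audio at 22,050 Hz.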

    Note that you can make this computation even easier by computing the so-called frame_rate, i.e., the number of frames per second, in Hz:

    frame_rate = sample_rate / hop_length = 86.13
    

    To get the number of frames for your input, simply multiply frame_rate by the length of your audio in seconds and you're set (ignoring padding).

    frames = frame_rate * audio_in_seconds
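The same relation can be written as a small sketch in Python. The helper names below (`frame_rate`, `frames_for_duration`, `frame_to_seconds`) are hypothetical and exist only for illustration; the last one simply inverts the formula to map a frame index back to a start time, again ignoring padding and centering.

```python
def frame_rate(sample_rate, hop_length):
    """Frames per second (Hz) for a given sample rate and hop length."""
    return sample_rate / hop_length


def frames_for_duration(duration_s, sample_rate, hop_length):
    """Approximate number of frames for a clip, ignoring padding."""
    return frame_rate(sample_rate, hop_length) * duration_s


def frame_to_seconds(frame_index, sample_rate, hop_length):
    """Start time in seconds of a frame, ignoring padding and centering."""
    return frame_index * hop_length / sample_rate


rate = frame_rate(22050, 256)
print(round(rate, 2))                                   # 86.13
print(round(frames_for_duration(10, 22050, 256), 1))    # 861.3
print(frame_to_seconds(861, 22050, 256))                # just under 10 seconds
```

Working in frames per second makes it easy to compare settings: halving hop_length doubles the frame rate, and therefore doubles the second dimension of the MFCC array for the same audio.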
    