I have several audio files with different durations, so I don't know how to ensure the same number N of segments per audio. I'm trying to implement an existing paper, so it…
Loop over the frames along the time axis, moving forward 30 frames at a time and extracting a window of the last 64 frames. At the start and end you need to either truncate or zero-pad the data to get full windows.
import librosa
import numpy as np
import math
audio_file = librosa.util.example_audio_file()
y, sr = librosa.load(audio_file, sr=None, duration=5.0) # only load 5 seconds
n_mels = 64
n_fft = int(np.ceil(0.025*sr))       # 25 ms FFT size
win_length = int(np.ceil(0.025*sr))  # 25 ms analysis window
hop_length = int(np.ceil(0.010*sr))  # 10 ms hop between frames
window = 'hamming'
fmin = 20
fmax = 8000
S = librosa.core.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=False)
# pass the power spectrogram to melspectrogram, not the complex STFT
frames = np.log(librosa.feature.melspectrogram(sr=sr, S=np.abs(S)**2, n_mels=n_mels, fmin=fmin, fmax=fmax) + 1e-6)
window_size = 64  # number of frames in each window
window_hop = 30   # number of frames to advance between windows
# truncate at the start and end so that every window contains only real data
# (the alternative would be to zero-pad, see the sketch further below)
start_frame = window_size
end_frame = window_hop * math.floor(float(frames.shape[1]) / window_hop)
for frame_idx in range(start_frame, end_frame, window_hop):
    win = frames[:, frame_idx-window_size:frame_idx]
    assert win.shape == (n_mels, window_size)
    print('classify window', frame_idx, win.shape)
will output
classify window 64 (64, 64)
classify window 94 (64, 64)
classify window 124 (64, 64)
...
classify window 454 (64, 64)
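If you prefer the zero-padding alternative mentioned in the comments above, one possible sketch (not from the paper) is to pad the log-mel frames at the end so that every hop position yields a full window:

n_frames = frames.shape[1]
# number of windows needed to cover all frames when stepping by window_hop
n_windows = math.ceil(max(n_frames - window_size, 0) / window_hop) + 1
padded_len = (n_windows - 1) * window_hop + window_size
# zero-padding means "log-energy 0" here; padding with np.log(1e-6) would
# represent silence instead
frames_padded = np.pad(frames, ((0, 0), (0, padded_len - n_frames)), mode='constant')
for i in range(n_windows):
    start = i * window_hop
    win = frames_padded[:, start:start + window_size]
    assert win.shape == (n_mels, window_size)
    print('classify window', i, win.shape)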
However, the number of windows will depend on the length of the audio sample. So if it is important to always get the same number of windows, you need to make sure all audio samples are the same length.
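For example (a minimal sketch; the 5-second target duration is just an assumed value, not something from the paper), you can pad or trim the raw audio to a fixed length before computing the spectrogram:

target_duration = 5.0                        # assumed fixed duration in seconds
target_len = int(target_duration * sr)
if len(y) < target_len:
    y = np.pad(y, (0, target_len - len(y)))  # zero-pad short clips at the end
else:
    y = y[:target_len]                       # truncate long clips
# now every clip produces the same number of STFT frames, and hence the same number of windows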