I have several audio files with different durations, so I don't know how to ensure the same number N of segments per audio. I'm trying to implement an existing paper, so it…
Loop over the frames along the time axis, moving forward 30 frames at a time and extracting a window of the last 64 frames. At the start and end you need to either truncate or zero-pad the data to get full windows.
import librosa
import numpy as np
import math
audio_file = librosa.util.example_audio_file()
y, sr = librosa.load(audio_file, sr=None, duration=5.0) # only load 5 seconds
n_mels = 64
n_fft = int(np.ceil(0.025*sr))       # 25 ms FFT size
win_length = int(np.ceil(0.025*sr))  # 25 ms analysis window
hop_length = int(np.ceil(0.010*sr))  # 10 ms hop between frames
window = 'hamming'
fmin = 20
fmax = 8000
S = librosa.core.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=False)
# pass the power spectrogram to melspectrogram, not the complex STFT
frames = np.log(librosa.feature.melspectrogram(sr=sr, S=np.abs(S)**2, n_mels=n_mels, fmin=fmin, fmax=fmax) + 1e-6)
window_size = 64  # number of frames in each window
window_hop = 30   # number of frames to advance between windows
# truncate at the start and end so that every window contains only real data
# (the alternative would be to zero-pad, see the sketch further below)
start_frame = window_size
end_frame = window_hop * math.floor(float(frames.shape[1]) / window_hop)
for frame_idx in range(start_frame, end_frame, window_hop):
    win = frames[:, frame_idx-window_size:frame_idx]
    assert win.shape == (n_mels, window_size)
    print('classify window', frame_idx, win.shape)
will output
classify window 64 (64, 64)
classify window 94 (64, 64)
classify window 124 (64, 64)
...
classify window 454 (64, 64)
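If you prefer the zero-padding alternative mentioned in the comments above, one possible sketch (not from the paper) is to pad the log-mel frames at the end so that every hop position yields a full window:

n_frames = frames.shape[1]
# number of windows needed to cover all frames when stepping by window_hop
n_windows = math.ceil(max(n_frames - window_size, 0) / window_hop) + 1
padded_len = (n_windows - 1) * window_hop + window_size
# zero-padding means "log-energy 0" here; padding with np.log(1e-6) would
# represent silence instead
frames_padded = np.pad(frames, ((0, 0), (0, padded_len - n_frames)), mode='constant')
for i in range(n_windows):
    start = i * window_hop
    win = frames_padded[:, start:start + window_size]
    assert win.shape == (n_mels, window_size)
    print('classify window', i, win.shape)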
However, the number of windows will depend on the length of the audio sample. So if it is important to always get the same number of windows, you need to make sure all audio samples are the same length.
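For example (a minimal sketch; the 5-second target duration is just an assumed value, not something from the paper), you can pad or trim the raw audio to a fixed length before computing the spectrogram:

target_duration = 5.0                        # assumed fixed duration in seconds
target_len = int(target_duration * sr)
if len(y) < target_len:
    y = np.pad(y, (0, target_len - len(y)))  # zero-pad short clips at the end
else:
    y = y[:target_len]                       # truncate long clips
# now every clip produces the same number of STFT frames, and hence the same number of windows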