My input data is speech, so it contains plenty of spoken words. However, when I use librosa, the output audio time series comes out as [ 0 0 0
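
One way to check whether the returned array really is all zeros, or only looks that way when printed, is sketched below. NumPy abbreviates long arrays and shows only the first and last few samples, so leading and trailing silence can hide the speech in between. This is a minimal sketch under assumptions: the waveform is synthesized (silence, a tone, silence) as a stand-in for the real recording, and `summarize_audio` is a hypothetical helper, not part of librosa.

```python
import numpy as np


def summarize_audio(y):
    """Return basic statistics showing whether a waveform is truly silent."""
    y = np.asarray(y, dtype=float)
    nonzero = np.flatnonzero(y)  # indices of samples that are not exactly 0
    return {
        "num_samples": int(y.size),
        "num_nonzero": int(nonzero.size),
        "peak": float(np.max(np.abs(y))) if y.size else 0.0,
        "first_nonzero": int(nonzero[0]) if nonzero.size else None,
    }


# Stand-in for the real waveform (assumption). With real data, y would come
# from something like `y, sr = librosa.load(path)`, where `path` is hypothetical.
sr = 22050
silence = np.zeros(sr)                      # one second of silence
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # one second of a 220 Hz tone
y = np.concatenate([silence, tone, silence])

print(y)      # abbreviated: only the first and last few samples are shown
stats = summarize_audio(y)
print(stats)  # the statistics cover every sample, not just the edges
```

If `peak` is greater than zero, the audio is not empty and the zeros in the printout are most likely just silence at the start and end of the recording. If `peak` is exactly zero, the problem is more likely in the file itself or in how it is being loaded.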