So here's the idea: you can generate a spectrogram from an audio file using the short-time Fourier transform (STFT). Then some people have generated something called a "binary mask" to generate different audio (i.e. with background noise removed, etc.) from the inverse STFT.
Here's what I understand:
The STFT is a simple equation that is applied to the audio file, which generates the information that can easily be displayed as a spectrogram. By taking the inverse of the STFT matrix and multiplying it by a matrix of the same size (the binary matrix), you can create a new matrix with information to generate an audio file with the masked sound.
Once I do the matrix multiplication, how is the new audio file created?
It's not much but here's what I've got in terms of code:
from librosa import load
from librosa.core import stft, istft
y, sample_rate = load('1.wav')
spectrum = stft(y)
back_y = istft(spectrum)
Thank you, and here are some slides that got me this far. I'd appreciate it if you could give me an example/demo in Python.
Librosa's STFT is full-featured, so unless you're very careful with how you manipulate the spectrum, you won't get a sensible output from its `istft`.
Here's a pair of functions, `stft` and `istft`, that I wrote from scratch that represent the forward and inverse STFT, along with a helper method that gives you the time and frequency locations of each pixel in the STFT array, plus a demo:
import numpy as np
import numpy.fft as fft

def stft(x, Nwin, Nfft=None):
    """
    Short-time Fourier transform: convert a 1D vector to a 2D array

    The short-time Fourier transform (STFT) breaks a long vector into disjoint
    chunks (no overlap) and runs an FFT (Fast Fourier Transform) on each chunk.
    The resulting 2D array can be converted back to samples with `istft`.

    Parameters
    ----------
    x : array_like
        Input signal (expected to be real)
    Nwin : int
        Length of each window (chunk of the signal). Should be ≪ `len(x)`.
    Nfft : int, optional
        Zero-pad each chunk to this length before FFT. Should be ≥ `Nwin`,
        (usually with small prime factors, for fastest FFT). Default: `Nwin`.

    Returns
    -------
    out : complex ndarray
        `len(x) // Nwin` by `Nfft // 2 + 1` complex array representing the
        STFT of `x` (`rfft` keeps only the non-negative frequencies).

    See also
    --------
    istft : inverse function (convert a STFT array back to a data vector)
    stftbins : time and frequency bins corresponding to `out`
    """
    x = np.asarray(x)  # accept lists as well as arrays
    Nfft = Nfft or Nwin
    Nwindows = x.size // Nwin
    # reshape into array `Nwin` wide, and as tall as possible. This is
    # optimized for C-order (row-major) layouts.
    arr = np.reshape(x[:Nwindows * Nwin], (-1, Nwin))
    stft = fft.rfft(arr, Nfft)
    return stft

def stftbins(x, Nwin, Nfft=None, d=1.0):
    """
    Time and frequency bins corresponding to short-time Fourier transform.

    Call this with the same arguments as `stft`, plus one extra argument: `d`,
    the sample spacing, to get the time and frequency axes that the output of
    `stft` corresponds to.

    Parameters
    ----------
    x : array_like
        same as `stft`
    Nwin : int
        same as `stft`
    Nfft : int, optional
        same as `stft`
    d : float, optional
        Sample spacing of `x` (or 1 / sample frequency), units of seconds.
        Default: 1.0.

    Returns
    -------
    t : ndarray
        Array of length `len(x) // Nwin`, in units of seconds, corresponding to
        the first dimension (height) of the output of `stft`.
    f : ndarray
        Array of length `Nfft // 2 + 1`, in units of Hertz, corresponding to
        the second dimension (width) of the output of `stft`.
    """
    x = np.asarray(x)  # accept lists as well as arrays
    Nfft = Nfft or Nwin
    Nwindows = x.size // Nwin
    t = np.arange(Nwindows) * (Nwin * d)
    f = fft.rfftfreq(Nfft, d)
    return t, f

def istft(stftArr, Nwin):
    """
    Inverse short-time Fourier transform (ISTFT)

    Given an array representing the output of `stft`, convert it back to the
    original samples.

    Parameters
    ----------
    stftArr : ndarray
        Output of `stft` (or something the same size)
    Nwin : int
        Same input as `stft`: length of each chunk that the STFT was calculated
        over.

    Returns
    -------
    y : ndarray
        Data samples corresponding to STFT data.

    See also
    --------
    stft : the forward transform
    """
    # Note: without an explicit length, `irfft` assumes the FFT length was
    # even, so this round trip is exact only for an even `Nfft` (or `Nwin`).
    arr = fft.irfft(stftArr)[:, :Nwin]
    return np.reshape(arr, -1)

if __name__ == '__main__':
    sampleRate = 100.0  # Hertz
    N = 1024
    Nwin = 64

    # Generate a chirp: start frequency at 5 Hz and going down at 2 Hz/s
    time = np.arange(N) / sampleRate  # seconds
    x = np.cos(2 * np.pi * time * (5 - 2 * 0.5 * time))

    # Test with Nfft bigger than Nwin
    Nfft = Nwin * 2
    s = stft(x, Nwin, Nfft=Nfft)
    y = istft(s, Nwin)

    # Make sure the stft and istft are inverses. Caveat: `x` and `y` won't be
    # the same length if `N/Nwin` isn't integral!
    maxerr = np.max(np.abs(x - y))
    assert (maxerr < np.spacing(1) * 10)

    # Test `stftbins`
    t, f = stftbins(x, Nwin, Nfft=Nfft, d=1 / sampleRate)
    assert (len(t) == s.shape[0])
    assert (len(f) == s.shape[1])

    try:
        import pylab as plt
        plt.imshow(np.abs(s), aspect="auto", extent=[f[0], f[-1], t[-1], t[0]])
        plt.xlabel('frequency (Hertz)')
        plt.ylabel('time (seconds (start of chunk))')
        plt.title('STFT with chirp example')
        plt.show()
    except ModuleNotFoundError:
        pass
This is in a gist if that's easier for you to read.
The entire module assumes real-only data and uses NumPy's `rfft` functions. You can definitely generalize this to complex data (or use librosa), but for your application (audio masking), using the real-only transforms makes it easier to ensure that everything works out and the output of the inverse STFT is real-only (it's easy to mess this up if you're doing the fully-general complex STFT, where you need to be careful in maintaining symmetries).
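To make the symmetry point concrete, here is a minimal sketch (my own illustration, not part of the module above) that applies the same arbitrary edit to one chunk with both transforms. The random test chunk and the choice of bins to zero are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
chunk = rng.standard_normal(64)          # one real-valued chunk of signal

# Real-only transform: only the non-negative frequencies are stored.
half = np.fft.rfft(chunk)                # 64 // 2 + 1 == 33 bins
half[10:] = 0                            # arbitrary edit: zero bins 10 and up
real_out = np.fft.irfft(half, n=chunk.size)

# Full complex transform: negative frequencies are stored too, so an edit
# that touches only the positive side breaks the conjugate symmetry.
full = np.fft.fft(chunk)                 # 64 bins
full[10:33] = 0                          # same edit, negative side forgotten
complex_out = np.fft.ifft(full)

# Zeroing the mirrored negative-frequency bins as well restores the symmetry.
full[10:55] = 0
fixed_out = np.fft.ifft(full)

print(half.shape, full.shape)
print(np.isrealobj(real_out))                 # irfft always returns real data
print(np.max(np.abs(complex_out.imag)))       # clearly non-zero
print(np.max(np.abs(fixed_out.imag)))         # back to rounding error
```

With `rfft`/`irfft` there is no mirrored half to keep in sync, which is why any mask applied to the output of `stft` above still yields real samples.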
The demo first generates some test data and confirms that the `istft` on the `stft` of the data produces the data again. The test data is a chirp that starts at 5 Hz and goes down at 2 Hz per second, so over ~10 seconds of data, the chirp's frequency wraps around (it passes through 0 Hz) and ends up at around 15 Hz. The demo plots the STFT (by taking the absolute value of the STFT array).
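If the "wraps around" part is unclear, the chirp's instantaneous frequency can be worked out from the phase of the cosine in the demo. The sketch below is only a sanity check of that arithmetic; the variable names are mine.

```python
import numpy as np

sample_rate = 100.0                       # Hertz, as in the demo
t = np.arange(1024) / sample_rate         # seconds, 0 to 10.23

# The demo's signal is cos(phase) with this phase (in radians).
phase = 2 * np.pi * t * (5 - 2 * 0.5 * t)

# Instantaneous frequency = d(phase)/dt / (2*pi) = 5 - 2*t, in Hertz.
analytic = 5 - 2 * t
numeric = np.gradient(phase, t) / (2 * np.pi)   # numerical cross-check

print(analytic[0], analytic[-1])          # start and end frequency
print(t[np.argmin(np.abs(analytic))])     # time at which it crosses 0 Hz
```

The frequency starts at 5 Hz, reaches 0 Hz at 2.5 seconds, and ends near -15.5 Hz. A real signal cannot distinguish negative from positive frequencies, so in the plot the chirp appears to bounce off 0 Hz and climb to about 15 Hz.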
So

- put this code in a `stft.py` file,
- import it as `import stft`,
- compute an STFT as `spectrum = stft.stft(y, 128)`,
- visualize your spectrum as shown in the demo (be sure to prepend `stft.` to functions defined in `stft.py`!),
- pick what frequencies you want to attenuate/amplify and apply those effects on the `spectrum` array, before
- finally getting the processed audio via `back_y = stft.istft(spectrum, 128)`.
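As for how the new audio file gets created: the inverse STFT gives back plain samples, and those only need to be written to disk. The steps above might be sketched end to end as follows. To stay self-contained, this uses compact copies of `stft` and `istft` (without zero-padding) instead of importing `stft.py`, a synthetic tone in place of the loaded file, and a hypothetical `write_wav` helper built on Python's standard `wave` module. The sample rate, the effect, and the output path are all assumptions.

```python
import os
import tempfile
import wave

import numpy as np


def stft(x, Nwin):
    """Compact copy of the answer's `stft` (no zero-padding)."""
    n = x.size // Nwin
    return np.fft.rfft(np.reshape(x[:n * Nwin], (-1, Nwin)))


def istft(stft_arr, Nwin):
    """Compact copy of the answer's `istft` (assumes Nfft == Nwin)."""
    return np.reshape(np.fft.irfft(stft_arr, n=Nwin), -1)


def write_wav(path, samples, sample_rate):
    """Hypothetical helper: save floats in [-1, 1] as 16-bit mono PCM."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype('<i2')
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)
        f.setsampwidth(2)                 # 2 bytes per sample
        f.setframerate(int(sample_rate))
        f.writeframes(pcm.tobytes())


sample_rate = 8000                            # Hertz (assumed)
t = np.arange(sample_rate) / sample_rate      # one second of audio
y = 0.5 * np.sin(2 * np.pi * 440 * t)         # stand-in for the loaded file

spectrum = stft(y, 128)
spectrum[:, 32:] = 0                          # example effect: silence bins 32 and up
back_y = istft(spectrum, 128)

out_path = os.path.join(tempfile.mkdtemp(), 'processed.wav')
write_wav(out_path, back_y, sample_rate)
print(out_path, back_y.size)
```

Note that `back_y` can be slightly shorter than `y`, because `stft` drops any samples that don't fill a whole chunk.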
Masking/amplifying/attenuating frequency content means just scaling some bins of the `spectrum` array. If you have specific questions on how to do that, let us know. But this hopefully will give you a foolproof way of applying arbitrary effects.
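For instance, a binary mask is just an array of zeros and ones with the same shape as the spectrum, and it is applied with an element-wise product (not a matrix product) before the inverse STFT. Here is a minimal sketch under simplifying assumptions: the sample rate, the two test tones, and the 100 Hz cutoff are made up, and the tones are chosen to land exactly on FFT bins so that the unwanted one is removed cleanly.

```python
import numpy as np

sample_rate = 1000.0                          # Hertz (assumed)
Nwin = 100                                    # bins are 10 Hz apart
t = np.arange(2000) / sample_rate
wanted = np.sin(2 * np.pi * 50 * t)           # tone to keep
noise = 0.5 * np.sin(2 * np.pi * 200 * t)     # tone to remove
x = wanted + noise

# Forward STFT as in the answer: disjoint chunks, real-only FFT per chunk.
spectrum = np.fft.rfft(np.reshape(x, (-1, Nwin)))
freqs = np.fft.rfftfreq(Nwin, d=1 / sample_rate)

# Binary mask: 1 keeps a bin, 0 silences it. Same shape as `spectrum`.
mask = np.ones(spectrum.shape)
mask[:, freqs > 100] = 0

# Element-wise product, then the inverse STFT.
masked = spectrum * mask
y = np.reshape(np.fft.irfft(masked, n=Nwin), -1)

print(np.max(np.abs(y - wanted)))             # residual error after masking
```

Values between 0 and 1 in `mask` attenuate instead of removing, and values above 1 amplify. With real recordings the energy of a sound usually leaks across neighboring bins, and editing disjoint chunks independently can leave audible discontinuities at chunk boundaries, so expect a less clean result than in this toy case.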
If you really want to use librosa's functions, let us know and we can show you how to do that too.
Source: https://stackoverflow.com/questions/51655119/how-do-i-apply-a-binary-mask-and-stft-to-produce-an-audio-file