How do I apply a binary mask and STFT to produce an audio file?

问题

So here's the idea: you can generate a spectrogram from an audio file using shorttime Fourier transform (stft). Then some people have generated something called a "binary mask" to generate different audio (ie. with background noise removed etc.) from the inverse stft.

Here's what I understand:

stft is a simple equation that is applied to the audio file, which generates the information that can easily be displayed a spectrogram. By taking the inverse of the stft matrix, and multiplying it by a matrix of the same size (the binary matrix) you can create a new matrix with information to generate an audio file with the masked sound.

Once I do the matrix multiplication, how is the new audio file created?

It's not much but here's what I've got in terms of code:

from librosa import load
from librosa.core import stft, istft
y, sample_rate = load('1.wav')
spectrum = stft(y)
back_y = istft(spectrum)

Thank you, and here are some slides that got me this far. I'd appreciate it if you could give me an example/demo in python

回答1:

Librosa's STFT is full-featured so unless you're very careful with how you manipulate the spectrum, you won't get a sensible output from its istft.

Here's a pair of functions, stft and istft, that I wrote from scratch that represent the forward and inverse STFT, along with a helper method that gives you the time and frequency locations of each pixel in the STFT array, plus a demo:

import numpy as np
import numpy.fft as fft


def stft(x, Nwin, Nfft=None):
    """
    Short-time Fourier transform: convert a 1D vector to a 2D array

    The short-time Fourier transform (STFT) breaks a long vector into disjoint
    chunks (no overlap) and runs an FFT (Fast Fourier Transform) on each chunk.

    The resulting 2D array can 

    Parameters
    ----------
    x : array_like
        Input signal (expected to be real)
    Nwin : int
        Length of each window (chunk of the signal). Should be ≪ `len(x)`.
    Nfft : int, optional
        Zero-pad each chunk to this length before FFT. Should be ≥ `Nwin`,
        (usually with small prime factors, for fastest FFT). Default: `Nwin`.

    Returns
    -------
    out : complex ndarray
        `len(x) // Nwin` by `Nfft` complex array representing the STFT of `x`.

    See also
    --------
    istft : inverse function (convert a STFT array back to a data vector)
    stftbins : time and frequency bins corresponding to `out`
    """
    Nfft = Nfft or Nwin
    Nwindows = x.size // Nwin
    # reshape into array `Nwin` wide, and as tall as possible. This is
    # optimized for C-order (row-major) layouts.
    arr = np.reshape(x[:Nwindows * Nwin], (-1, Nwin))
    stft = fft.rfft(arr, Nfft)
    return stft


def stftbins(x, Nwin, Nfft=None, d=1.0):
    """
    Time and frequency bins corresponding to short-time Fourier transform.

    Call this with the same arguments as `stft`, plus one extra argument: `d`
    sample spacing, to get the time and frequency axes that the output of
    `stft` correspond to.

    Parameters
    ----------
    x : array_like
        same as `stft`
    Nwin : int
        same as `stft`
    Nfft : int, optional
        same as `stft`
    d : float, optional
        Sample spacing of `x` (or 1 / sample frequency), units of seconds.
        Default: 1.0.

    Returns
    -------
    t : ndarray
        Array of length `len(x) // Nwin`, in units of seconds, corresponding to
        the first dimension (height) of the output of `stft`.
    f : ndarray
        Array of length `Nfft`, in units of Hertz, corresponding to the second
        dimension (width) of the output of `stft`.
    """
    Nfft = Nfft or Nwin
    Nwindows = x.size // Nwin
    t = np.arange(Nwindows) * (Nwin * d)
    f = fft.rfftfreq(Nfft, d)
    return t, f


def istft(stftArr, Nwin):
    """
    Inverse short-time Fourier transform (ISTFT)

    Given an array representing the output of `stft`, convert it back to the
    original samples.

    Parameters
    ----------
    stftArr : ndarray
        Output of `stft` (or something the same size)
    Nwin : int
        Same input as `stft`: length of each chunk that the STFT was calculated
        over.

    Returns
    -------
    y : ndarray
        Data samples corresponding to STFT data.

    See also:
    stft : the forward transform
    """
    arr = fft.irfft(stftArr)[:, :Nwin]
    return np.reshape(arr, -1)


if __name__ == '__main__':
    sampleRate = 100.0  # Hertz
    N = 1024
    Nwin = 64

    # Generate a chirp: start frequency at 5 Hz and going down at 2 Hz/s
    time = np.arange(N) / sampleRate  # seconds
    x = np.cos(2 * np.pi * time * (5 - 2 * 0.5 * time))

    # Test with Nfft bigger than Nwin
    Nfft = Nwin * 2
    s = stft(x, Nwin, Nfft=Nfft)
    y = istft(s, Nwin)

    # Make sure the stft and istft are inverses. Caveat: `x` and `y` won't be
    # the same length if `N/Nwin` isn't integral!
    maxerr = np.max(np.abs(x - y))
    assert (maxerr < np.spacing(1) * 10)

    # Test `stftbins`
    t, f = stftbins(x, Nwin, Nfft=Nfft, d=1 / sampleRate)
    assert (len(t) == s.shape[0])
    assert (len(f) == s.shape[1])

    try:
        import pylab as plt
        plt.imshow(np.abs(s), aspect="auto", extent=[f[0], f[-1], t[-1], t[0]])
        plt.xlabel('frequency (Hertz)')
        plt.ylabel('time (seconds (start of chunk))')
        plt.title('STFT with chirp example')
        plt.show()
    except ModuleNotFoundError:
        pass

This is in a gist if that's easier for you to read.

The entire module assumes real-only data and uses Numpy's rfft functions. You can definitely generalize this to complex data (or use librosa), but for your application (audio masking), using the real-only transforms makes it easier to ensure that everything works out and the output of the inverse STFT is real-only (it's easy to mess this up if you're doing the fully-general complex STFT, where you need to be careful in maintaining symmetries).

The demo first generates some test data and confirms that the istft on the stft of the data produces the data again. The test data is a chirp that starts at 5 Hz and goes down at 2 Hz per second, so over ~10 seconds of data, the chirp's frequency wraps around and ends up at around 15 Hz. The demo plots the STFT (by taking the absolute value of the STFT array):

put this code in a stft.py file,
import it as import stft,
compute an STFT as spectrum = stft.stft(y, 128),
visualize your spectrum as shown in the demo (be sure to prepend stft. to functions defined in stft.py!),
pick what frequencies you want to attenuate/amplify and apply those effects on the spectrum array, before
finally getting the processed audio via back_y = stft.istft(spectrum, 128).

Masking/amplifying/attenuating frequency content means just scaling some bins of the spectrum array. If you have specific questions on how to do that, let us know. But this hopefully will give you a foolproof way of applying arbitrary effects.

If you really want to use librosa's functions, let us know and we can show you how to do that too.

来源：https://stackoverflow.com/questions/51655119/how-do-i-apply-a-binary-mask-and-stft-to-produce-an-audio-file

标签

python

audio

linear-algebra

spectrogram

librosa