Question
I need to build a software that does audio recognition from a small audio sample (A) inside other audio samples (B), and output how many times A appears inside the audio from B (if there is a match).
What I have: A database with hundreds of audios
Input: New audio
Expected Output: A boolean indicating whether the input matches a sample from the database, and how many times the input appeared inside the matched audio (from the DB).
Any code, open source project, guides, books, videos, tutorial, etc... is useful! Thanks everyone!
Answer 1:
This is a very broad question, but let me try to back up and describe a little bit about how audio recognition works generally, and how you might perform this yourself.
I'm going to assume the audio comes from an audio file and not a stream, but it should be relatively easy to understand either way.
The Basics of Digital Audio
An audio file is a series of samples which are recorded into a device through a process called sampling. Sampling is the process by which a continuous analog signal (for instance, the electrical signal from a microphone or an electric guitar) is turned into a discrete, digital signal.
With audio signals, sampling is almost always done at a single sampling rate, generally somewhere between 8kHz and 192kHz. For your purposes, the only particularly important things to know about sampling are:
- The highest frequency a digital audio system can represent is the Nyquist frequency, which is half the sampling rate. So if you're using a sampling rate of 48kHz, the highest representable frequency is 24kHz. This is generally plenty, because humans can only hear up to about 20kHz, so you're safe using any sampling rate over 40kHz unless you're trying to record something that isn't for humans.
- After being sampled, the digital audio is stored as either floating point or integer values. Most often, an audio file is represented as 32-bit float, 24-bit integer, or 16-bit integer. In any case, most modern audio processing is done with floating point numbers, generally scaled within the window (-1.0, 1.0). In this system, alternating -1.0s and 1.0s is the loudest possible square wave at the highest possible frequency, and a series of 0.0s is silence.
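To make that scaling concrete, here is a minimal NumPy sketch (the sampling rate, duration, and tone frequency are arbitrary choices) that generates samples inside the (-1.0, 1.0) window:

```python
import numpy as np

sr = 48000                                      # sampling rate in Hz (arbitrary choice)
t = np.arange(int(sr * 0.01)) / sr              # 10 ms worth of sample times
samples = 0.5 * np.sin(2 * np.pi * 440.0 * t)   # a 440 Hz tone at half amplitude

# 10 ms at 48 kHz yields 480 discrete samples, all inside (-1.0, 1.0)
print(len(samples))                             # 480
print(samples.min() >= -1.0 and samples.max() <= 1.0)  # True
```

Each entry of `samples` is one float in that window; a real recording is the same kind of array, just filled by a microphone instead of a formula.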
Audio Recognition
General algorithms for audio recognition are complex, and often inefficient for many use cases. For instance, are you trying to determine whether an audio file exactly matches another audio file, or whether the two would merely sound nearly identical? To see why this matters, let's look at the simplest audio comparison algorithm (at least the simplest I can come up with):
def compareAudioFiles(a, b):
    if len(a) != len(b):
        return False
    for idx in range(len(a)):
        # if the current sample in a isn't equal to the current sample in b
        if a[idx] != b[idx]:
            return False
    return True  # if neither return above triggered, a and b are identical
This works *only under exact circumstances* -- if the audio files are even slightly different, they won't be matched as identical. Let's talk about a few ways this could fail:
- Floating point comparison -- it is risky to use == between floats, because float comparison is exact: tiny changes to the samples cause them to register as different. For instance:
import librosa

SamplesA, sr = librosa.load('audio_file_A.wav')
SamplesB = SamplesA * 1.000000001  # an inaudibly small gain change
compareAudioFiles(SamplesA, SamplesB)  # will be False
Even though the slight change to SamplesB is imperceptible, it is flagged as different by compareAudioFiles.
- Zero padding -- a single sample of 0 before or after the file will cause failure:
import librosa
import numpy

SamplesA, sr = librosa.load('audio_file_A.wav')
SamplesB = numpy.append(SamplesA, 0)  # adds one zero sample to the end
# will be False because len(SamplesA) != len(SamplesB)
compareAudioFiles(SamplesA, SamplesB)  # False
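One way to harden the comparison against both failure modes above is to compare with a tolerance rather than ==. A minimal sketch using numpy.allclose (the tolerance value here is an arbitrary choice, not a standard):

```python
import numpy as np

def audio_nearly_equal(a, b, tol=1e-6):
    """True when two sample arrays have equal length and every
    pair of samples differs by less than `tol`."""
    if len(a) != len(b):
        return False
    return bool(np.allclose(a, b, atol=tol))

a = np.array([0.0, 0.25, -0.5, 1.0])
b = a * (1.0 + 1e-10)                 # an inaudibly small gain change

print(audio_nearly_equal(a, b))                   # True, unlike exact ==
print(audio_nearly_equal(a, np.append(a, 0.0)))   # still False: zero padding
```

This absorbs tiny numeric differences, but as the second call shows, it does nothing about padding or alignment -- those need explicit trimming or a sliding comparison.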
There are tons of other reasons this wouldn't work, like phase mismatch, DC bias, and filtered-out low- or high-frequency content that isn't audible.
You could continue to patch this algorithm for cases like these, but it would still probably never work well enough to match perceived sounds to one another. In short, if you want to compare the way audio sounds, you need to use an acoustic fingerprinting library. One such library is pyacoustid. Otherwise, if you want to compare audio samples from files on their own, you can probably come up with a relatively stable algorithm that measures the difference between sounds in the time domain, taking into account zero padding, imprecision, bias, and other noise.
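To illustrate the time-domain direction for the original question (counting occurrences of a short clip A inside a longer track B), here is a minimal normalized cross-correlation sketch in NumPy. It assumes A occurs in B at the same sampling rate with no time-stretching or pitch shift, and the 0.95 threshold is an arbitrary choice -- for real-world recordings, fingerprinting (e.g. pyacoustid) is the robust route:

```python
import numpy as np

def count_occurrences(a, b, threshold=0.95):
    """Count positions where clip `a` matches a window of `b`, using
    normalized cross-correlation. Assumes identical sampling rates
    and no time-stretch/pitch shift between `a` and `b`."""
    n = len(a)
    a_norm = (a - a.mean()) / (a.std() + 1e-12)
    count = 0
    i = 0
    while i <= len(b) - n:
        w = b[i:i + n]
        w_norm = (w - w.mean()) / (w.std() + 1e-12)
        corr = float(np.dot(a_norm, w_norm)) / n   # 1.0 means identical shape
        if corr >= threshold:
            count += 1
            i += n          # skip past this match to avoid double-counting
        else:
            i += 1
    return count

# Synthetic demo: a noise burst "A" placed twice inside a track "B"
rng = np.random.default_rng(0)
clip = rng.uniform(-1.0, 1.0, 400)       # the short sample "A"
silence = np.zeros(400)
track = np.concatenate([silence, clip, silence, clip, silence])  # "B"
print(count_occurrences(clip, track))    # 2
```

This naive scan is O(len(A) * len(B)); for long files you would compute the correlation with an FFT (e.g. scipy.signal.fftconvolve) instead of the inner loop, but the matching logic is the same.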
For general purpose audio operations in Python, I'd recommend LibROSA.
Good luck!
Source: https://stackoverflow.com/questions/61760505/detecting-audio-inside-audio-audio-recognition