I need to find some literature on how to compare a real-time recorded voice (from a mic) against a database of pre-recorded voices. After comparing, I would then need to output a
I'm no expert in this field (so treat this accordingly), but you should look at the following.
How to approach?
1. Filter the voices

   The minimum band for recognizable speech is roughly 0.4-3.4 kHz (which is why those limits were used in old telephone filters). The human voice usually extends up to about 12.7 kHz, so if you are sure you have unfiltered recordings, then filter up to 12.7 kHz and also take out the 50 Hz or 60 Hz hum from power lines (see the sketch after this list).
2. Make the dataset

   If you have recordings of the same sentence to compare, then you can just compute the spectrum (via DFFT or DFCT) of the same tone/letter (for example at the start, middle and end), filter out the unused bands, and build the voice-print dataset from that data. If not, then you first need to find similar tones/letters in the recordings; for that you either need speech recognition to be sure, or you look for parts of the recordings that have similar properties. What those properties are you have to learn (by trial, or by researching speech-recognition papers); some hints: tempo, dynamic volume range, frequency ranges.
3. Compare against the dataset

   Numeric comparison is done with a correlation coefficient, which is pretty straightforward (and my favorite). You can also use a neural network for this (even for bullet 2), and there may also be some fuzzy approach. I recommend correlation because its output is close to what you want and it is deterministic, so there are no problems with over/under-fitting, invalid architecture, etc.
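Here is a minimal sketch of the three steps above in Python with NumPy/SciPy, assuming mono recordings at a known sample rate; the band edges, FFT length and file names (live_mic.wav, reference_voice.wav) are illustrative assumptions, not part of the original answer.

```python
# Minimal sketch of steps 1-3; band edges, FFT length and file names are
# illustrative assumptions only.
import numpy as np
from scipy import signal
from scipy.io import wavfile

def preprocess(x, sr, hum_hz=50.0):
    """Step 1: band-pass the speech band and notch out power-line hum."""
    # telephone band as a conservative default; widen the upper edge
    # (e.g. towards 12.7 kHz) if the recordings are unfiltered
    sos = signal.butter(4, [300, 3400], btype="bandpass", fs=sr, output="sos")
    x = signal.sosfiltfilt(sos, x)
    b, a = signal.iirnotch(hum_hz, Q=30, fs=sr)   # 50 Hz (or 60 Hz) mains hum
    return signal.filtfilt(b, a, x)

def voice_print(x, sr, n_fft=4096):
    """Step 2: a crude voice print = normalized magnitude spectrum of a segment."""
    seg = x[:n_fft] * np.hanning(min(len(x), n_fft))
    spec = np.abs(np.fft.rfft(seg, n=n_fft))
    return spec / (np.linalg.norm(spec) + 1e-12)

def similarity(print_a, print_b):
    """Step 3: correlation coefficient between two voice prints."""
    return np.corrcoef(print_a, print_b)[0, 1]

if __name__ == "__main__":
    sr_live, live = wavfile.read("live_mic.wav")          # hypothetical files
    sr_ref, ref = wavfile.read("reference_voice.wav")
    a = voice_print(preprocess(live.astype(float), sr_live), sr_live)
    b = voice_print(preprocess(ref.astype(float), sr_ref), sr_ref)
    print("correlation:", similarity(a, b))
```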
[edit1]
People also use formant filters to generate vowels and speech. Their properties mimic the human vocal tract, and the math behind them can also be used in speech recognition: by inspecting the major frequencies of the filter you can detect vowels, intonation, tempo and so on, which might be used for speech detection directly. However, that is way outside my field of expertise, but there are many papers about this out there, so just google.
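As an illustration of that formant idea, here is a rough sketch that estimates formant frequencies from an all-pole (LPC) model of a vowel segment; the LPC order, pre-emphasis coefficient and file name are assumptions for illustration.

```python
# Rough formant estimation from an LPC (all-pole) model of a vowel segment;
# the order, pre-emphasis coefficient and file name are illustrative only.
import numpy as np
import librosa

def estimate_formants(y, sr, order=12):
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])   # pre-emphasis boosts higher formants
    a = librosa.lpc(y, order=order)              # all-pole model coefficients
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # keep one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)   # pole angles -> frequencies in Hz
    return np.sort(freqs[freqs > 90])            # drop near-DC poles

y, sr = librosa.load("vowel_segment.wav", sr=None)   # hypothetical file
print("formant estimates (Hz):", estimate_formants(y, sr)[:4])
```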
I have done similar work before, so I may be the right person to describe the procedure to you.
I had pure recordings of sounds which I considered gold standards. I had written Python scripts to convert these sounds into an array of MFCC vectors. Read more about MFCCs here.
Extracting MFCCs can be considered the first step in processing an audio file; they are features that are good for identifying its acoustic content. I generated MFCCs every 10 ms with 39 attributes each, so a sound file that was 5 seconds long had around 500 MFCC vectors.
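A minimal sketch of that extraction step, assuming librosa is available; 13 coefficients plus their deltas and delta-deltas is one common way to get 39 attributes per 10 ms frame, and the file name is a placeholder.

```python
# Convert one gold-standard recording into (n_frames, 39) MFCC features:
# 13 coefficients per 10 ms frame plus their deltas and delta-deltas.
import numpy as np
import librosa

def mfcc_39(path):
    y, sr = librosa.load(path, sr=None)
    hop = int(0.010 * sr)                        # one frame every 10 ms
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)             # first derivative
    d2 = librosa.feature.delta(mfcc, order=2)    # second derivative
    return np.vstack([mfcc, d1, d2]).T           # shape: (n_frames, 39)

frames = mfcc_39("gold_standard.wav")            # hypothetical file name
print(frames.shape)                              # roughly (500, 39) for a 5 s file
```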
Then I wrote artificial neural network code along these lines. More about neural networks can be read here.
Then I trained the neural network's weights and biases (commonly known as the network parameters) using the stochastic gradient descent algorithm with backpropagation. The trained model was then saved and used to identify unknown sounds.
The new sounds were then represented as a sequence of MFCC vectors and given as input to the neural network. For each MFCC instance obtained from the new sound file, the network predicts one of the sound classes it was trained on. The number of correctly classified MFCC instances gives the accuracy with which the neural network was able to classify the unknown sound.
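As a hedged sketch of this training step, here is an equivalent setup using scikit-learn's MLPClassifier with the SGD solver (stochastic gradient descent with backpropagation) in place of hand-written network code; the array files, layer size and learning rate are assumptions.

```python
# Train a small MLP on labelled MFCC frames with the SGD solver
# (stochastic gradient descent + backpropagation) and save the model.
import numpy as np
import joblib
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: (n_frames, 39) MFCC vectors from all training files (hypothetical files),
# y: one class label per frame ("whistle", "car horn", "dog bark", "siren", ...)
X = np.load("mfcc_frames.npy")
y = np.load("frame_labels.npy", allow_pickle=True)

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), solver="sgd",
                  learning_rate_init=0.01, max_iter=200),
)
model.fit(X, y)
joblib.dump(model, "sound_classifier.joblib")    # reload later for unknown sounds
```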
Consider, for example: you train your neural network on 4 types of sounds (1. whistle, 2. car horn, 3. dog bark and 4. siren) using the procedure described above.
Say the new sound is a siren that is 5 s long. You will obtain approximately 500 MFCC instances. The trained neural network will try to classify each MFCC instance into one of the classes it was trained on. So you may get something like this:
30 instances were classified as whistle, 20 instances were classified as car horn, 10 instances were classified as dog bark, and the remaining instances were correctly classified as siren.
The accuracy of classification, or rather the commonness between the sounds, can be approximately calculated as the ratio of the number of correctly classified instances to the total number of instances, which in this case is 440 / 500 = 88%. This field is relatively new, and much work has been done using similar machine learning algorithms such as Hidden Markov Models, Support Vector Machines and more.
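A small sketch of that per-frame voting, reusing the hypothetical mfcc_39 helper and saved model from the sketches above; the file name is a placeholder.

```python
# Classify every MFCC frame of the unknown sound and report per-class counts;
# mfcc_39 and sound_classifier.joblib come from the earlier sketches.
import collections
import joblib

model = joblib.load("sound_classifier.joblib")
new_frames = mfcc_39("unknown_sound.wav")        # ~500 frames for a 5 s clip
pred = model.predict(new_frames)

counts = collections.Counter(pred)
print(counts)                                    # e.g. Counter({'siren': 440, 'whistle': 30, ...})
best_label, best_count = counts.most_common(1)[0]
print("best match:", best_label, "score:", best_count / len(pred))   # e.g. 440/500 = 0.88
```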
This problem has been tackled before, and you can find research papers about it on Google Scholar.
This is definitely not a trivial problem.
If you're seriously trying to solve it, I suggest you take a close look at how speech encoders work.
A rough break-down of the steps involved:
The parameters from step 3 are a sort of "fingerprint" of the vocal tract. Typically the consonant sounds are not sufficiently different to be of substantial use (unless the vowel sounds from two individuals are very similar).
As a first and very simple step, try to determine the average fundamental frequency of the vowels and use that frequency as the signature.
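A very simple sketch of that first step, assuming librosa; the pitch range and file names are illustrative assumptions.

```python
# Estimate a crude per-speaker signature: the median fundamental frequency
# of a (mostly voiced) recording; fmin/fmax are typical adult-speech bounds.
import numpy as np
import librosa

def average_fundamental(path, fmin=75.0, fmax=300.0):
    y, sr = librosa.load(path, sr=None)
    f0 = librosa.yin(y, fmin=fmin, fmax=fmax, sr=sr)   # per-frame pitch track
    return float(np.median(f0))                        # median is robust to outliers

sig_live = average_fundamental("live_mic.wav")          # hypothetical files
sig_ref = average_fundamental("reference_voice.wav")
print(f"live: {sig_live:.1f} Hz, reference: {sig_ref:.1f} Hz")
```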
Good luck,
Jens