I need to find some literature on how to compare a real-time recorded voice (from a mic) against a database of pre-recorded voices. After comparing I would then need to output a
I'm no expert in this field (so take this accordingly), but you should look at:
How to approach it:
1. Filter the voices
The minimum band for recognizable speech is roughly 0.3-3.4 kHz (which is why that band was used in old telephone filters). The human voice usually extends up to about 12.7 kHz, so if you are sure you have unfiltered recordings, then band-pass up to 12.7 kHz, and also take out the 50 Hz or 60 Hz hum from power lines.
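For illustration, here is a minimal SciPy sketch of that filtering step (the filter order, notch Q, and exact cutoffs are assumptions, and it expects a mono signal sampled above ~25.4 kHz so the 12.7 kHz cutoff stays below Nyquist):

```python
from scipy.signal import butter, sosfiltfilt, iirnotch, filtfilt

def clean_voice(samples, fs, mains_hz=50.0):
    """Band-pass to the voice band (0.3..12.7 kHz) and notch mains hum."""
    # band-pass to the voice band discussed above
    sos = butter(4, [300.0, 12700.0], btype="bandpass", fs=fs, output="sos")
    voice = sosfiltfilt(sos, samples)
    # notch the 50 Hz (or 60 Hz) power-line hum; largely redundant after
    # the 300 Hz high-pass edge, shown for completeness per the text above
    b, a = iirnotch(mains_hz, Q=30.0, fs=fs)
    return filtfilt(b, a, voice)
```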
2. Make the dataset
If you have recordings of the same sentence to compare, then you can just compute the spectrum of the same tone/letter (for example at the start, middle, and end) via the DFT/FFT or DCT, filter out the unused frequency areas, and build a voice-print dataset from the data. If not, then you first need to find similar tones/letters in the recordings; for that you need speech recognition to be sure, or you find parts of the recordings that have similar properties. What those properties are you have to learn (by trial, or by researching speech-recognition papers); here are some hints: tempo, dynamic volume range, frequency ranges. A voice-print sketch follows below.
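A minimal sketch of one way such a voice print could be computed (the windowing choice, band count, and normalization are illustrative assumptions, not part of the answer above):

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

def voice_print(segment, fs, bands=32, f_lo=300.0, f_hi=12700.0):
    """Reduce one tone/letter segment to a coarse spectral fingerprint:
    magnitude spectrum, keep only the voice band, average into `bands` bins."""
    window = np.hanning(len(segment))          # reduce spectral leakage
    spec = np.abs(rfft(segment * window))
    freqs = rfftfreq(len(segment), 1.0 / fs)
    keep = (freqs >= f_lo) & (freqs <= f_hi)   # filter out unused areas
    spec = spec[keep]
    # average into fixed-size bins so prints are comparable across segments
    chunks = np.array_split(spec, bands)
    fp = np.array([c.mean() for c in chunks])
    return fp / (np.linalg.norm(fp) + 1e-12)   # normalize away volume
```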
3. Compare the datasets
Numeric comparison is done with the correlation coefficient, which is pretty straightforward (and my favorite). You can also use a neural network for this (even for step 2), and there may also be some fuzzy-logic approach. I recommend correlation because its output is similar to what you want and it is deterministic, so there are no problems with over/under-training, invalid architecture, etc. (see the sketch below).
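A minimal sketch of that correlation-based matching, assuming the prints from step 2 are stored per speaker (the dict layout is an assumption for illustration):

```python
import numpy as np

def best_match(query_fp, database):
    """Score a query print against each stored print by Pearson correlation;
    `database` maps speaker name -> fingerprint vector."""
    scores = {}
    for name, fp in database.items():
        r = np.corrcoef(query_fp, fp)[0, 1]  # correlation coefficient in [-1, 1]
        scores[name] = r
    best = max(scores, key=scores.get)
    return best, scores[best], scores
```

With normalized prints, a score close to 1 suggests a match; where to put the accept threshold depends on your data.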
[edit1]
People also use formant filters to generate vowels and speech. Their properties mimic the human vocal tract, and the math behind them can also be used in speech recognition: by inspecting the major frequencies of the filter you can detect vowels, intonation, tempo, and so on, which might be used for speech detection directly. However, that is way outside my field of expertise, but there are many papers about this out there, so just google for them.
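As an illustration of inspecting those major frequencies, here is a rough LPC-based formant sketch (librosa is an assumed dependency, and the model order of 12 is a rule of thumb for speech downsampled to roughly 10 kHz):

```python
import numpy as np
import librosa  # assumed available; librosa.lpc fits the all-pole model

def formants(frame, fs, order=12):
    """Rough formant estimate for one voiced frame via LPC: poles of the
    all-pole filter near the unit circle mark the vocal-tract resonances."""
    a = librosa.lpc(frame.astype(np.float64), order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]             # one of each conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)  # pole angle -> Hz
    return np.sort(freqs[freqs > 90.0])           # drop near-DC poles
```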