Having only two people to differentiate, if they are uttering the same word or phrase will make this much easier. I suggest starting with something simple, and only adding complexity as needed.
To begin, I'd try sample counts of the digital waveform, binned by time and magnitude or (if you have the software functionality handy) an FFT of the entire utterance. I'd consider a basic modeling process first, too, such as linear discriminant (or whatever you already have available).