I need to identify the "quality" of the user's pronunciation with the help of the Microsoft speech SDK (System.Speech.Recognition). I am using MS Speech Engine - US.
Ok, here's how I'd approach the problem.
First, load up the dictation engine with the Pronunciation topic, which will return the phonemes spoken by the user (in the Recognition event).
Second, get the reference phonemes for the word using the ISpEnginePronunciation::GetPronunciations method (as I outlined here).
Once you have the two sets of phonemes, you can compare them. Essentially, the phonemes are separated by spaces, and each phoneme is represented by a short tag (described in the American English Phoneme Representation spec).
Given this, you should be able to compute a score by comparing the two phoneme sequences with any of a number of approximate string matching schemes (e.g., Levenshtein distance).
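As a rough sketch of that comparison step (not anything SAPI provides), here is a Levenshtein distance that treats each space-separated phoneme tag as one symbol, then turns the distance into a 0..1 score. The function names (`splitPhonemes`, `phonemeDistance`, `pronunciationScore`) and the normalization by the longer sequence are my own assumptions; pick whatever scoring suits you.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a space-separated phoneme string (e.g. "h eh l ow") into tags.
std::vector<std::string> splitPhonemes(const std::string& s) {
    std::istringstream in(s);
    std::vector<std::string> out;
    for (std::string tok; in >> tok;) out.push_back(tok);
    return out;
}

// Levenshtein distance where the unit of comparison is a whole phoneme tag,
// not a character. Uses two rolling rows of the DP table.
std::size_t phonemeDistance(const std::vector<std::string>& a,
                            const std::vector<std::string>& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t substitute = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({substitute, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Hypothetical score: 1.0 means identical sequences, 0.0 means nothing matched.
// Normalizing by the longer sequence is an assumption, not part of any SDK.
double pronunciationScore(const std::string& reference, const std::string& spoken) {
    const auto r = splitPhonemes(reference);
    const auto s = splitPhonemes(spoken);
    const std::size_t longest = std::max(r.size(), s.size());
    if (longest == 0) return 1.0;
    return 1.0 - static_cast<double>(phonemeDistance(r, s)) / longest;
}
```

For example, `pronunciationScore("h eh l ow", "h ax l ow")` has one substitution out of four phonemes, so it comes out as 0.75. The phoneme strings here are made up for illustration; in practice they would come from the recognition result and from GetPronunciations.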
You might find the problem simpler if you compare phone IDs rather than strings; ISpPhoneConverter::PhoneToId can convert the phoneme strings to an array of phone IDs, one ID per phoneme. That would give you a pair of null-terminated integer arrays, perhaps better suited for your comparison algorithm.
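If you go the ID route, the comparison could look roughly like this. This sketch does not call SAPI at all: `PhoneId` is a stand-in I'm assuming for the SDK's phone ID type, and the arrays are assumed to be zero-terminated as described above. The ID values in any test data would be whatever PhoneToId hands back.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: a 16-bit stand-in for the phone ID type returned by the SDK.
using PhoneId = std::uint16_t;

// Count the IDs before the terminating 0.
std::size_t phoneCount(const PhoneId* ids) {
    std::size_t n = 0;
    while (ids[n] != 0) ++n;
    return n;
}

// Levenshtein distance between two zero-terminated phone ID arrays.
std::size_t phoneIdDistance(const PhoneId* a, const PhoneId* b) {
    const std::size_t n = phoneCount(a);
    const std::size_t m = phoneCount(b);
    std::vector<std::size_t> prev(m + 1), cur(m + 1);
    for (std::size_t j = 0; j <= m; ++j) prev[j] = j;
    for (std::size_t i = 1; i <= n; ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= m; ++j) {
            std::size_t substitute = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({substitute, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[m];
}
```

The upside over the string version is that each element is already exactly one phoneme, so there is no tokenizing step and no risk of a multi-character tag being compared character by character.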
You could use the engine confidence to penalize matches, as low engine confidence indicates that the incoming audio doesn't closely match the engine's idea of the phoneme.