How do I convert any sound signal to a list of phonemes?
I.e., the actual methodology and/or code to go from a digital signal to a list of phonemes that the sound recor
Accurate phoneme recognition is not easy to achieve because phonemes themselves are pretty loosely defined. Even on good audio, the best systems today have a phoneme error rate of about 18% (you can check the LSTM-RNN results on TIMIT published by Alex Graves).
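To make the "phoneme error rate" figure concrete: it is usually computed as the edit distance between the recognized phoneme sequence and a reference transcription, divided by the length of the reference. Below is a minimal sketch of that calculation. The function name and the two sample sequences are made up for illustration and do not come from TIMIT or any toolkit.

```python
def phoneme_error_rate(reference, hypothesis):
    """Levenshtein distance between two phoneme lists, divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # Row 0 of the edit-distance table: turning an empty reference into hypothesis[:j]
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i]  # turning reference[:i] into an empty hypothesis costs i deletions
        for j, hyp_ph in enumerate(hypothesis, start=1):
            cost = 0 if ref_ph == hyp_ph else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical example: one phoneme dropped ("R") and one substituted ("D" -> "T")
reference = ["G", "OW", "F", "AO", "R", "W", "ER", "D"]
hypothesis = ["G", "OW", "F", "AO", "W", "ER", "T"]
print(phoneme_error_rate(reference, hypothesis))  # 2 errors / 8 phonemes = 0.25
```

An 18% error rate therefore means roughly one wrong, missing, or extra phoneme for every five or six in the reference.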
In CMUSphinx, phoneme recognition in Python is done like this:
from os import environ, path
from pocketsphinx.pocketsphinx import *
from sphinxbase.sphinxbase import *
MODELDIR = "../../../model"
DATADIR = "../../../test/data"
# Create a decoder with an acoustic model (-hmm) and a phone language model (-allphone)
config = Decoder.default_config()
config.set_string('-hmm', path.join(MODELDIR, 'en-us/en-us'))
config.set_string('-allphone', path.join(MODELDIR, 'en-us/en-us-phone.lm.dmp'))
config.set_float('-lw', 2.0)
config.set_float('-beam', 1e-10)
config.set_float('-pbeam', 1e-10)
# Decode streaming data.
decoder = Decoder(config)
decoder.start_utt()
stream = open(path.join(DATADIR, 'goforward.raw'), 'rb')
while True:
    buf = stream.read(1024)
    if buf:
        decoder.process_raw(buf, False, False)
    else:
        break
decoder.end_utt()
hypothesis = decoder.hyp()
print ('Best phonemes: ', [seg.word for seg in decoder.seg()])
You need to check out the latest pocketsphinx from GitHub in order to run this example. The result should look like this:
('Best phonemes: ', ['SIL', 'G', 'OW', 'F', 'AO', 'R', 'W', 'ER', 'D', 'T', 'AE', 'N', 'NG', 'IY', 'IH', 'ZH', 'ER', 'Z', 'S', 'V', 'SIL'])
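The list above still contains the `SIL` (silence) segments at both ends. If you only want the speech phonemes, you can filter them out afterwards. This is a small sketch: `strip_non_speech` is a hypothetical helper name, and only `SIL` is filtered by default because other filler labels depend on the model you use.

```python
def strip_non_speech(segments, non_speech=("SIL",)):
    """Drop silence/filler labels from a list of recognized segment names."""
    skip = set(non_speech)
    return [seg for seg in segments if seg not in skip]


# The segment list printed by the pocketsphinx example
segments = ['SIL', 'G', 'OW', 'F', 'AO', 'R', 'W', 'ER', 'D', 'T', 'AE', 'N',
            'NG', 'IY', 'IH', 'ZH', 'ER', 'Z', 'S', 'V', 'SIL']
phonemes = strip_non_speech(segments)
print(phonemes)
```

In the decoder example you would call it as `strip_non_speech([seg.word for seg in decoder.seg()])`.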
See also the wiki page
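One practical detail: `goforward.raw` in the example is headerless PCM, and as far as I know the stock en-us model expects 16 kHz, 16-bit, mono audio. If you start from a WAV file of your own, a quick check like the following (standard library only) can save some confusion, since audio in another format tends to produce garbage phonemes rather than an error. The function name and default values are my own assumptions, so adjust them to your model.

```python
import wave


def check_wav_format(wav_path, rate=16000, channels=1, sample_width=2):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(wav_path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {rate}")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels, expected {channels}")
        if wav.getsampwidth() != sample_width:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples, expected {8 * sample_width}-bit")
    return problems


# Usage (hypothetical file name):
#   for problem in check_wav_format("my_recording.wav"):
#       print("Warning:", problem)
```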
Have a look at Allosaurus, a universal (~2000 languages) phone recognizer that gives you IPA phones. I downloaded the latest model and tried it on a sample wave file in Python 3.
$ python -m allosaurus.bin.download_model -m latest
$ python -m allosaurus.run -i sample.wav
æ l u s ɔ ɹ s
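The command prints the phones as one space-separated string. If you would rather stay inside Python and get a list, something like the sketch below should work. It relies on `read_recognizer` from `allosaurus.app`, which is how I understand the project README to describe the Python interface. `parse_phones` and `audio_to_phones` are hypothetical helper names of my own.

```python
def parse_phones(output):
    """Split Allosaurus' space-separated output string into a list of phones."""
    return output.split()


def audio_to_phones(wav_path):
    """Recognize phones in a WAV file (assumes allosaurus and its model are installed)."""
    # Imported lazily so that parse_phones stays usable without allosaurus
    from allosaurus.app import read_recognizer
    model = read_recognizer()
    return parse_phones(model.recognize(wav_path))


# Parsing the sample output shown above
print(parse_phones("æ l u s ɔ ɹ s"))  # ['æ', 'l', 'u', 's', 'ɔ', 'ɹ', 's']
```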
I need to create the function audio_to_phonemes
You're basically saying:
I need to re-implement 40 years of speech recognition research
You shouldn't be implementing this yourself (unless you're about to be a professor in the field of speech recognition and have a revolutionary new approach), but should be using one of the many existing frameworks. Have a look at sphinx / pocketsphinx!