I am having trouble extracting the actual words and their timestamps from a given audio file. I have achieved this using the Google API, but needed an