How to handle homophones in speech recognition?

自作多情 submitted on 2020-07-05 10:43:08

Question


For those who are not familiar with what a homophone is, I provide the following examples:

  • our & are
  • hi & high
  • to & too & two

While using the Speech API included with iOS, I am encountering situations where a user may say one of these words, but it will not always return the word I want.

I looked into the alternativeSubstrings property, wondering if this would help, but in my testing of the above words, it always comes back empty.

I also looked into the Natural Language API, but could not find anything in there that looked useful.

I understand that as a user adds more words, the Speech API can begin to infer context and correct for these, but this will not work well for my use case, which will often involve only one or two words at most, limiting the effectiveness of context.

An example of contextual processing:

Using the words above on their own, I get these results:

  • are
  • hi
  • to

However, if I put together the following sentence, you can see they are all wrong:

I am too high for our ladder

Ideally, I would either get a list back containing [are, our], [to, too, two], [hi, high] for each transcription segment, or would have a way to compare a string against a function that supports homophones.

An example of this would be:

if myDetectedWord == "to" { ... }

Where myDetectedWord can be [to, too, two], and this function would return true for each of these.


Answer 1:


This is a common NLP dilemma, and I'm not sure what your desired output is in this application. If possible, you may want to design around this problem at the architecture stage. Otherwise, it is going to turn into a real challenge.


That being said, if you really wish to get into it, I like this idea of yours:

string against a function

This might be more efficient and performance-friendly.

One way I'd like to solve this problem would be through RegEx processing, instead of using endless loops and arrays. You could prototype with loops and arrays to begin with and see how it works, then switch to regular expressions to gain performance.
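A loops-and-arrays prototype of the "string against a function" idea might look roughly like the sketch below. The `homophoneGroups` table and the `soundsLike` function are hypothetical names made up for illustration, and the groups only cover the words from the question.

```swift
import Foundation

// Hypothetical homophone groups (assumption); extend this for your own vocabulary.
let homophoneGroups: [Set<String>] = [
    ["our", "are"],
    ["hi", "high"],
    ["to", "too", "two"],
]

/// Returns true when `detected` equals `target` or shares a homophone group with it.
/// The comparison ignores case.
func soundsLike(_ detected: String, _ target: String) -> Bool {
    let a = detected.lowercased()
    let b = target.lowercased()
    if a == b { return true }
    return homophoneGroups.contains { group in
        group.contains(a) && group.contains(b)
    }
}
```

With this in place, a check such as `soundsLike(myDetectedWord, "to")` would be true for "to", "too" and "two".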

You could, for instance, define your fixed arrays inside regular expressions and quickly check them against your string (word by word, maybe using back-references), and you can add as many boundaries to your expressions for string processing as you wish.
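As a rough sketch of the regular-expression approach, each homophone group can be written as an alternation wrapped in word boundaries and mapped to one canonical word. The `homophonePatterns` table and the `canonicalize` function are assumptions for illustration only. `NSRegularExpression` is the regex class from Foundation.

```swift
import Foundation

// Hypothetical table (assumption): a canonical word paired with a pattern
// covering its homophones. \b is a word boundary, so "too" matches but "tool" does not.
let homophonePatterns: [(canonical: String, pattern: String)] = [
    ("to", "\\b(?:to|too|two)\\b"),
    ("our", "\\b(?:our|are)\\b"),
    ("hi", "\\b(?:hi|high)\\b"),
]

/// Replaces every homophone found in `text` with its canonical word.
func canonicalize(_ text: String) -> String {
    var result = text
    for entry in homophonePatterns {
        guard let regex = try? NSRegularExpression(pattern: entry.pattern,
                                                   options: [.caseInsensitive]) else {
            continue // skip a pattern that fails to compile
        }
        let range = NSRange(result.startIndex..., in: result)
        result = regex.stringByReplacingMatches(in: result,
                                                options: [],
                                                range: range,
                                                withTemplate: entry.canonical)
    }
    return result
}
```

Two strings can then be compared after canonicalizing both, for example `canonicalize(myDetectedWord) == canonicalize("to")`. Note that this flattens the transcript: it treats homophones as equal but does not decide which spelling was meant.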

Your fixed arrays can also be designed based on the probability of certain words occurring in certain parts of a string. For instance,

^I 

vs

^eye

  • The probability of I being the first word is much higher than that of eye.
  • The probability of I in any part of a string is higher than that of eye, also.

You might want to weight words based on that.
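A minimal sketch of such weighting follows. The numbers in `wordWeights` are invented purely for illustration and are not measured frequencies; real weights would have to come from word-frequency data for your own domain. `mostLikely` is a hypothetical helper name.

```swift
import Foundation

// Hypothetical weights (assumption): made-up values, not real word frequencies.
let wordWeights: [String: Double] = [
    "i": 0.95, "eye": 0.05,
    "to": 0.60, "too": 0.25, "two": 0.15,
]

/// Returns the candidate with the highest weight, or nil for an empty list.
/// Words missing from the table are treated as weight 0.
func mostLikely(among candidates: [String]) -> String? {
    return candidates.max(by: { a, b in
        (wordWeights[a.lowercased()] ?? 0) < (wordWeights[b.lowercased()] ?? 0)
    })
}
```

A position-dependent version could keep one such table per position in the string (for example, one for the first word), following the ^I versus ^eye comparison above.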

I'd say the key is to narrow your desired outputs down as much as possible to increase accuracy (maybe even to 100 words, if possible), if you wish to have a good, working application.

Good project though, I hope you like/enjoy the challenge.



Source: https://stackoverflow.com/questions/56092593/how-to-handle-homophones-in-speech-recognition
