Python - How to intuit word from abbreviated text using NLP?


If you cannot find an exhaustive dictionary, you could build (or download) a probabilistic language model to generate and evaluate sentence candidates for you. It could be a character n-gram model or a neural network.
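As an illustration, here is a minimal character n-gram language model sketch (my own toy code, not a production model; train_char_ngram and score are hypothetical helpers, and score returns a negative log-likelihood, so lower is better):

import math
from collections import Counter, defaultdict

def train_char_ngram(text, n=5):
    # Count, for every (n-1)-character context, how often each
    # next character follows it.
    counts = defaultdict(Counter)
    padded = ' ' * (n - 1) + text
    for i in range(len(padded) - n + 1):
        context, char = padded[i:i + n - 1], padded[i + n - 1]
        counts[context][char] += 1
    return counts

def score(candidate, counts, n=5, alpha=1.0, vocab=64):
    # Negative log-likelihood with add-alpha smoothing; lower is better.
    padded = ' ' * (n - 1) + candidate
    nll = 0.0
    for i in range(len(padded) - n + 1):
        context, char = padded[i:i + n - 1], padded[i + n - 1]
        total = sum(counts[context].values())
        p = (counts[context][char] + alpha) / (total + alpha * vocab)
        nll -= math.log(p)
    return nll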

For your abbreviations, you can build a "noise model" which predicts the probability of each character being omitted. From a corpus (which you would have to label manually or semi-manually) it can learn, for example, that consonants are omitted less frequently than vowels.
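A toy version of such a noise model might look like this (a sketch assuming omissions are pure character deletions; the per-character probabilities and the greedy subsequence matching are my own simplifications):

import math

# Assumed per-character omission probabilities: vowels dropped often,
# everything else rarely. These numbers are made up for illustration.
OMIT_PROB = {c: 0.5 for c in 'aeiou'}
OMIT_PROB.update({c: 0.05 for c in 'bcdfghjklmnpqrstvwxyz '})

def noise_score(full, abbreviated):
    # Negative log-probability that `full` was shortened to `abbreviated`
    # by deleting characters (greedy subsequence matching); inf means
    # `abbreviated` cannot be obtained from `full` at all.
    nll, j = 0.0, 0
    for ch in full:
        if j < len(abbreviated) and ch == abbreviated[j]:
            nll -= math.log(1 - OMIT_PROB.get(ch, 0.05))   # character kept
            j += 1
        else:
            nll -= math.log(OMIT_PROB.get(ch, 0.05))       # character omitted
    return nll if j == len(abbreviated) else float('inf')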

Having a complex language model and a simple noise model, you can combine them using the noisy channel approach (see e.g. the article by Jurafsky for more details) to suggest candidate sentences.
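In code, the combination is just a sum of the two costs (reusing the hypothetical score and noise_score helpers sketched above; both are negative log-probabilities, so lower is better):

def channel_score(candidate, abbr, counts):
    # Noisy channel: P(candidate | abbr) is proportional to
    # P(abbr | candidate) * P(candidate), so in negative log space
    # the total cost is the noise cost plus the language-model cost.
    return noise_score(candidate, abbr) + score(candidate, counts)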

Update. I got enthusiastic about this problem and implemented this algorithm:

  • a language model (a character 5-gram trained on the text of The Lord of the Rings)
  • a noise model (the probability of each symbol being abbreviated)
  • a beam search algorithm for suggesting candidate phrases.
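Putting the pieces together, here is a compact beam-search sketch over candidate expansions (my own illustration of the approach, not the notebook's actual code; this noisy_channel is a stand-in with the same spirit as the interface mentioned below):

import heapq

ALPHABET = 'abcdefghijklmnopqrstuvwxyz '

def noisy_channel(abbr, lm_score, noise_score, beam_width=10, max_extra=8):
    # lm_score(text) and noise_score(text, abbr) are assumed callables
    # returning negative log-probabilities (lower is better), e.g. built
    # with functools.partial from the sketches above.
    beam = [('', 0)]          # (expansion so far, abbr chars consumed)
    finished = {}
    while beam:
        candidates = []
        for text, used in beam:
            if used == len(abbr):
                finished[text] = lm_score(text) + noise_score(text, abbr)
                continue
            candidates.append((text + abbr[used], used + 1))  # keep next char
            if len(text) - used < max_extra:                  # insert omitted char
                for c in ALPHABET:
                    candidates.append((text + c, used))
        scored = [(lm_score(t) + noise_score(t + abbr[u:], abbr), t, u)
                  for t, u in candidates]
        beam = [(t, u) for _, t, u in heapq.nsmallest(beam_width, scored)]
    return dict(sorted(finished.items(), key=lambda kv: kv[1])[:5])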

My solution is implemented in this Python notebook. With trained models, it exposes an interface like noisy_channel('bsktball', language_model, error_model), which, by the way, returns {'basket ball': 33.5, 'basket bally': 36.0}. The dictionary values are the scores of the suggestions (the lower, the better).

With other examples it works worse: for 'wtrbtl' it returns {'water but all': 23.7, 'water but ill': 24.5, 'water but lay': 24.8, 'water but let': 26.0, 'water but lie': 25.9, 'water but look': 26.6}.

For 'bwlingbl' it gives {'bwling belia': 32.3, 'bwling bell': 33.6, 'bwling below': 32.1, 'bwling belt': 32.5, 'bwling black': 31.4, 'bwling bling': 32.9, 'bwling blow': 32.7, 'bwling blue': 30.7}. However, when trained on an appropriate corpus (e.g. sports magazines and blogs, perhaps with nouns oversampled), and maybe with a more generous beam width, this model would provide more relevant suggestions.

So I've looked at a similar problem and came across a fantastic package called PyEnchant. If you use the built-in spell-checker you can get word suggestions, which would be a nice and simple solution. However, it will only suggest single words (as far as I can tell), so a situation like yours:

wtrbtl = water bottle

will not work.

Here is some code:

import enchant

# Create a US-English dictionary and ask it to suggest
# corrections for each abbreviation.
wordDict = enchant.Dict("en_US")

inputWords = ['wtrbtl', 'bwlingbl', 'bsktball']
for word in inputWords:
    print(wordDict.suggest(word))

The output is:

['rebuttal', 'tribute']
['bowling', 'blinding', 'blinking', 'bumbling', 'alienable', 'Nibelung']
['basketball', 'fastball', 'spitball', 'softball', 'executable', 'basketry']

Perhaps, if you know what sort of abbreviations there are, you can separate the string into two words, e.g.

'wtrbtl' -> ['wtr', 'btl']
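As a naive illustration (my own sketch, not part of PyEnchant), you could try every split point and keep the splits where both halves produce suggestions:

import enchant

d = enchant.Dict("en_US")

def split_suggestions(abbr):
    # Try every split point; keep the top suggestion for each half.
    results = []
    for i in range(2, len(abbr) - 1):   # require at least 2 chars per half
        left, right = d.suggest(abbr[:i]), d.suggest(abbr[i:])
        if left and right:
            results.append(left[0] + ' ' + right[0])
    return results

print(split_suggestions('wtrbtl'))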

There's also the Natural Language Toolkit (NLTK), which is AMAZING, and you could use it in combination with the code above, for example by ranking the suggestions by how common each suggested word is.
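For instance, here is one way to rank PyEnchant's suggestions by word frequency, sketched with NLTK's Brown corpus (any frequency list would do; the ranking heuristic is my own):

import enchant
import nltk
from nltk.corpus import brown

nltk.download('brown')   # one-time corpus download

# Frequency distribution over a general-purpose corpus.
freq = nltk.FreqDist(w.lower() for w in brown.words())

wordDict = enchant.Dict("en_US")
for word in ['wtrbtl', 'bwlingbl', 'bsktball']:
    ranked = sorted(wordDict.suggest(word), key=lambda s: -freq[s.lower()])
    print(word, '->', ranked[:3])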

Good luck!

One option is to go back in time and compute the Soundex equivalent of each word.

Soundex drops all the vowels, handles common mispronunciations and crunched-up spellings. The algorithm is simplistic and used to be done by hand. The downside is that it has no special support for word stemming or stop words.
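For reference, here is a minimal sketch of the classic Soundex algorithm (standard textbook version, my own coding; libraries such as jellyfish also ship an implementation):

# Map consonants to their Soundex digit classes.
CODES = {c: d for d, letters in
         {'1': 'bfpv', '2': 'cgjkqsxz', '3': 'dt',
          '4': 'l', '5': 'mn', '6': 'r'}.items()
         for c in letters}

def soundex(word):
    # Keep the first letter, code the rest, drop adjacent repeats and
    # vowels, then pad/truncate to four characters.
    word = word.lower()
    first, tail = word[0], word[1:]
    digits = [CODES.get(first, '')]
    for c in tail:
        d = CODES.get(c, '')
        if d and d != digits[-1]:
            digits.append(d)
        elif not d and c not in 'hw':   # vowels reset the repeat check
            digits.append('')
    code = first.upper() + ''.join(digits[1:])
    return (code + '000')[:4]

print(soundex('water'), soundex('wtr'))   # both map to W360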

... abbreviations for words not in a master dictionary.

So, you're looking for an NLP model that can come up with valid English words without having seen them before?

It is probably easier to find a more exhaustive word dictionary, or perhaps to map each word in the existing dictionary to common extensions such as +"es" or word[:-1] + "ies".
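A quick sketch of that mapping (hypothetical helper; real English morphology is messier than these two rules):

def expand(word):
    # Generate a few common inflected variants of a dictionary word.
    variants = {word, word + 's', word + 'es'}
    if word.endswith('y'):
        variants.add(word[:-1] + 'ies')
    return variants

vocabulary = set()
for w in ['bottle', 'berry', 'glass']:
    vocabulary |= expand(w)
print(sorted(vocabulary))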
