train a language model using Google Ngrams

Submitted by 痞子三分冷 on 2020-01-23 18:00:08

Question


I want to find the conditional probability of a word given its preceding words, and I plan to use the Google N-grams corpus for this. However, the corpus is so huge that I don't think it is computationally feasible on my PC to process all of the N-grams and train a language model.

So is there any way I can train a language model using the Google N-grams? (Even the Python NLTK library no longer supports an n-gram language model.) Note: I know that a language model can be trained from n-grams, but given the vast size of the Google N-grams corpus, how can a language model be trained specifically on them?


Answer 1:


You ought to check out this slick code base from UC Berkeley: https://github.com/adampauls/berkeleylm

In the examples/ folder, you will find a bash script make-binary-from-google.sh that creates a compact language model from the raw Google N-Grams. The resulting LM implements stupid backoff and utilizes a fast and efficient data structure described in the following paper: http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf
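For intuition, here is a minimal Python sketch of what stupid backoff computes (Brants et al., 2007). The count table, ALPHA constant, and function name are illustrative assumptions, not the berkeleylm API, and note that stupid backoff returns relative scores rather than normalized probabilities:

```python
# Minimal sketch of stupid-backoff scoring; real counts would come from the
# Google N-gram files rather than an in-memory Counter.
from collections import Counter

ALPHA = 0.4          # fixed backoff penalty suggested by Brants et al.
counts = Counter()   # hypothetical: n-gram counts of every order, keyed by token tuples
total_tokens = 0     # hypothetical: total number of unigram tokens in the corpus

def stupid_backoff_score(word, context):
    """Unnormalized score S(word | context); not a true probability."""
    if context:
        history = tuple(context)
        ngram = history + (word,)
        if counts[ngram] > 0 and counts[history] > 0:
            return counts[ngram] / counts[history]
        # No evidence at this order: back off to a shorter context, discounted by ALPHA.
        return ALPHA * stupid_backoff_score(word, history[1:])
    # Base case: relative unigram frequency.
    return counts[(word,)] / total_tokens if total_tokens else 0.0

# Toy usage (the real tables are far too large to hold in memory like this):
counts.update({("the",): 6, ("cat",): 2, ("the", "cat"): 2, ("the", "cat", "sat"): 1})
total_tokens = 10
print(stupid_backoff_score("sat", ("the", "cat")))  # 0.5, straight from the trigram counts
print(stupid_backoff_score("the", ("sat",)))        # backs off: 0.4 * 6/10 = 0.24
```

The point of the berkeleylm binary format is that it stores these huge count tables in a compressed, memory-efficient structure so that exactly this kind of lookup-and-backoff query stays fast.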

If you are just interested in the final trained LM, you can download it in a variety of languages from the Berkeley-hosted website: http://tomato.banatao.berkeley.edu:8080/berkeleylm_binaries/



Source: https://stackoverflow.com/questions/38264636/train-a-language-model-using-google-ngrams
