Question
I want to find the conditional probability of a word given its preceding words, and I plan to use the Google N-grams corpus for this. However, the corpus is so huge that I don't think it is computationally feasible to process all the N-grams and train a language model on my PC.
So is there any way I can train a language model using the Google N-grams? (Even the Python NLTK library no longer supports an ngram language model.)
Note - I know that a language model can be trained from ngrams in general, but given the vast size of the Google N-grams, how can a language model be trained specifically from them?
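For concreteness, the quantity I am after is just the maximum-likelihood estimate P(w | context) = count(context, w) / count(context). A toy Python sketch with made-up counts (the real Google N-grams obviously would not fit in an in-memory dictionary like this):

```python
from collections import Counter

# Hypothetical toy counts standing in for Google N-gram frequencies;
# the real corpus is far too large to hold in memory like this.
trigram_counts = Counter({
    ("the", "cat", "sat"): 12,
    ("the", "cat", "ran"): 8,
})
bigram_counts = Counter({("the", "cat"): 20})

def conditional_prob(word, context):
    """MLE estimate P(word | context) = count(context + word) / count(context)."""
    denom = bigram_counts[context]
    if denom == 0:
        return 0.0
    return trigram_counts[context + (word,)] / denom

print(conditional_prob("sat", ("the", "cat")))  # 12 / 20 = 0.6
```

The question is how to get something like this to scale to the full Google N-grams.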
Answer 1:
You ought to check out this slick code base from UC Berkeley: https://github.com/adampauls/berkeleylm
In the examples/ folder, you will find a bash script make-binary-from-google.sh that creates a compact language model from the raw Google N-Grams. The resulting LM implements stupid backoff and uses a fast, memory-efficient data structure described in the following paper: http://nlp.cs.berkeley.edu/pubs/Pauls-Klein_2011_LM_paper.pdf
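For reference, stupid backoff (Brants et al., 2007) just divides raw n-gram counts and backs off to a shorter context with a fixed discount (typically 0.4) when a count is missing, which is why it scales to corpora the size of the Google N-grams. Below is only a minimal Python sketch of that scoring rule over toy counts, not BerkeleyLM's actual implementation (which stores the counts in the compressed structures from the paper above):

```python
from collections import Counter

ALPHA = 0.4  # fixed backoff discount suggested by Brants et al. (2007)

def stupid_backoff_score(ngram_counts, total_tokens, words):
    """Score the last word of `words` given the preceding words with stupid backoff.

    ngram_counts: Counter keyed by word tuples of all orders.
    total_tokens: total number of tokens, used at the unigram level.
    """
    if len(words) == 1:
        # Unigram base case: relative frequency of the word.
        return ngram_counts[words] / total_tokens
    count = ngram_counts[words]
    context_count = ngram_counts[words[:-1]]
    if count > 0 and context_count > 0:
        return count / context_count
    # Back off to the shorter context, discounted by ALPHA.
    return ALPHA * stupid_backoff_score(ngram_counts, total_tokens, words[1:])
```

Note that these scores are not normalized probabilities, which is exactly the simplification that makes stupid backoff cheap enough for web-scale counts.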
If you are just interested in the final trained LM, you can download it in a variety of languages from the Berkeley-hosted website: http://tomato.banatao.berkeley.edu:8080/berkeleylm_binaries/
Source: https://stackoverflow.com/questions/38264636/train-a-language-model-using-google-ngrams