Question
Summary: Unable to find the model file used for Lemmatizer (english-lemmatizer.bin)
Details: OpenNLP Tools Models appears to be a comprehensive repository for the various models used by the different components of the Apache OpenNLP library. However, I am unable to find the model file en-lemmatizer.bin, which is used with the lemmatizer. The Apache OpenNLP Developer Manual provides the following code snippet for the Lemmatization step:
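// Requires: import java.io.FileInputStream; import java.io.InputStream;
// try-with-resources must declare the stream inside the parentheses.
try (InputStream dictLemmatizer = new FileInputStream("english-lemmatizer.bin")) {
    // load the lemmatizer from dictLemmatizer here
}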
However, unlike other model files, I am just not able to find the location of this model file. Any pointers would be appreciated.
Answer 1:
The book "Natural Language Processing with Java Cookbook" by Richard M. Reese provides a good answer. For some reason en-lemmatizer.bin is not available for direct download from the web, but it can be created using the following steps:
1. Download and untar apache-opennlp-1.9.0-bin.tar (https://opennlp.apache.org/download.html).
2. Go to the URL for the Lemmatizer Training File and save the text content as en-lemmatizer.dict.
3. Go to the bin directory (from step 1, after untarring) and execute the following command:
opennlp LemmatizerTrainerME -model en-lemmatizer.bin -lang en -data /path/to/en-lemmatizer.dict -encoding UTF-8
Note: Be prepared to handle the following error:
Computing event counts... Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
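One common way to get past this error is to give the JVM a larger maximum heap with the standard -Xmx option. The sketch below is only an illustration under some assumptions: it bypasses the opennlp launcher script and calls the command-line entry point class opennlp.tools.cmdline.CLI directly, it assumes the jars sit in the lib directory of the untarred distribution, and the names LEMMA_HEAP, OPENNLP_HOME and train_lemmatizer are made up for this example. The 4g value is an arbitrary starting point, not a measured requirement.

```shell
#!/bin/sh
# Assumptions (not from the original answer): LEMMA_HEAP, OPENNLP_HOME and
# train_lemmatizer are illustrative names; 4g is an arbitrary starting heap size.
LEMMA_HEAP="${LEMMA_HEAP:-4g}"
OPENNLP_HOME="${OPENNLP_HOME:-$PWD}"   # root of the untarred distribution

# Usage: train_lemmatizer <output-model.bin> <training-data.dict>
train_lemmatizer() {
    # -Xmx raises the JVM's maximum heap; "lib/*" is expanded by java itself,
    # so it stays quoted to keep the shell from globbing it.
    java "-Xmx${LEMMA_HEAP}" -cp "${OPENNLP_HOME}/lib/*" \
        opennlp.tools.cmdline.CLI LemmatizerTrainerME \
        -model "$1" -lang en -data "$2" -encoding UTF-8
}
```

After sourcing this, something like train_lemmatizer en-lemmatizer.bin /path/to/en-lemmatizer.dict should run the same training as the command above, only with more heap. If it still runs out of memory, try a larger LEMMA_HEAP value.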
Answer 2:
You want en-lemmatizer.bin and not english-lemmatizer.txt
Source: https://stackoverflow.com/questions/55391121/opennlp-unable-to-locate-the-model-file-for-lemmatizer