I want to use OpenNLP to tokenize Thai words. I downloaded OpenNLP and the Thai tokenizer model and ran the following:
./bin/opennlp POSTagger -lang th -model
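The models from your link are outdated, so you first need a few manual steps to convert them. Start with the tokenizer: extract the downloaded tokenizer model to an empty folder and rename the extracted file thai.tok.bin to token.model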
In the same folder, create a file named manifest.properties with the following contents:
Manifest-Version=1.0
Language=th
OpenNLP-Version=1.5.0
Component-Name=TokenizerME
useAlphaNumericOptimization=false
Now zip the two files into the new model package. On Linux you can use this command:
zip thai.tok.bin token.model manifest.properties
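If you would rather script the manifest step than type the file by hand, here is a minimal sketch. The function name write_manifest and the folder ./thai-token are my own placeholders, not anything OpenNLP defines. The key/value lines are exactly the ones listed above, and the zip call only runs once the renamed token.model is actually in the folder.

```shell
# write_manifest DIR COMPONENT [EXTRA_LINE...]
# Hypothetical helper: writes DIR/manifest.properties with the keys used in
# this answer, followed by any extra key=value lines passed as arguments.
write_manifest() {
  dir=$1
  component=$2
  shift 2
  mkdir -p "$dir"
  {
    printf '%s\n' 'Manifest-Version=1.0' 'Language=th' 'OpenNLP-Version=1.5.0'
    printf 'Component-Name=%s\n' "$component"
    for extra in "$@"; do
      printf '%s\n' "$extra"
    done
  } > "$dir/manifest.properties"
}

# Tokenizer manifest; ./thai-token is a placeholder folder name.
write_manifest ./thai-token TokenizerME useAlphaNumericOptimization=false

# Package only when the renamed model file is present in the folder.
if [ -f ./thai-token/token.model ]; then
  (cd ./thai-token && zip thai.tok.bin token.model manifest.properties)
fi
```

The same helper can write the POS tagger manifest if you call it with POSTaggerME and no extra lines.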
Try your model:
sh bin/opennlp TokenizerME ~/Downloads/thai-token.bin/thai.tok.bin < thai_sentence.txt
Loading Tokenizer model ... done (0,097s)
กินอะไร ยังนาย
Average: 333,3 sent/s
Total: 1 sent
Runtime: 0.003s
Execution time: 0,108 seconds
Now that you have the updated tokenizer, you can do the same with the POS tagger model.
Download the file thai.tag.bin.gz and extract it to an empty folder. Rename the extracted file thai.tag.bin to pos.model
In the same folder, create a file named manifest.properties with the following contents:
Manifest-Version=1.0
Language=th
OpenNLP-Version=1.5.0
Component-Name=POSTaggerME
Now zip the two files into the new model package. On Linux you can use this command:
zip thai.pos.bin pos.model manifest.properties
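A typo in manifest.properties is easy to miss, so it can help to check the file before zipping. The sketch below only confirms that the four keys used in both manifests in this answer are present. The function name check_manifest is a placeholder of mine, and I am not claiming this is the full set of keys OpenNLP validates.

```shell
# check_manifest FILE
# Hypothetical helper: succeeds only if FILE contains each of the four keys
# that appear in both manifests shown in this answer.
check_manifest() {
  file=$1
  for key in Manifest-Version Language OpenNLP-Version Component-Name; do
    if ! grep -q "^${key}=" "$file"; then
      echo "missing key: $key" >&2
      return 1
    fi
  done
}
```

You could then run check_manifest manifest.properties && zip thai.pos.bin pos.model manifest.properties so the archive is only built when the check passes.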
Finally, we can try the two models combined:
sh bin/opennlp TokenizerME ~/Downloads/thai-token.bin/thai.tok.bin < thai_sentence.txt > thai_tokens.txt
sh bin/opennlp POSTagger ~/Downloads/pt-pos-maxent/thai.pos.bin < thai_tokens.txt
The result is:
กินอะไร_VACT ยังนาย_NCMN
Please let me know if this is the expected result.
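To make the tagged output easier to check, you can split it into one token and tag per line. This is a small sketch in plain POSIX shell. The function name split_tags is a placeholder, and it assumes every item has the token_TAG form shown in the result above, with the tag after the last underscore.

```shell
# split_tags
# Hypothetical helper: reads POS tagger output (space-separated token_TAG
# items) on stdin and prints one "token<TAB>tag" pair per line.
split_tags() {
  tr -s ' ' '\n' | while IFS= read -r pair; do
    # Skip blank lines left over from empty input lines.
    [ -n "$pair" ] || continue
    # Everything before the last underscore is the token, the rest is the tag.
    printf '%s\t%s\n' "${pair%_*}" "${pair##*_}"
  done
}
```

For example, printf 'กินอะไร_VACT ยังนาย_NCMN\n' | split_tags prints the two tokens on separate lines, each followed by a tab and its tag.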