OpenNLP POSTagger output from command line

春和景丽 2021-01-24 06:11

I want to use OpenNLP to tokenize Thai words. I downloaded OpenNLP and the Thai tokenizer model and ran the following:

./bin/opennlp POSTagger -lang th -model         


        
1 answer
    北荒 (original poster)
    2021-01-24 06:23

    The models from your link are outdated, so you first need a few manual steps to convert them.

    1. Download the file thai.tok.bin.gz and extract it into an empty folder. Rename the extracted file thai.tok.bin to token.model.
    2. In the same folder, create a file named manifest.properties with the following contents:

      Manifest-Version=1.0
      Language=th
      OpenNLP-Version=1.5.0
      Component-Name=TokenizerME
      useAlphaNumericOptimization=false
      
    3. Now zip the two files. On Linux you can use this command: zip thai.tok.bin token.model manifest.properties

    4. Try your model:

      sh bin/opennlp TokenizerME ~/Downloads/thai-token.bin/thai.tok.bin <  thai_sentence.txt
      
      
      
      Loading Tokenizer model ... done (0,097s)     
      กินอะไร ยังนาย     
      
      
      Average: 333,3 sent/s      
      Total: 1 sent     
      Runtime: 0.003s     
      Execution time: 0,108 seconds 
      

    Now that you have the updated tokenizer, you can do the same with the POS tagger model.

    1. Download the file thai.tag.bin.gz and extract it into an empty folder. Rename the extracted file thai.tag.bin to pos.model.

    2. In the same folder, create a file named manifest.properties with the following contents:

      Manifest-Version=1.0
      Language=th
      OpenNLP-Version=1.5.0
      Component-Name=POSTaggerME
      
    3. Now zip the two files. On Linux you can use this command: zip thai.pos.bin pos.model manifest.properties
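
Both conversions follow the same three steps, so they can be scripted once. The following is only a minimal sketch, assuming `gzip` and `zip` are installed. `repackage_model` and its argument order are hypothetical names for illustration and are not part of OpenNLP; the manifest values are copied from the steps above.

```shell
# Hypothetical helper (not part of OpenNLP): rebuild an old model file as a
# zip package containing the model entry and a manifest.properties.
# Usage: repackage_model <model.bin.gz> <entry-name> <component> <output> [extra manifest lines...]
repackage_model() {
    src=$1 entry=$2 component=$3 out=$4
    shift 4
    work=$(mktemp -d) || return 1

    # Step 1: extract the .gz and store the payload under the new entry name.
    gzip -dc "$src" > "$work/$entry" || { rm -rf "$work"; return 1; }

    # Step 2: write the manifest; remaining arguments become extra lines.
    {
        printf '%s\n' 'Manifest-Version=1.0' 'Language=th' \
            'OpenNLP-Version=1.5.0' "Component-Name=$component"
        for line in "$@"; do
            printf '%s\n' "$line"
        done
    } > "$work/manifest.properties"

    # Step 3: zip both files; -j stores them without directory paths.
    rm -f "$out"
    zip -q -j "$out" "$work/$entry" "$work/manifest.properties"
    status=$?
    rm -rf "$work"
    return "$status"
}
```

Under these assumptions, the tokenizer model would be rebuilt with `repackage_model thai.tok.bin.gz token.model TokenizerME thai.tok.bin 'useAlphaNumericOptimization=false'` and the tagger model with `repackage_model thai.tag.bin.gz pos.model POSTaggerME thai.pos.bin`.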

    Finally, we can try the two models combined:

    sh bin/opennlp TokenizerME ~/Downloads/thai-token.bin/thai.tok.bin < thai_sentence.txt > thai_tokens.txt
    sh bin/opennlp POSTagger ~/Downloads/pt-pos-maxent/thai.pos.bin < thai_tokens.txt
    

    The result is:

    กินอะไร_VACT ยังนาย_NCMN
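
The tagger output above joins each token to its tag with an underscore and separates tokens with spaces. If one token per line is easier to inspect, a small filter can split them. This is only a sketch: `split_tags` is a hypothetical helper name, and it assumes the tag is whatever follows the last underscore of each token.

```shell
# Hypothetical helper: turn "word_TAG word_TAG ..." into one
# "word<TAB>TAG" pair per line.
split_tags() {
    tr -s ' ' '\n' | awk '
        NF {
            # The tag is the last "_"-separated piece of the token.
            n = split($0, parts, "_")
            if (n < 2) next            # skip tokens that carry no tag
            tag = parts[n]
            word = substr($0, 1, length($0) - length(tag) - 1)
            printf "%s\t%s\n", word, tag
        }'
}
```

For example, appending `| split_tags` to the POSTagger command above would print each token and its tag on a separate line.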
    

    Please let me know if this is the expected result.
