Output results in conll format (POS-tagging, stanford pos tagger)

拟墨画扇 submitted on 2019-12-06 07:20:24

Question


I am trying to use the Stanford POS tagger. Is it possible to parse an English text (actually, POS tagging alone would be enough) and output the results in CoNLL format? Is there such an option?

I am using the full 3.2.0 version of the Stanford POS tagger.

Thanks a lot


Answer 1:


By CoNLL format, I presume you mean the CoNLL-2000 chunking task format, which looks like this:

   He        PRP  B-NP
   reckons   VBZ  B-VP
   the       DT   B-NP
   current   JJ   I-NP
   account   NN   I-NP
   deficit   NN   I-NP
   will      MD   B-VP
   narrow    VB   I-VP
   to        TO   B-PP
   only      RB   B-NP
   #         #    I-NP
   1.8       CD   I-NP
   billion   CD   I-NP
   in        IN   B-PP
   September NNP  B-NP
   .         .    O

There are three columns in the CoNLL-2000 chunking task format:

  1. token (i.e. word)
  2. POS tag
  3. BIO (begin, inside, outside) of chunk/phrase tag
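To make the third column concrete, here is a minimal Python sketch that groups tokens into chunks from their BIO tags. The function name `bio_to_spans` is made up for illustration, and the rows are assumed to be whitespace-separated as in the sample above.

```python
from typing import List, Tuple

def bio_to_spans(rows: List[Tuple[str, str, str]]) -> List[Tuple[str, List[str]]]:
    """Group (token, pos, chunk_tag) rows into (chunk_type, tokens) spans."""
    spans: List[Tuple[str, List[str]]] = []
    for token, _pos, tag in rows:
        if tag == "O":
            # Outside any chunk: keep the token as its own "O" span.
            spans.append(("O", [token]))
            continue
        prefix, _, kind = tag.partition("-")
        # "I-X" continues the previous span only if that span is also of type X.
        if prefix == "I" and spans and spans[-1][0] == kind:
            spans[-1][1].append(token)
        else:
            # "B-X" (or a stray "I-X") opens a new span of type X.
            spans.append((kind, [token]))
    return spans

sample = """He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
. . O"""

rows = [tuple(line.split()) for line in sample.splitlines()]
for kind, tokens in bio_to_spans(rows):
    print(kind, " ".join(tokens))
```

Each `B-` tag opens a new chunk and the following `I-` tags of the same type extend it, so "the current account deficit" comes out as a single NP.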

Sadly, the Stanford MaxEnt tagger only gives you the token and its POS tag; it produces no BIO chunk information.

java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/left3words-wsj-0-18.tagger -textFile short.txt -outputFormat tsv 2> /dev/null

With the above command, the Stanford POS tagger already gives you tab-separated output, just without the third column (see http://nlp.stanford.edu/software/pos-tagger-faq.shtml):

   He        PRP
   reckons   VBZ
   the       DT
   ...
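If you later get chunk tags from some other tool, joining them onto the tagger output is simple. The Python sketch below assumes the tagger output has one `token<TAB>tag` pair per line with a blank line between sentences; check that against what your tagger actually prints. Both function names are hypothetical helpers, not part of any Stanford tool.

```python
from typing import List, Sequence, Tuple

def read_tagger_tsv(text: str) -> List[List[Tuple[str, str]]]:
    """Parse token<TAB>tag lines into sentences (blank line = sentence break, assumed)."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos = line.split("\t")
        current.append((token, pos))
    if current:
        sentences.append(current)
    return sentences

def to_conll_lines(sentence: Sequence[Tuple[str, str]],
                   chunk_tags: Sequence[str]) -> List[str]:
    """Attach one externally produced chunk tag per token as the third column."""
    if len(sentence) != len(chunk_tags):
        raise ValueError("need exactly one chunk tag per token")
    return [f"{tok} {pos} {chunk}" for (tok, pos), chunk in zip(sentence, chunk_tags)]

tagged = "He\tPRP\nreckons\tVBZ\n\nthe\tDT\n"
first = read_tagger_tsv(tagged)[0]
print("\n".join(to_conll_lines(first, ["B-NP", "B-VP"])))
```

The chunk tags passed to `to_conll_lines` are placeholders here; in practice they would come from a chunker or from a converted parse tree.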

To get the BIO column, you would need either:

  • a statistical chunker or
  • a full parser

See http://www-nlp.stanford.edu/links/statnlp.html for a list of chunkers and parsers. If you want to stick with Stanford tools, I suggest the Stanford parser (see http://nlp.stanford.edu/software/lex-parser.shtml). Note that it outputs parses in bracketed format, so you will have to do some post-processing to get them into CoNLL-2000 format.
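As a rough idea of what that post-processing could look like, here is a simplified Python sketch. It reads one Penn-style bracketed tree and labels each token with the phrase directly above its POS tag. This is an approximation and not the conversion used to build the CoNLL-2000 data, which applies extra rules: for example, nested VPs such as "will narrow" come out here as B-VP B-VP rather than B-VP I-VP. The function names and the example tree are made up for illustration.

```python
import re
from typing import List, Tuple

# Chunk types used in the CoNLL-2000 data; tokens under other phrases get "O".
CHUNK_LABELS = {"ADJP", "ADVP", "CONJP", "INTJ", "LST", "NP",
                "PP", "PRT", "SBAR", "UCP", "VP"}

def parse_brackets(text: str) -> list:
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node() -> list:
        nonlocal pos
        pos += 1                                  # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])      # the word under a POS tag
                pos += 1
        pos += 1                                  # consume ")"
        return [label] + children

    return node()

def tree_to_rows(tree: list) -> List[Tuple[str, str, str]]:
    """Return (token, pos, chunk_tag) rows using the immediate parent phrase."""
    rows: List[Tuple[str, str, str]] = []
    prev_parent = None

    def walk(n: list, parent) -> None:
        nonlocal prev_parent
        label, children = n[0], n[1:]
        if len(children) == 1 and isinstance(children[0], str):   # POS node
            phrase = parent[0].split("-")[0] if parent is not None else ""
            if phrase in CHUNK_LABELS:
                prefix = "I-" if parent is prev_parent else "B-"
                rows.append((children[0], label, prefix + phrase))
                prev_parent = parent
            else:
                rows.append((children[0], label, "O"))
                prev_parent = None
            return
        for child in children:
            if isinstance(child, list):
                walk(child, n)

    walk(tree, None)
    return rows

example = "(ROOT (S (NP (DT the) (NN deficit)) (VP (MD will) (VP (VB narrow))) (. .)))"
for token, pos_tag, chunk in tree_to_rows(parse_brackets(example)):
    print(token, pos_tag, chunk)
```

For real data you would want to merge nested phrases of the same type and handle the cases covered by the official conversion rules; treat this only as a starting point.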



Source: https://stackoverflow.com/questions/18948712/output-results-in-conll-format-pos-tagging-stanford-pos-tagger
