How to build a POS-tagged corpus with NLTK?

Submitted by 穿精又带淫゛_ on 2019-12-12 14:23:52

Question


I am trying to build a POS-tagged corpus from external .txt files for chunking and for entity and relation extraction. So far I have only found a cumbersome multi-step solution:

  1. Read the files into a plaintext corpus:

    from nltk.corpus.reader import PlaintextCorpusReader
    my_corp = PlaintextCorpusReader(".", r".*\.txt")
    
  2. Tag the corpus with the built-in Penn Treebank POS tagger:

    my_tagged_corp = nltk.batch_pos_tag(my_corp.sents())
    

(By the way, at this point Python threw an error: NameError: name 'batch' is not defined. `batch_pos_tag` was removed in NLTK 3; its replacement is `nltk.pos_tag_sents`.)

  3. Write the tagged sentences out to a file:

    taggedfile = open("output.txt", "w")
    for sent in my_tagged_corp:
        line = " ".join(w + "/" + t for (w, t) in sent)
        taggedfile.write(line + "\n")
    taggedfile.close()
    
  4. Finally, read this output back in as a tagged corpus:

    from nltk.corpus.reader import TaggedCorpusReader
    my_corpus2 = TaggedCorpusReader(".", r"output.txt")
    

All of this is very inconvenient for quite a common task (chunking always requires a tagged corpus). My question is: is there a more compact and elegant way to do this? For instance, a corpus reader that takes raw input files and a tagger at the same time?


Answer 1:


I got a working solution for this: please refer to the link for the step-by-step procedure.

Download the necessary files from here.

Once you follow the commands, a pickle file will be generated; this is your tagged corpus.

Once the pickle file is generated, you can check whether your tagger is working correctly by running the following piece of code:

import nltk.data
tagger = nltk.data.load("taggers/NAME_OF_TAGGER.pickle")
tagger.tag(['some', 'words', 'in', 'a', 'sentence'])


Source: https://stackoverflow.com/questions/38020141/how-to-build-pos-tagged-corpus-with-nltk
