Question
I am trying to build a POS-tagged corpus from external .txt files for chunking and for entity and relation extraction. So far I have found only a cumbersome multistep solution:
Read the files into a plaintext corpus:
from nltk.corpus.reader import PlaintextCorpusReader
my_corp = PlaintextCorpusReader(".", r".*\.txt")
Tag the corpus with the built-in Penn POS tagger:
my_tagged_corp = nltk.batch_pos_tag(my_corp.sents())
(By the way, at this point Python threw an error: NameError: name 'batch' is not defined.)
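The NameError comes from a rename: in NLTK 3 the batch_* functions were dropped, and batch_pos_tag became nltk.pos_tag_sents. A minimal sketch of the sentence-level tagging interface, using a DefaultTagger stand-in so the snippet runs without downloading the Perceptron model (with the model installed, nltk.pos_tag_sents(my_corp.sents()) does the same job):

```python
from nltk.tag import DefaultTagger

# In NLTK 3, batch_pos_tag was renamed to nltk.pos_tag_sents, hence the
# NameError above. DefaultTagger ("tag everything NN") stands in here so
# the snippet runs without the PerceptronTagger model download.
tagger = DefaultTagger("NN")
print(tagger.tag_sents([["A", "dog", "barks"]]))
# [[('A', 'NN'), ('dog', 'NN'), ('barks', 'NN')]]
```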
Write the tagged sentences out to a file:
taggedfile = open("output.txt", "w")
for sent in my_tagged_corp:
    line = " ".join(w + "/" + t for (w, t) in sent)
    taggedfile.write(line + "\n")
taggedfile.close()
And finally, read this output back in as a tagged corpus:
from nltk.corpus.reader import TaggedCorpusReader
my_corpus2 = TaggedCorpusReader(".", r"output.txt")
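The word/TAG round trip can be checked in isolation. A small sketch with hypothetical file contents, relying on the fact that TaggedCorpusReader splits on "/" by default and treats each line as one sentence:

```python
import os
import tempfile
from nltk.corpus.reader import TaggedCorpusReader

# Write one tagged sentence in the word/TAG format used above, then read
# it back. TaggedCorpusReader's default separator is "/" and its default
# sentence tokenizer splits on newlines, so no extra models are needed.
tagged_sent = [("A", "DT"), ("dog", "NN"), ("barks", "VBZ")]
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "output.txt"), "w") as f:
    f.write(" ".join(w + "/" + t for (w, t) in tagged_sent) + "\n")

reader = TaggedCorpusReader(tmpdir, r"output.txt")
print(list(reader.tagged_words()))
# [('A', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
```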
That is all very inconvenient for quite a common task (chunking always requires a tagged corpus). My question is: is there a more compact and elegant way to implement this? For instance, a corpus reader that takes raw input files and a tagger at the same time?
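The four steps above can at least be wrapped in one helper. A sketch (tag_corpus is a hypothetical name, not an NLTK API); any tagger with a tag_sents method can be passed in, e.g. a trained tagger or nltk.PerceptronTagger():

```python
import os
from nltk.corpus.reader import TaggedCorpusReader

def tag_corpus(plain_corpus, tagger, out_path):
    """Hypothetical helper (not an NLTK API): tag every sentence of a
    plaintext corpus, write the result in word/TAG format, and return a
    TaggedCorpusReader over the output file."""
    tagged = tagger.tag_sents(plain_corpus.sents())
    with open(out_path, "w") as f:
        for sent in tagged:
            f.write(" ".join(w + "/" + t for (w, t) in sent) + "\n")
    return TaggedCorpusReader(os.path.dirname(out_path) or ".",
                              os.path.basename(out_path))
```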
Answer 1:
I got a working solution for this: kindly refer to the link for the step-by-step procedure, and download the necessary files from here.
Once you follow the commands, a pickle file will be generated, and this is your tagged corpus.
Once the pickle file is generated, you can check whether your tagger is working correctly by running the following piece of code:
import nltk.data
tagger = nltk.data.load("taggers/NAME_OF_TAGGER.pickle")
tagger.tag(['some', 'words', 'in', 'a', 'sentence'])
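The linked files are not reproduced here, but the pickle mechanism itself can be sketched with a toy UnigramTagger (hypothetical training data; the real procedure trains on your own corpus, and NAME_OF_TAGGER.pickle corresponds to the file written below):

```python
import os
import pickle
import tempfile
from nltk.tag import UnigramTagger

# Train a tiny tagger on hand-made data, pickle it, and load it back.
# Loading via pickle directly is equivalent to nltk.data.load on a
# pickle file placed in the NLTK data path.
train_sents = [[("A", "DT"), ("dog", "NN"), ("barks", "VBZ")]]
tagger = UnigramTagger(train_sents)

path = os.path.join(tempfile.mkdtemp(), "my_tagger.pickle")
with open(path, "wb") as f:
    pickle.dump(tagger, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded.tag(["A", "dog"]))
# [('A', 'DT'), ('dog', 'NN')]
```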
Source: https://stackoverflow.com/questions/38020141/how-to-build-pos-tagged-corpus-with-nltk