Question:

I am working with the BLLIP 1987-89 WSJ Corpus Release 1 (https://catalog.ldc.upenn.edu/LDC2000T43).

I am trying to use NLTK's SyntaxCorpusReader class to read in the parsed sentences. I'm trying to get it to work with a simple example of just 1 file. Here is my code...

from nltk.corpus.reader import SyntaxCorpusReader
path = '/corpus/wsj'
filename = 'wsj1'
reader = SyntaxCorpusReader('/corpus/wsj','wsj1')

I am able to see the raw text from the file. It returns a string of the parsed sentences.

reader.raw()
u"(S1 (S (PP-LOC (IN In)\n\t(NP (NP (DT a) (NN move))\n\t (SBAR (WHNP#0 (WDT that))\n\t  (S (NP-SBJ (-NONE- *T*-0))\n\t   (VP (MD would)\n\t    (VP (VB represent)\n\t     (NP (NP (DT a) (JJ major) (NN break))\n\t      (PP (IN with) (NP (NN tradition))))\n\t     (PP-LOC (IN in)\n\t      (NP#1004 (DT the) (JJ legal) (NN profession)))))))))\n     (, ,)\n     (NP-SBJ#1005 (NP (NN law) (NNS firms))\n      (PP-LOC (IN in) (NP#1006 (DT this) (NN city))))\n     (VP (MD may)\n      (VP (VB become)\n       (NP (NP (DT the) (JJ first))\n\t(PP-LOC (IN in) (NP (DT the) (NN nation)))\n\t(SBAR (WHNP#1 (-NONE- 0))\n\t (S (NP-SBJ (-NONE- *T*-1))\n\t  (VP (TO to)\n\t   (VP (VB reward)\n\t    (NP#1009 (NNS non-lawyers))\n\t    (PP-MNR-CLR (IN with)\n\t     (NP#1010 (NP (DT the) (VBN cherished) (NN title))\n\t      (PP (IN of) (NP (NN partner))))))))))))\n     (. .)))\n...'

But when I try to get the parsed sentences, I receive an error.

reader.parsed_sents()
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/nltk/compat.py", line 487, in wrapper
    return method(self).encode('ascii', 'backslashreplace')
  File "/usr/lib/python2.7/dist-packages/nltk/util.py", line 664, in __repr__
    for elt in self:
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 430, in _read_parsed_sent_block
    return list(filter(None, [self._parse(t) for t in self._read_block(stream)]))
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 378, in _read_block
    raise NotImplementedError()
NotImplementedError

I'm not sure what the issue is. My goal is to read in the parsed sentences and use NLTK's Tree class to extract the text of the sentences, and perhaps navigate the tree structure.

Answer 1:

Hah, had me going for a while there. That NotImplementedError is not a bug; it's NLTK's way of telling you that you're using an incomplete class. SyntaxCorpusReader is an "abstract class", intended as a basis for readers of corpora with specific, complex syntax. It leaves _read_block() for subclasses to implement, which is exactly the call that raises at the bottom of your traceback. In your case, you just need to use BracketParseCorpusReader instead:

from nltk.corpus.reader import BracketParseCorpusReader

reader = BracketParseCorpusReader('/corpus/wsj', 'wsj1')
print(reader.parsed_sents()[0])
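To cover the second half of the question (getting the sentence text and walking the tree), here is a minimal sketch. It assumes a recent NLTK on Python 3, and it writes a tiny made-up bracketed sentence to a temporary directory so it can run without the LDC data; SAMPLE, load_trees and the temporary file name are illustrative only and are not part of the corpus or of NLTK.

```python
import os
import tempfile

from nltk.corpus.reader import BracketParseCorpusReader

# Hypothetical stand-in for one corpus file: a single bracketed parse.
# Continuation lines are indented, as in the raw output shown in the question.
SAMPLE = """(S1 (S (NP-SBJ (NN law) (NNS firms))
     (VP (MD may)
      (VP (VB reward)
       (NP (NNS non-lawyers))))
     (. .)))
"""


def load_trees(root, fileid):
    """Read every parsed sentence in `fileid` under `root` into a list of Trees."""
    reader = BracketParseCorpusReader(root, fileid)
    # list() forces the lazy corpus view to read the whole file now.
    return list(reader.parsed_sents())


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "wsj1"), "w", encoding="utf8") as f:
        f.write(SAMPLE)
    trees = load_trees(root, "wsj1")

tree = trees[0]

# Sentence text: the leaves of the tree, in order.
words = tree.leaves()

# Navigation: collect the words under every constituent whose label starts with "NP".
nps = [
    " ".join(sub.leaves())
    for sub in tree.subtrees(lambda t: t.label().startswith("NP"))
]

print(" ".join(words))
print(nps)
```

With the real data, the same calls should apply once root and fileid point at '/corpus/wsj' and 'wsj1'. Labels such as NP#1004 and trace leaves such as (-NONE- *T*-0) are kept as they appear in the file, so you may want to filter out -NONE- nodes before joining the leaves into text.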

Source: How to read corpus of parsed sentences using NLTK in python?
