I am working with the BLLIP 1987-89 WSJ Corpus Release 1 (https://catalog.ldc.upenn.edu/LDC2000T43).
I am trying to use NLTK's SyntaxCorpusReader class to read in the parsed sentences. To keep things simple, I am starting with a single file. Here is my code:
from nltk.corpus.reader import SyntaxCorpusReader
path = '/corpus/wsj'
filename = 'wsj1'
reader = SyntaxCorpusReader(path, filename)
I am able to see the raw text from the file: reader.raw() returns the parsed sentences as a single string.
reader.raw()
u"(S1 (S (PP-LOC (IN In)\n\t(NP (NP (DT a) (NN move))\n\t (SBAR (WHNP#0 (WDT that))\n\t (S (NP-SBJ (-NONE- *T*-0))\n\t (VP (MD would)\n\t (VP (VB represent)\n\t (NP (NP (DT a) (JJ major) (NN break))\n\t (PP (IN with) (NP (NN tradition))))\n\t (PP-LOC (IN in)\n\t (NP#1004 (DT the) (JJ legal) (NN profession)))))))))\n (, ,)\n (NP-SBJ#1005 (NP (NN law) (NNS firms))\n (PP-LOC (IN in) (NP#1006 (DT this) (NN city))))\n (VP (MD may)\n (VP (VB become)\n (NP (NP (DT the) (JJ first))\n\t(PP-LOC (IN in) (NP (DT the) (NN nation)))\n\t(SBAR (WHNP#1 (-NONE- 0))\n\t (S (NP-SBJ (-NONE- *T*-1))\n\t (VP (TO to)\n\t (VP (VB reward)\n\t (NP#1009 (NNS non-lawyers))\n\t (PP-MNR-CLR (IN with)\n\t (NP#1010 (NP (DT the) (VBN cherished) (NN title))\n\t (PP (IN of) (NP (NN partner))))))))))))\n (. .)))\n...'
But when I try to get the parsed sentences, I receive an error.
reader.parsed_sents()
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/nltk/compat.py", line 487, in wrapper
    return method(self).encode('ascii', 'backslashreplace')
  File "/usr/lib/python2.7/dist-packages/nltk/util.py", line 664, in __repr__
    for elt in self:
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 430, in _read_parsed_sent_block
    return list(filter(None, [self._parse(t) for t in self._read_block(stream)]))
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 378, in _read_block
    raise NotImplementedError()
NotImplementedError
I'm not sure what the issue is. My goal is to read in the parsed sentences and use NLTK's Tree class to extract the text of each sentence, and perhaps navigate the tree structure.
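To make the goal concrete, here is a minimal sketch of what I would like to do with each tree once parsing works. It does not go through the corpus reader at all: it assumes NLTK 3's Tree.fromstring and feeds it a fragment copied by hand from the raw output above (with the whitespace collapsed). The variable names are only for illustration.

```python
from nltk.tree import Tree

# A fragment copied from the reader.raw() output (newlines/tabs collapsed).
bracketed = (
    "(NP-SBJ#1005 (NP (NN law) (NNS firms)) "
    "(PP-LOC (IN in) (NP#1006 (DT this) (NN city))))"
)

# Build a Tree from the bracketed string (assumes NLTK 3.x).
tree = Tree.fromstring(bracketed)

# Extract the plain text: the leaves are the word tokens, in order.
words = tree.leaves()
text = " ".join(words)

# Navigate the structure: root label, first child, and all noun preterminals.
root_label = tree.label()
first_child_label = tree[0].label()
nouns = [st[0] for st in tree.subtrees(lambda t: t.label() in ("NN", "NNS"))]

print(text)
print(root_label, first_child_label, nouns)
```

Ideally, reader.parsed_sents() would hand me Tree objects like this one for every sentence in the file.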