Is it possible to use Stanford Parser in NLTK? (I am not talking about Stanford POS.)
A new development of the Stanford parser based on a neural model, trained using Tensorflow is very recently made available to be used as a python API. This model is supposed to be far more accurate than the Java-based moel. You can certainly integrate with an NLTK pipeline.
Link to the parser. Ther repository contains pre-trained parser models for 53 languages.
The Stanford Core NLP software page has a list of python wrappers:
http://nlp.stanford.edu/software/corenlp.shtml#Extensions
Note that this answer applies to NLTK v 3.0, and not to more recent versions.
I cannot leave this as a comment because of reputation, but since I spent (wasted?) some time solving this I would rather share my problem/solution to get this parser to work in NLTK.
In the excellent answer from alvas, it is mentioned that:
e.g. for the Parser, there won't be a model directory.
This led me wrongly to:
STANFORD_MODELS
(and only care about my CLASSPATH
)../path/tostanford-parser-full-2015-2012-09/models directory
* virtually empty* (or with a jar file whose name did not match nltk regex)!If the OP, like me, just wanted to use the parser, it may be confusing that when not downloading anything else (no POStagger, no NER,...) and following all these instructions, we still get an error.
Eventually, for any CLASSPATH
given (following examples and explanations in answers from this thread) I would still get the error:
NLTK was unable to find stanford-parser-(\d+)(.(\d+))+-models.jar! Set the CLASSPATH environment variable. For more information, on stanford-parser-(\d+)(.(\d+))+-models.jar,
see: http://nlp.stanford.edu/software/lex-parser.shtml
OR:
NLTK was unable to find stanford-parser.jar! Set the CLASSPATH environment variable. For more information, on stanford-parser.jar, see: http://nlp.stanford.edu/software/lex-parser.shtml
Though, importantly, I could correctly load and use the parser if I called the function with all arguments and path fully specified, as in:
stanford_parser_jar = '../lib/stanford-parser-full-2015-04-20/stanford-parser.jar'
stanford_model_jar = '../lib/stanford-parser-full-2015-04-20/stanfor-parser-3.5.2-models.jar'
parser = StanfordParser(path_to_jar=stanford_parser_jar,
path_to_models_jar=stanford_model_jar)
Therefore the error came from NLTK
and how it is looking for jars using the supplied STANFORD_MODELS
and CLASSPATH
environment variables. To solve this, the *-models.jar
, with the correct formatting (to match the regex in NLTK
code, so no -corenlp-....jar) must be located in the folder designated by STANFORD_MODELS
.
Namely, I first created:
mkdir stanford-parser-full-2015-12-09/models
Then added in .bashrc
:
export STANFORD_MODELS=/path/to/stanford-parser-full-2015-12-09/models
And finally, by copying stanford-parser-3.6.0-models.jar
(or corresponding version), into:
path/to/stanford-parser-full-2015-12-09/models/
I could get StanfordParser
to load smoothly in python with the classic CLASSPATH
that points to stanford-parser.jar
. Actually, as such, you can call StanfordParser
with no parameters, the default will just work.
I took many hours and finally found a simple solution for Windows users. Basically its summarized version of an existing answer by alvas, but made easy to follow(hopefully) for those who are new to stanford NLP and are Window users.
1) Download the module you want to use, such as NER, POS etc. In my case i wanted to use NER, so i downloaded the module from http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip
2) Unzip the file.
3) Set the environment variables(classpath and stanford_modules) from the unzipped folder.
import os
os.environ['CLASSPATH'] = "C:/Users/Downloads/stanford-ner-2015-04-20/stanford-ner.jar"
os.environ['STANFORD_MODELS'] = "C:/Users/Downloads/stanford-ner-2015-04-20/classifiers/"
4) set the environment variables for JAVA, as in where you have JAVA installed. for me it was below
os.environ['JAVAHOME'] = "C:/Program Files/Java/jdk1.8.0_102/bin/java.exe"
5) import the module you want
from nltk.tag import StanfordNERTagger
6) call the pretrained model which is present in classifier folder in the unzipped folder. add ".gz" in the end for file extension. for me the model i wanted to use was english.all.3class.distsim.crf.ser
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
7) Now execute the parser!! and we are done!!
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
As of NLTK v3.3, users should avoid the Stanford NER or POS taggers from nltk.tag
, and avoid Stanford tokenizer/segmenter from nltk.tokenize
.
Instead use the new nltk.parse.corenlp.CoreNLPParser
API.
Please see https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK
(Avoiding link only answer, I've pasted the docs from NLTK github wiki below)
First, update your NLTK
pip3 install -U nltk # Make sure is >=3.3
Then download the necessary CoreNLP packages:
cd ~
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
unzip stanford-corenlp-full-2018-02-27.zip
cd stanford-corenlp-full-2018-02-27
# Get the Chinese model
wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties
# Get the Arabic model
wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties
# Get the French model
wget http://nlp.stanford.edu/software/stanford-french-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-french.properties
# Get the German model
wget http://nlp.stanford.edu/software/stanford-german-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-german.properties
# Get the Spanish model
wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-spanish.properties
Still in the stanford-corenlp-full-2018-02-27
directory, start the server:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-preload tokenize,ssplit,pos,lemma,ner,parse,depparse \
-status_port 9000 -port 9000 -timeout 15000 &
Then in Python:
>>> from nltk.parse import CoreNLPParser
# Lexical Parser
>>> parser = CoreNLPParser(url='http://localhost:9000')
# Parse tokenized text.
>>> list(parser.parse('What is the airspeed of an unladen swallow ?'.split()))
[Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['airspeed'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['unladen'])])]), Tree('S', [Tree('VP', [Tree('VB', ['swallow'])])])])]), Tree('.', ['?'])])])]
# Parse raw string.
>>> list(parser.raw_parse('What is the airspeed of an unladen swallow ?'))
[Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['airspeed'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['unladen'])])]), Tree('S', [Tree('VP', [Tree('VB', ['swallow'])])])])]), Tree('.', ['?'])])])]
# Neural Dependency Parser
>>> from nltk.parse.corenlp import CoreNLPDependencyParser
>>> dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
>>> parses = dep_parser.parse('What is the airspeed of an unladen swallow ?'.split())
>>> [[(governor, dep, dependent) for governor, dep, dependent in parse.triples()] for parse in parses]
[[(('What', 'WP'), 'cop', ('is', 'VBZ')), (('What', 'WP'), 'nsubj', ('airspeed', 'NN')), (('airspeed', 'NN'), 'det', ('the', 'DT')), (('airspeed', 'NN'), 'nmod', ('swallow', 'VB')), (('swallow', 'VB'), 'case', ('of', 'IN')), (('swallow', 'VB'), 'det', ('an', 'DT')), (('swallow', 'VB'), 'amod', ('unladen', 'JJ')), (('What', 'WP'), 'punct', ('?', '.'))]]
# Tokenizer
>>> parser = CoreNLPParser(url='http://localhost:9000')
>>> list(parser.tokenize('What is the airspeed of an unladen swallow?'))
['What', 'is', 'the', 'airspeed', 'of', 'an', 'unladen', 'swallow', '?']
# POS Tagger
>>> pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
>>> list(pos_tagger.tag('What is the airspeed of an unladen swallow ?'.split()))
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
# NER Tagger
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> list(ner_tagger.tag(('Rami Eid is studying at Stony Brook University in NY'.split())))
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'STATE_OR_PROVINCE')]
Start the server a little differently, still from the `stanford-corenlp-full-2018-02-27 directory:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-chinese.properties \
-preload tokenize,ssplit,pos,lemma,ner,parse \
-status_port 9001 -port 9001 -timeout 15000
In Python:
>>> parser = CoreNLPParser('http://localhost:9001')
>>> list(parser.tokenize(u'我家没有电脑。'))
['我家', '没有', '电脑', '。']
>>> list(parser.parse(parser.tokenize(u'我家没有电脑。')))
[Tree('ROOT', [Tree('IP', [Tree('IP', [Tree('NP', [Tree('NN', ['我家'])]), Tree('VP', [Tree('VE', ['没有']), Tree('NP', [Tree('NN', ['电脑'])])])]), Tree('PU', ['。'])])])]
Start the server:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-arabic.properties \
-preload tokenize,ssplit,pos,parse \
-status_port 9005 -port 9005 -timeout 15000
In Python:
>>> from nltk.parse import CoreNLPParser
>>> parser = CoreNLPParser('http://localhost:9005')
>>> text = u'انا حامل'
# Parser.
>>> parser.raw_parse(text)
<list_iterator object at 0x7f0d894c9940>
>>> list(parser.raw_parse(text))
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['انا'])]), Tree('NP', [Tree('NN', ['حامل'])])])])]
>>> list(parser.parse(parser.tokenize(text)))
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['انا'])]), Tree('NP', [Tree('NN', ['حامل'])])])])]
# Tokenizer / Segmenter.
>>> list(parser.tokenize(text))
['انا', 'حامل']
# POS tagg
>>> pos_tagger = CoreNLPParser('http://localhost:9005', tagtype='pos')
>>> list(pos_tagger.tag(parser.tokenize(text)))
[('انا', 'PRP'), ('حامل', 'NN')]
# NER tag
>>> ner_tagger = CoreNLPParser('http://localhost:9005', tagtype='ner')
>>> list(ner_tagger.tag(parser.tokenize(text)))
[('انا', 'O'), ('حامل', 'O')]
Start the server:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-french.properties \
-preload tokenize,ssplit,pos,parse \
-status_port 9004 -port 9004 -timeout 15000
In Python:
>>> parser = CoreNLPParser('http://localhost:9004')
>>> list(parser.parse('Je suis enceinte'.split()))
[Tree('ROOT', [Tree('SENT', [Tree('NP', [Tree('PRON', ['Je']), Tree('VERB', ['suis']), Tree('AP', [Tree('ADJ', ['enceinte'])])])])])]
>>> pos_tagger = CoreNLPParser('http://localhost:9004', tagtype='pos')
>>> pos_tagger.tag('Je suis enceinte'.split())
[('Je', 'PRON'), ('suis', 'VERB'), ('enceinte', 'ADJ')]
Start the server:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-german.properties \
-preload tokenize,ssplit,pos,ner,parse \
-status_port 9002 -port 9002 -timeout 15000
In Python:
>>> parser = CoreNLPParser('http://localhost:9002')
>>> list(parser.raw_parse('Ich bin schwanger'))
[Tree('ROOT', [Tree('NUR', [Tree('S', [Tree('PPER', ['Ich']), Tree('VAFIN', ['bin']), Tree('AP', [Tree('ADJD', ['schwanger'])])])])])]
>>> list(parser.parse('Ich bin schwanger'.split()))
[Tree('ROOT', [Tree('NUR', [Tree('S', [Tree('PPER', ['Ich']), Tree('VAFIN', ['bin']), Tree('AP', [Tree('ADJD', ['schwanger'])])])])])]
>>> pos_tagger = CoreNLPParser('http://localhost:9002', tagtype='pos')
>>> pos_tagger.tag('Ich bin schwanger'.split())
[('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD')]
>>> pos_tagger = CoreNLPParser('http://localhost:9002', tagtype='pos')
>>> pos_tagger.tag('Ich bin schwanger'.split())
[('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD')]
>>> ner_tagger = CoreNLPParser('http://localhost:9002', tagtype='ner')
>>> ner_tagger.tag('Donald Trump besuchte Angela Merkel in Berlin.'.split())
[('Donald', 'PERSON'), ('Trump', 'PERSON'), ('besuchte', 'O'), ('Angela', 'PERSON'), ('Merkel', 'PERSON'), ('in', 'O'), ('Berlin', 'LOCATION'), ('.', 'O')]
Start the server:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-spanish.properties \
-preload tokenize,ssplit,pos,ner,parse \
-status_port 9003 -port 9003 -timeout 15000
In Python:
>>> pos_tagger = CoreNLPParser('http://localhost:9003', tagtype='pos')
>>> pos_tagger.tag(u'Barack Obama salió con Michael Jackson .'.split())
[('Barack', 'PROPN'), ('Obama', 'PROPN'), ('salió', 'VERB'), ('con', 'ADP'), ('Michael', 'PROPN'), ('Jackson', 'PROPN'), ('.', 'PUNCT')]
>>> ner_tagger = CoreNLPParser('http://localhost:9003', tagtype='ner')
>>> ner_tagger.tag(u'Barack Obama salió con Michael Jackson .'.split())
[('Barack', 'PERSON'), ('Obama', 'PERSON'), ('salió', 'O'), ('con', 'O'), ('Michael', 'PERSON'), ('Jackson', 'PERSON'), ('.', 'O')]
You can use the Stanford Parsers output to create a Tree in nltk (nltk.tree.Tree).
Assuming the stanford parser gives you a file in which there is exactly one parse tree for every sentence. Then this example works, though it might not look very pythonic:
f = open(sys.argv[1]+".output"+".30"+".stp", "r")
parse_trees_text=[]
tree = ""
for line in f:
if line.isspace():
parse_trees_text.append(tree)
tree = ""
elif "(. ...))" in line:
#print "YES"
tree = tree+')'
parse_trees_text.append(tree)
tree = ""
else:
tree = tree + line
parse_trees=[]
for t in parse_trees_text:
tree = nltk.Tree(t)
tree.__delitem__(len(tree)-1) #delete "(. .))" from tree (you don't need that)
s = traverse(tree)
parse_trees.append(tree)