Is it possible to use Stanford Parser in NLTK? (I am not talking about Stanford POS.)
As of NLTK v3.3, users should avoid the Stanford NER or POS taggers from nltk.tag
, and avoid Stanford tokenizer/segmenter from nltk.tokenize
.
Instead use the new nltk.parse.corenlp.CoreNLPParser
API.
Please see https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK
(Avoiding link only answer, I've pasted the docs from NLTK github wiki below)
First, update your NLTK
pip3 install -U nltk # Make sure is >=3.3
Then download the necessary CoreNLP packages:
cd ~
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
unzip stanford-corenlp-full-2018-02-27.zip
cd stanford-corenlp-full-2018-02-27
# Get the Chinese model
wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties
# Get the Arabic model
wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties
# Get the French model
wget http://nlp.stanford.edu/software/stanford-french-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-french.properties
# Get the German model
wget http://nlp.stanford.edu/software/stanford-german-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-german.properties
# Get the Spanish model
wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-spanish.properties
Still in the stanford-corenlp-full-2018-02-27
directory, start the server:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-preload tokenize,ssplit,pos,lemma,ner,parse,depparse \
-status_port 9000 -port 9000 -timeout 15000 &
Then in Python:
>>> from nltk.parse import CoreNLPParser
# Lexical Parser
>>> parser = CoreNLPParser(url='http://localhost:9000')
# Parse tokenized text.
>>> list(parser.parse('What is the airspeed of an unladen swallow ?'.split()))
[Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['airspeed'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['unladen'])])]), Tree('S', [Tree('VP', [Tree('VB', ['swallow'])])])])]), Tree('.', ['?'])])])]
# Parse raw string.
>>> list(parser.raw_parse('What is the airspeed of an unladen swallow ?'))
[Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['airspeed'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['unladen'])])]), Tree('S', [Tree('VP', [Tree('VB', ['swallow'])])])])]), Tree('.', ['?'])])])]
# Neural Dependency Parser
>>> from nltk.parse.corenlp import CoreNLPDependencyParser
>>> dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
>>> parses = dep_parser.parse('What is the airspeed of an unladen swallow ?'.split())
>>> [[(governor, dep, dependent) for governor, dep, dependent in parse.triples()] for parse in parses]
[[(('What', 'WP'), 'cop', ('is', 'VBZ')), (('What', 'WP'), 'nsubj', ('airspeed', 'NN')), (('airspeed', 'NN'), 'det', ('the', 'DT')), (('airspeed', 'NN'), 'nmod', ('swallow', 'VB')), (('swallow', 'VB'), 'case', ('of', 'IN')), (('swallow', 'VB'), 'det', ('an', 'DT')), (('swallow', 'VB'), 'amod', ('unladen', 'JJ')), (('What', 'WP'), 'punct', ('?', '.'))]]
# Tokenizer
>>> parser = CoreNLPParser(url='http://localhost:9000')
>>> list(parser.tokenize('What is the airspeed of an unladen swallow?'))
['What', 'is', 'the', 'airspeed', 'of', 'an', 'unladen', 'swallow', '?']
# POS Tagger
>>> pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
>>> list(pos_tagger.tag('What is the airspeed of an unladen swallow ?'.split()))
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
# NER Tagger
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> list(ner_tagger.tag(('Rami Eid is studying at Stony Brook University in NY'.split())))
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'STATE_OR_PROVINCE')]
Start the server a little differently, still from the `stanford-corenlp-full-2018-02-27 directory:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-chinese.properties \
-preload tokenize,ssplit,pos,lemma,ner,parse \
-status_port 9001 -port 9001 -timeout 15000
In Python:
>>> parser = CoreNLPParser('http://localhost:9001')
>>> list(parser.tokenize(u'我家没有电脑。'))
['我家', '没有', '电脑', '。']
>>> list(parser.parse(parser.tokenize(u'我家没有电脑。')))
[Tree('ROOT', [Tree('IP', [Tree('IP', [Tree('NP', [Tree('NN', ['我家'])]), Tree('VP', [Tree('VE', ['没有']), Tree('NP', [Tree('NN', ['电脑'])])])]), Tree('PU', ['。'])])])]
Start the server:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-arabic.properties \
-preload tokenize,ssplit,pos,parse \
-status_port 9005 -port 9005 -timeout 15000
In Python:
>>> from nltk.parse import CoreNLPParser
>>> parser = CoreNLPParser('http://localhost:9005')
>>> text = u'انا حامل'
# Parser.
>>> parser.raw_parse(text)
>>> list(parser.raw_parse(text))
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['انا'])]), Tree('NP', [Tree('NN', ['حامل'])])])])]
>>> list(parser.parse(parser.tokenize(text)))
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['انا'])]), Tree('NP', [Tree('NN', ['حامل'])])])])]
# Tokenizer / Segmenter.
>>> list(parser.tokenize(text))
['انا', 'حامل']
# POS tagg
>>> pos_tagger = CoreNLPParser('http://localhost:9005', tagtype='pos')
>>> list(pos_tagger.tag(parser.tokenize(text)))
[('انا', 'PRP'), ('حامل', 'NN')]
# NER tag
>>> ner_tagger = CoreNLPParser('http://localhost:9005', tagtype='ner')
>>> list(ner_tagger.tag(parser.tokenize(text)))
[('انا', 'O'), ('حامل', 'O')]
Start the server:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-french.properties \
-preload tokenize,ssplit,pos,parse \
-status_port 9004 -port 9004 -timeout 15000
In Python:
>>> parser = CoreNLPParser('http://localhost:9004')
>>> list(parser.parse('Je suis enceinte'.split()))
[Tree('ROOT', [Tree('SENT', [Tree('NP', [Tree('PRON', ['Je']), Tree('VERB', ['suis']), Tree('AP', [Tree('ADJ', ['enceinte'])])])])])]
>>> pos_tagger = CoreNLPParser('http://localhost:9004', tagtype='pos')
>>> pos_tagger.tag('Je suis enceinte'.split())
[('Je', 'PRON'), ('suis', 'VERB'), ('enceinte', 'ADJ')]
Start the server:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-german.properties \
-preload tokenize,ssplit,pos,ner,parse \
-status_port 9002 -port 9002 -timeout 15000
In Python:
>>> parser = CoreNLPParser('http://localhost:9002')
>>> list(parser.raw_parse('Ich bin schwanger'))
[Tree('ROOT', [Tree('NUR', [Tree('S', [Tree('PPER', ['Ich']), Tree('VAFIN', ['bin']), Tree('AP', [Tree('ADJD', ['schwanger'])])])])])]
>>> list(parser.parse('Ich bin schwanger'.split()))
[Tree('ROOT', [Tree('NUR', [Tree('S', [Tree('PPER', ['Ich']), Tree('VAFIN', ['bin']), Tree('AP', [Tree('ADJD', ['schwanger'])])])])])]
>>> pos_tagger = CoreNLPParser('http://localhost:9002', tagtype='pos')
>>> pos_tagger.tag('Ich bin schwanger'.split())
[('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD')]
>>> pos_tagger = CoreNLPParser('http://localhost:9002', tagtype='pos')
>>> pos_tagger.tag('Ich bin schwanger'.split())
[('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD')]
>>> ner_tagger = CoreNLPParser('http://localhost:9002', tagtype='ner')
>>> ner_tagger.tag('Donald Trump besuchte Angela Merkel in Berlin.'.split())
[('Donald', 'PERSON'), ('Trump', 'PERSON'), ('besuchte', 'O'), ('Angela', 'PERSON'), ('Merkel', 'PERSON'), ('in', 'O'), ('Berlin', 'LOCATION'), ('.', 'O')]
Start the server:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-spanish.properties \
-preload tokenize,ssplit,pos,ner,parse \
-status_port 9003 -port 9003 -timeout 15000
In Python:
>>> pos_tagger = CoreNLPParser('http://localhost:9003', tagtype='pos')
>>> pos_tagger.tag(u'Barack Obama salió con Michael Jackson .'.split())
[('Barack', 'PROPN'), ('Obama', 'PROPN'), ('salió', 'VERB'), ('con', 'ADP'), ('Michael', 'PROPN'), ('Jackson', 'PROPN'), ('.', 'PUNCT')]
>>> ner_tagger = CoreNLPParser('http://localhost:9003', tagtype='ner')
>>> ner_tagger.tag(u'Barack Obama salió con Michael Jackson .'.split())
[('Barack', 'PERSON'), ('Obama', 'PERSON'), ('salió', 'O'), ('con', 'O'), ('Michael', 'PERSON'), ('Jackson', 'PERSON'), ('.', 'O')]