Stanford Parser and NLTK

后端未结

关注

 18  2361

既然无缘

Is it possible to use Stanford Parser in NLTK? (I am not talking about Stanford POS.)

相关标签:

18条回答

梦如初夏

2020-11-22 02:27

A new development of the Stanford parser based on a neural model, trained using Tensorflow is very recently made available to be used as a python API. This model is supposed to be far more accurate than the Java-based moel. You can certainly integrate with an NLTK pipeline.

Link to the parser. Ther repository contains pre-trained parser models for 53 languages.

0 讨论(0)
发布评论:

提交评论
- 加载中...
盖世英雄少女心

2020-11-22 02:30

The Stanford Core NLP software page has a list of python wrappers:

http://nlp.stanford.edu/software/corenlp.shtml#Extensions

0 讨论(0)
发布评论:

提交评论
- 加载中...
伪装坚强ぢ

2020-11-22 02:30
Note that this answer applies to NLTK v 3.0, and not to more recent versions.

I cannot leave this as a comment because of reputation, but since I spent (wasted?) some time solving this I would rather share my problem/solution to get this parser to work in NLTK.

In the excellent answer from alvas, it is mentioned that:

e.g. for the Parser, there won't be a model directory.

This led me wrongly to:
- not be careful to the value I put to STANFORD_MODELS (and only care about my CLASSPATH)
- leave ../path/tostanford-parser-full-2015-2012-09/models directory * virtually empty* (or with a jar file whose name did not match nltk regex)!
If the OP, like me, just wanted to use the parser, it may be confusing that when not downloading anything else (no POStagger, no NER,...) and following all these instructions, we still get an error.

Eventually, for any CLASSPATH given (following examples and explanations in answers from this thread) I would still get the error:

NLTK was unable to find stanford-parser-(\d+)(.(\d+))+-models.jar! Set the CLASSPATH environment variable. For more information, on stanford-parser-(\d+)(.(\d+))+-models.jar,

see: http://nlp.stanford.edu/software/lex-parser.shtml

OR:

NLTK was unable to find stanford-parser.jar! Set the CLASSPATH environment variable. For more information, on stanford-parser.jar, see: http://nlp.stanford.edu/software/lex-parser.shtml

Though, importantly, I could correctly load and use the parser if I called the function with all arguments and path fully specified, as in:
```
stanford_parser_jar = '../lib/stanford-parser-full-2015-04-20/stanford-parser.jar'
stanford_model_jar = '../lib/stanford-parser-full-2015-04-20/stanfor-parser-3.5.2-models.jar'    
parser = StanfordParser(path_to_jar=stanford_parser_jar, 
                    path_to_models_jar=stanford_model_jar)
```
Solution for Parser alone:

Therefore the error came from NLTK and how it is looking for jars using the supplied STANFORD_MODELS and CLASSPATH environment variables. To solve this, the *-models.jar, with the correct formatting (to match the regex in NLTK code, so no -corenlp-....jar) must be located in the folder designated by STANFORD_MODELS.

Namely, I first created:
```
mkdir stanford-parser-full-2015-12-09/models
```
Then added in .bashrc:
```
export STANFORD_MODELS=/path/to/stanford-parser-full-2015-12-09/models
```
And finally, by copying stanford-parser-3.6.0-models.jar (or corresponding version), into:
```
path/to/stanford-parser-full-2015-12-09/models/
```
I could get StanfordParser to load smoothly in python with the classic CLASSPATH that points to stanford-parser.jar. Actually, as such, you can call StanfordParser with no parameters, the default will just work.
0 讨论(0)
发布评论:

提交评论
- 加载中...
挽巷

2020-11-22 02:32
I took many hours and finally found a simple solution for Windows users. Basically its summarized version of an existing answer by alvas, but made easy to follow(hopefully) for those who are new to stanford NLP and are Window users.

1) Download the module you want to use, such as NER, POS etc. In my case i wanted to use NER, so i downloaded the module from http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip

2) Unzip the file.

3) Set the environment variables(classpath and stanford_modules) from the unzipped folder.
```
import os
os.environ['CLASSPATH'] = "C:/Users/Downloads/stanford-ner-2015-04-20/stanford-ner.jar"
os.environ['STANFORD_MODELS'] = "C:/Users/Downloads/stanford-ner-2015-04-20/classifiers/"
```
4) set the environment variables for JAVA, as in where you have JAVA installed. for me it was below
```
os.environ['JAVAHOME'] = "C:/Program Files/Java/jdk1.8.0_102/bin/java.exe"
```
5) import the module you want
```
from nltk.tag import StanfordNERTagger
```
6) call the pretrained model which is present in classifier folder in the unzipped folder. add ".gz" in the end for file extension. for me the model i wanted to use was english.all.3class.distsim.crf.ser
```
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
```
7) Now execute the parser!! and we are done!!
```
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

孤街浪徒

2020-11-22 02:36

As of NLTK v3.3, users should avoid the Stanford NER or POS taggers from nltk.tag, and avoid Stanford tokenizer/segmenter from nltk.tokenize.

Instead use the new nltk.parse.corenlp.CoreNLPParser API.

Please see https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK

(Avoiding link only answer, I've pasted the docs from NLTK github wiki below)

First, update your NLTK

pip3 install -U nltk # Make sure is >=3.3

Then download the necessary CoreNLP packages:

cd ~
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
unzip stanford-corenlp-full-2018-02-27.zip
cd stanford-corenlp-full-2018-02-27

# Get the Chinese model 
wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties 

# Get the Arabic model
wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties 

# Get the French model
wget http://nlp.stanford.edu/software/stanford-french-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-french.properties 

# Get the German model
wget http://nlp.stanford.edu/software/stanford-german-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-german.properties 


# Get the Spanish model
wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2018-02-27-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-spanish.properties

English

Still in the stanford-corenlp-full-2018-02-27 directory, start the server:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-preload tokenize,ssplit,pos,lemma,ner,parse,depparse \
-status_port 9000 -port 9000 -timeout 15000 &

Then in Python:

>>> from nltk.parse import CoreNLPParser

# Lexical Parser
>>> parser = CoreNLPParser(url='http://localhost:9000')

# Parse tokenized text.
>>> list(parser.parse('What is the airspeed of an unladen swallow ?'.split()))
[Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['airspeed'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['unladen'])])]), Tree('S', [Tree('VP', [Tree('VB', ['swallow'])])])])]), Tree('.', ['?'])])])]

# Parse raw string.
>>> list(parser.raw_parse('What is the airspeed of an unladen swallow ?'))
[Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['airspeed'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['unladen'])])]), Tree('S', [Tree('VP', [Tree('VB', ['swallow'])])])])]), Tree('.', ['?'])])])]

# Neural Dependency Parser
>>> from nltk.parse.corenlp import CoreNLPDependencyParser
>>> dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
>>> parses = dep_parser.parse('What is the airspeed of an unladen swallow ?'.split())
>>> [[(governor, dep, dependent) for governor, dep, dependent in parse.triples()] for parse in parses]
[[(('What', 'WP'), 'cop', ('is', 'VBZ')), (('What', 'WP'), 'nsubj', ('airspeed', 'NN')), (('airspeed', 'NN'), 'det', ('the', 'DT')), (('airspeed', 'NN'), 'nmod', ('swallow', 'VB')), (('swallow', 'VB'), 'case', ('of', 'IN')), (('swallow', 'VB'), 'det', ('an', 'DT')), (('swallow', 'VB'), 'amod', ('unladen', 'JJ')), (('What', 'WP'), 'punct', ('?', '.'))]]


# Tokenizer
>>> parser = CoreNLPParser(url='http://localhost:9000')
>>> list(parser.tokenize('What is the airspeed of an unladen swallow?'))
['What', 'is', 'the', 'airspeed', 'of', 'an', 'unladen', 'swallow', '?']

# POS Tagger
>>> pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
>>> list(pos_tagger.tag('What is the airspeed of an unladen swallow ?'.split()))
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]

# NER Tagger
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> list(ner_tagger.tag(('Rami Eid is studying at Stony Brook University in NY'.split())))
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'STATE_OR_PROVINCE')]

Chinese

Start the server a little differently, still from the `stanford-corenlp-full-2018-02-27 directory:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-chinese.properties \
-preload tokenize,ssplit,pos,lemma,ner,parse \
-status_port 9001  -port 9001 -timeout 15000

In Python:

>>> parser = CoreNLPParser('http://localhost:9001')
>>> list(parser.tokenize(u'我家没有电脑。'))
['我家', '没有', '电脑', '。']

>>> list(parser.parse(parser.tokenize(u'我家没有电脑。')))
[Tree('ROOT', [Tree('IP', [Tree('IP', [Tree('NP', [Tree('NN', ['我家'])]), Tree('VP', [Tree('VE', ['没有']), Tree('NP', [Tree('NN', ['电脑'])])])]), Tree('PU', ['。'])])])]

Arabic

Start the server:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-arabic.properties \
-preload tokenize,ssplit,pos,parse \
-status_port 9005  -port 9005 -timeout 15000

In Python:

>>> from nltk.parse import CoreNLPParser
>>> parser = CoreNLPParser('http://localhost:9005')
>>> text = u'انا حامل'

# Parser.
>>> parser.raw_parse(text)
<list_iterator object at 0x7f0d894c9940>
>>> list(parser.raw_parse(text))
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['انا'])]), Tree('NP', [Tree('NN', ['حامل'])])])])]
>>> list(parser.parse(parser.tokenize(text)))
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['انا'])]), Tree('NP', [Tree('NN', ['حامل'])])])])]

# Tokenizer / Segmenter.
>>> list(parser.tokenize(text))
['انا', 'حامل']

# POS tagg
>>> pos_tagger = CoreNLPParser('http://localhost:9005', tagtype='pos')
>>> list(pos_tagger.tag(parser.tokenize(text)))
[('انا', 'PRP'), ('حامل', 'NN')]


# NER tag
>>> ner_tagger = CoreNLPParser('http://localhost:9005', tagtype='ner')
>>> list(ner_tagger.tag(parser.tokenize(text)))
[('انا', 'O'), ('حامل', 'O')]

French

Start the server:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-french.properties \
-preload tokenize,ssplit,pos,parse \
-status_port 9004  -port 9004 -timeout 15000

In Python:

>>> parser = CoreNLPParser('http://localhost:9004')
>>> list(parser.parse('Je suis enceinte'.split()))
[Tree('ROOT', [Tree('SENT', [Tree('NP', [Tree('PRON', ['Je']), Tree('VERB', ['suis']), Tree('AP', [Tree('ADJ', ['enceinte'])])])])])]
>>> pos_tagger = CoreNLPParser('http://localhost:9004', tagtype='pos')
>>> pos_tagger.tag('Je suis enceinte'.split())
[('Je', 'PRON'), ('suis', 'VERB'), ('enceinte', 'ADJ')]

German

Start the server:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-german.properties \
-preload tokenize,ssplit,pos,ner,parse \
-status_port 9002  -port 9002 -timeout 15000

In Python:

>>> parser = CoreNLPParser('http://localhost:9002')
>>> list(parser.raw_parse('Ich bin schwanger'))
[Tree('ROOT', [Tree('NUR', [Tree('S', [Tree('PPER', ['Ich']), Tree('VAFIN', ['bin']), Tree('AP', [Tree('ADJD', ['schwanger'])])])])])]
>>> list(parser.parse('Ich bin schwanger'.split()))
[Tree('ROOT', [Tree('NUR', [Tree('S', [Tree('PPER', ['Ich']), Tree('VAFIN', ['bin']), Tree('AP', [Tree('ADJD', ['schwanger'])])])])])]


>>> pos_tagger = CoreNLPParser('http://localhost:9002', tagtype='pos')
>>> pos_tagger.tag('Ich bin schwanger'.split())
[('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD')]

>>> pos_tagger = CoreNLPParser('http://localhost:9002', tagtype='pos')
>>> pos_tagger.tag('Ich bin schwanger'.split())
[('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD')]

>>> ner_tagger = CoreNLPParser('http://localhost:9002', tagtype='ner')
>>> ner_tagger.tag('Donald Trump besuchte Angela Merkel in Berlin.'.split())
[('Donald', 'PERSON'), ('Trump', 'PERSON'), ('besuchte', 'O'), ('Angela', 'PERSON'), ('Merkel', 'PERSON'), ('in', 'O'), ('Berlin', 'LOCATION'), ('.', 'O')]

Spanish

Start the server:

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-spanish.properties \
-preload tokenize,ssplit,pos,ner,parse \
-status_port 9003  -port 9003 -timeout 15000

In Python:

>>> pos_tagger = CoreNLPParser('http://localhost:9003', tagtype='pos')
>>> pos_tagger.tag(u'Barack Obama salió con Michael Jackson .'.split())
[('Barack', 'PROPN'), ('Obama', 'PROPN'), ('salió', 'VERB'), ('con', 'ADP'), ('Michael', 'PROPN'), ('Jackson', 'PROPN'), ('.', 'PUNCT')]
>>> ner_tagger = CoreNLPParser('http://localhost:9003', tagtype='ner')
>>> ner_tagger.tag(u'Barack Obama salió con Michael Jackson .'.split())
[('Barack', 'PERSON'), ('Obama', 'PERSON'), ('salió', 'O'), ('con', 'O'), ('Michael', 'PERSON'), ('Jackson', 'PERSON'), ('.', 'O')]

0 讨论(0)

北荒

2020-11-22 02:37

You can use the Stanford Parsers output to create a Tree in nltk (nltk.tree.Tree).

Assuming the stanford parser gives you a file in which there is exactly one parse tree for every sentence. Then this example works, though it might not look very pythonic:

f = open(sys.argv[1]+".output"+".30"+".stp", "r")
parse_trees_text=[]
tree = ""
for line in f:
  if line.isspace():
    parse_trees_text.append(tree)
tree = ""
  elif "(. ...))" in line:
#print "YES"
tree = tree+')'
parse_trees_text.append(tree)
tree = ""
  else:
tree = tree + line

parse_trees=[]
for t in parse_trees_text:
  tree = nltk.Tree(t)
  tree.__delitem__(len(tree)-1) #delete "(. .))" from tree (you don't need that)
  s = traverse(tree)
  parse_trees.append(tree)

0 讨论(0)

上一页 1 2 3

Stanford Parser and NLTK

Solution for Parser alone:

English

Chinese

Arabic

French

German

Spanish