Stanford Parser and NLTK

既然无缘 2020-11-22 01:32

Is it possible to use Stanford Parser in NLTK? (I am not talking about Stanford POS.)

18 answers
  • 2020-11-22 02:27

    A new development of the Stanford parser, based on a neural model trained with TensorFlow, was very recently made available as a Python API. This model is supposed to be far more accurate than the Java-based models. You can certainly integrate it with an NLTK pipeline.

    Link to the parser. The repository contains pre-trained parser models for 53 languages.
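
    For reference, a minimal usage sketch, assuming this answer refers to the stanfordnlp package (the names and the 'en' model below come from that package's documented quickstart, not from this answer):

    import stanfordnlp

    stanfordnlp.download('en')    # download the English models (run once)
    nlp = stanfordnlp.Pipeline()  # build the default English pipeline
    doc = nlp("Barack Obama was born in Hawaii.")
    doc.sentences[0].print_dependencies()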

  • 2020-11-22 02:30

    The Stanford Core NLP software page has a list of python wrappers:

    http://nlp.stanford.edu/software/corenlp.shtml#Extensions

  • 2020-11-22 02:30

    Note that this answer applies to NLTK v 3.0, and not to more recent versions.

    I cannot leave this as a comment because of reputation, but since I spent (wasted?) some time solving this I would rather share my problem/solution to get this parser to work in NLTK.

    In the excellent answer from alvas, it is mentioned that:

    e.g. for the Parser, there won't be a model directory.

    This led me wrongly to:

    • not be careful about the value I put in STANFORD_MODELS (and only care about my CLASSPATH)
    • leave ../path/to/stanford-parser-full-2015-12-09/models *virtually empty* (or with a jar file whose name did not match the nltk regex)!

    If the OP, like me, just wanted to use the parser, it may be confusing that, when not downloading anything else (no POS tagger, no NER, ...) and following all these instructions, we still get an error.

    Eventually, for any CLASSPATH given (following examples and explanations in answers from this thread) I would still get the error:

    NLTK was unable to find stanford-parser-(\d+)(.(\d+))+-models.jar! Set the CLASSPATH environment variable. For more information, on stanford-parser-(\d+)(.(\d+))+-models.jar,

    see: http://nlp.stanford.edu/software/lex-parser.shtml

    OR:

    NLTK was unable to find stanford-parser.jar! Set the CLASSPATH environment variable. For more information, on stanford-parser.jar, see: http://nlp.stanford.edu/software/lex-parser.shtml

    Though, importantly, I could correctly load and use the parser if I called the function with all arguments and paths fully specified, as in:

    from nltk.parse.stanford import StanfordParser

    stanford_parser_jar = '../lib/stanford-parser-full-2015-04-20/stanford-parser.jar'
    stanford_model_jar = '../lib/stanford-parser-full-2015-04-20/stanford-parser-3.5.2-models.jar'
    parser = StanfordParser(path_to_jar=stanford_parser_jar,
                            path_to_models_jar=stanford_model_jar)
    

    Solution for Parser alone:

    Therefore the error came from NLTK and how it looks for jars using the supplied STANFORD_MODELS and CLASSPATH environment variables. To solve this, a correctly named *-models.jar (matching the regex in the NLTK code, so no -corenlp-....jar) must be located in the folder designated by STANFORD_MODELS.

    Namely, I first created:

    mkdir stanford-parser-full-2015-12-09/models
    

    Then I added to .bashrc:

    export STANFORD_MODELS=/path/to/stanford-parser-full-2015-12-09/models
    

    And finally, by copying stanford-parser-3.6.0-models.jar (or the corresponding version) into:

    path/to/stanford-parser-full-2015-12-09/models/
    

    I could get StanfordParser to load smoothly in Python with the classic CLASSPATH that points to stanford-parser.jar. Actually, as such, you can call StanfordParser with no parameters; the defaults will just work.
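
    Putting it together, a minimal sketch of this setup done from Python (the /path/to prefix is a placeholder for wherever you unzipped the parser):

    import os
    from nltk.parse.stanford import StanfordParser

    # Placeholder paths -- point these at your actual unzip location.
    os.environ['CLASSPATH'] = '/path/to/stanford-parser-full-2015-12-09'
    os.environ['STANFORD_MODELS'] = '/path/to/stanford-parser-full-2015-12-09/models'

    # With both variables set, no constructor arguments are needed.
    parser = StanfordParser()
    print(next(parser.raw_parse('What is the airspeed of an unladen swallow ?')))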

  • 2020-11-22 02:32

    It took me many hours, but I finally found a simple solution for Windows users. It is basically a summarized version of an existing answer by alvas, made easy to follow (hopefully) for those who are new to Stanford NLP and are Windows users.

    1) Download the module you want to use, such as NER, POS etc. In my case I wanted to use NER, so I downloaded the module from http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip

    2) Unzip the file.

    3) Set the environment variables (CLASSPATH and STANFORD_MODELS) to point into the unzipped folder:

    import os
    os.environ['CLASSPATH'] = "C:/Users/Downloads/stanford-ner-2015-04-20/stanford-ner.jar"
    os.environ['STANFORD_MODELS'] = "C:/Users/Downloads/stanford-ner-2015-04-20/classifiers/"
    

    4) Set the environment variable for Java, i.e. where you have Java installed. For me it was:

    os.environ['JAVAHOME'] = "C:/Program Files/Java/jdk1.8.0_102/bin/java.exe"
    

    5) Import the module you want:

    from nltk.tag import StanfordNERTagger
    

    6) Call the pretrained model, which is in the classifiers folder of the unzipped download. Add ".gz" at the end for the file extension. The model I wanted to use was english.all.3class.distsim.crf.ser.gz:

    st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
    

    7) Now run the tagger, and we are done!

    st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
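
    For convenience, here are the same steps as one runnable sketch (the paths are this answer's examples and must match your own machine):

    import os

    # Paths from steps 3) and 4); adjust to your unzip and JDK locations.
    os.environ['CLASSPATH'] = "C:/Users/Downloads/stanford-ner-2015-04-20/stanford-ner.jar"
    os.environ['STANFORD_MODELS'] = "C:/Users/Downloads/stanford-ner-2015-04-20/classifiers/"
    os.environ['JAVAHOME'] = "C:/Program Files/Java/jdk1.8.0_102/bin/java.exe"

    from nltk.tag import StanfordNERTagger

    st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
    print(st.tag('Rami Eid is studying at Stony Brook University in NY'.split()))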
    
  • 2020-11-22 02:36

    As of NLTK v3.3, users should avoid the Stanford NER and POS taggers from nltk.tag, and avoid the Stanford tokenizer/segmenter from nltk.tokenize.

    Instead, use the new nltk.parse.corenlp.CoreNLPParser API.

    Please see https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK


    (To avoid a link-only answer, I've pasted the docs from the NLTK github wiki below.)

    First, update your NLTK

    pip3 install -U nltk # Make sure it's >= 3.3
    

    Then download the necessary CoreNLP packages:

    cd ~
    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
    unzip stanford-corenlp-full-2018-02-27.zip
    cd stanford-corenlp-full-2018-02-27
    
    # Get the Chinese model 
    wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2018-02-27-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties 
    
    # Get the Arabic model
    wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2018-02-27-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties 
    
    # Get the French model
    wget http://nlp.stanford.edu/software/stanford-french-corenlp-2018-02-27-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-french.properties 
    
    # Get the German model
    wget http://nlp.stanford.edu/software/stanford-german-corenlp-2018-02-27-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-german.properties 
    
    
    # Get the Spanish model
    wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2018-02-27-models.jar
    wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-spanish.properties 
    

    English

    Still in the stanford-corenlp-full-2018-02-27 directory, start the server:

    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -preload tokenize,ssplit,pos,lemma,ner,parse,depparse \
    -status_port 9000 -port 9000 -timeout 15000 & 
    

    Then in Python:

    >>> from nltk.parse import CoreNLPParser
    
    # Lexical Parser
    >>> parser = CoreNLPParser(url='http://localhost:9000')
    
    # Parse tokenized text.
    >>> list(parser.parse('What is the airspeed of an unladen swallow ?'.split()))
    [Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['airspeed'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['unladen'])])]), Tree('S', [Tree('VP', [Tree('VB', ['swallow'])])])])]), Tree('.', ['?'])])])]
    
    # Parse raw string.
    >>> list(parser.raw_parse('What is the airspeed of an unladen swallow ?'))
    [Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['airspeed'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['unladen'])])]), Tree('S', [Tree('VP', [Tree('VB', ['swallow'])])])])]), Tree('.', ['?'])])])]
    
    # Neural Dependency Parser
    >>> from nltk.parse.corenlp import CoreNLPDependencyParser
    >>> dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
    >>> parses = dep_parser.parse('What is the airspeed of an unladen swallow ?'.split())
    >>> [[(governor, dep, dependent) for governor, dep, dependent in parse.triples()] for parse in parses]
    [[(('What', 'WP'), 'cop', ('is', 'VBZ')), (('What', 'WP'), 'nsubj', ('airspeed', 'NN')), (('airspeed', 'NN'), 'det', ('the', 'DT')), (('airspeed', 'NN'), 'nmod', ('swallow', 'VB')), (('swallow', 'VB'), 'case', ('of', 'IN')), (('swallow', 'VB'), 'det', ('an', 'DT')), (('swallow', 'VB'), 'amod', ('unladen', 'JJ')), (('What', 'WP'), 'punct', ('?', '.'))]]
    
    
    # Tokenizer
    >>> parser = CoreNLPParser(url='http://localhost:9000')
    >>> list(parser.tokenize('What is the airspeed of an unladen swallow?'))
    ['What', 'is', 'the', 'airspeed', 'of', 'an', 'unladen', 'swallow', '?']
    
    # POS Tagger
    >>> pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
    >>> list(pos_tagger.tag('What is the airspeed of an unladen swallow ?'.split()))
    [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
    
    # NER Tagger
    >>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
    >>> list(ner_tagger.tag(('Rami Eid is studying at Stony Brook University in NY'.split())))
    [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'STATE_OR_PROVINCE')]
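
    Alternatively, NLTK can manage the server process for you via nltk.parse.corenlp.CoreNLPServer. A minimal sketch, assuming the jar names shipped in the 2018-02-27 release and that you run it from the unzipped directory:

    from nltk.parse.corenlp import CoreNLPServer, CoreNLPParser

    # Jar names as shipped in stanford-corenlp-full-2018-02-27 (CoreNLP 3.9.1).
    server = CoreNLPServer(
        path_to_jar='stanford-corenlp-3.9.1.jar',
        path_to_models_jar='stanford-corenlp-3.9.1-models.jar',
    )
    server.start()
    try:
        parser = CoreNLPParser(url=server.url)
        print(next(parser.raw_parse('What is the airspeed of an unladen swallow ?')))
    finally:
        server.stop()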
    

    Chinese

    Start the server a little differently, still from the stanford-corenlp-full-2018-02-27 directory:

    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-chinese.properties \
    -preload tokenize,ssplit,pos,lemma,ner,parse \
    -status_port 9001  -port 9001 -timeout 15000
    

    In Python:

    >>> parser = CoreNLPParser('http://localhost:9001')
    >>> list(parser.tokenize(u'我家没有电脑。'))
    ['我家', '没有', '电脑', '。']
    
    >>> list(parser.parse(parser.tokenize(u'我家没有电脑。')))
    [Tree('ROOT', [Tree('IP', [Tree('IP', [Tree('NP', [Tree('NN', ['我家'])]), Tree('VP', [Tree('VE', ['没有']), Tree('NP', [Tree('NN', ['电脑'])])])]), Tree('PU', ['。'])])])]
    

    Arabic

    Start the server:

    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-arabic.properties \
    -preload tokenize,ssplit,pos,parse \
    -status_port 9005  -port 9005 -timeout 15000
    

    In Python:

    >>> from nltk.parse import CoreNLPParser
    >>> parser = CoreNLPParser('http://localhost:9005')
    >>> text = u'انا حامل'
    
    # Parser.
    >>> parser.raw_parse(text)
    <list_iterator object at 0x7f0d894c9940>
    >>> list(parser.raw_parse(text))
    [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['انا'])]), Tree('NP', [Tree('NN', ['حامل'])])])])]
    >>> list(parser.parse(parser.tokenize(text)))
    [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['انا'])]), Tree('NP', [Tree('NN', ['حامل'])])])])]
    
    # Tokenizer / Segmenter.
    >>> list(parser.tokenize(text))
    ['انا', 'حامل']
    
    # POS tagger
    >>> pos_tagger = CoreNLPParser('http://localhost:9005', tagtype='pos')
    >>> list(pos_tagger.tag(parser.tokenize(text)))
    [('انا', 'PRP'), ('حامل', 'NN')]
    
    
    # NER tagger
    >>> ner_tagger = CoreNLPParser('http://localhost:9005', tagtype='ner')
    >>> list(ner_tagger.tag(parser.tokenize(text)))
    [('انا', 'O'), ('حامل', 'O')]
    

    French

    Start the server:

    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-french.properties \
    -preload tokenize,ssplit,pos,parse \
    -status_port 9004  -port 9004 -timeout 15000
    

    In Python:

    >>> parser = CoreNLPParser('http://localhost:9004')
    >>> list(parser.parse('Je suis enceinte'.split()))
    [Tree('ROOT', [Tree('SENT', [Tree('NP', [Tree('PRON', ['Je']), Tree('VERB', ['suis']), Tree('AP', [Tree('ADJ', ['enceinte'])])])])])]
    >>> pos_tagger = CoreNLPParser('http://localhost:9004', tagtype='pos')
    >>> pos_tagger.tag('Je suis enceinte'.split())
    [('Je', 'PRON'), ('suis', 'VERB'), ('enceinte', 'ADJ')]
    

    German

    Start the server:

    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-german.properties \
    -preload tokenize,ssplit,pos,ner,parse \
    -status_port 9002  -port 9002 -timeout 15000
    

    In Python:

    >>> parser = CoreNLPParser('http://localhost:9002')
    >>> list(parser.raw_parse('Ich bin schwanger'))
    [Tree('ROOT', [Tree('NUR', [Tree('S', [Tree('PPER', ['Ich']), Tree('VAFIN', ['bin']), Tree('AP', [Tree('ADJD', ['schwanger'])])])])])]
    >>> list(parser.parse('Ich bin schwanger'.split()))
    [Tree('ROOT', [Tree('NUR', [Tree('S', [Tree('PPER', ['Ich']), Tree('VAFIN', ['bin']), Tree('AP', [Tree('ADJD', ['schwanger'])])])])])]
    
    
    >>> pos_tagger = CoreNLPParser('http://localhost:9002', tagtype='pos')
    >>> pos_tagger.tag('Ich bin schwanger'.split())
    [('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD')]
    
    >>> ner_tagger = CoreNLPParser('http://localhost:9002', tagtype='ner')
    >>> ner_tagger.tag('Donald Trump besuchte Angela Merkel in Berlin.'.split())
    [('Donald', 'PERSON'), ('Trump', 'PERSON'), ('besuchte', 'O'), ('Angela', 'PERSON'), ('Merkel', 'PERSON'), ('in', 'O'), ('Berlin', 'LOCATION'), ('.', 'O')]
    

    Spanish

    Start the server:

    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties StanfordCoreNLP-spanish.properties \
    -preload tokenize,ssplit,pos,ner,parse \
    -status_port 9003  -port 9003 -timeout 15000
    

    In Python:

    >>> pos_tagger = CoreNLPParser('http://localhost:9003', tagtype='pos')
    >>> pos_tagger.tag(u'Barack Obama salió con Michael Jackson .'.split())
    [('Barack', 'PROPN'), ('Obama', 'PROPN'), ('salió', 'VERB'), ('con', 'ADP'), ('Michael', 'PROPN'), ('Jackson', 'PROPN'), ('.', 'PUNCT')]
    >>> ner_tagger = CoreNLPParser('http://localhost:9003', tagtype='ner')
    >>> ner_tagger.tag(u'Barack Obama salió con Michael Jackson .'.split())
    [('Barack', 'PERSON'), ('Obama', 'PERSON'), ('salió', 'O'), ('con', 'O'), ('Michael', 'PERSON'), ('Jackson', 'PERSON'), ('.', 'O')]
    
  • 2020-11-22 02:37

    You can use the Stanford Parser's output to create a Tree in nltk (nltk.tree.Tree).

    Assuming the Stanford parser gives you a file in which there is exactly one parse tree for every sentence, this example works, though it might not look very pythonic:

    import sys
    import nltk

    f = open(sys.argv[1] + ".output" + ".30" + ".stp", "r")
    parse_trees_text = []
    tree = ""
    for line in f:
        if line.isspace():
            parse_trees_text.append(tree)
            tree = ""
        elif "(. ...))" in line:
            # print("YES")
            tree = tree + ')'
            parse_trees_text.append(tree)
            tree = ""
        else:
            tree = tree + line

    parse_trees = []
    for t in parse_trees_text:
        tree = nltk.Tree.fromstring(t)  # parse the bracketed string into an nltk Tree
        del tree[-1]  # delete "(. .))" from the tree (you don't need that)
        # s = traverse(tree)  # 'traverse' was a user-defined helper, not shown here
        parse_trees.append(tree)
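
    As a quick sanity check (a minimal sketch with a made-up bracketing, not output from this answer's file), you can confirm that a Stanford-style bracketed parse loads as an nltk.Tree:

    import nltk

    # Hypothetical bracketed parse string, just to show the round-trip.
    t = nltk.Tree.fromstring("(ROOT (S (NP (PRP I)) (VP (VBP love) (NP (NN python)))))")
    t.pretty_print()  # renders the tree as ASCII art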
    