Stanford Parser and NLTK

Asked by 既然无缘 on 2020-11-22 01:32

Is it possible to use Stanford Parser in NLTK? (I am not talking about Stanford POS.)

18 Answers
  • 2020-11-22 02:12

    Deprecated Answer

    The answer below is deprecated; please use the solution at https://stackoverflow.com/a/51981566/610569 for NLTK v3.3 and above.
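
    For quick reference, here is a minimal sketch of that newer interface (an illustration only, assuming NLTK v3.3+ and a CoreNLP server already running on localhost:9000, as started in the terminal step below):

    >>> from nltk.parse import CoreNLPParser

    >>> # Constituency parsing: CoreNLPParser talks to the running server over HTTP
    >>> parser = CoreNLPParser(url='http://localhost:9000')
    >>> next(parser.parse('What is the airspeed of an unladen swallow ?'.split())).pretty_print()

    >>> # The same class doubles as a POS tagger via the tagtype argument
    >>> pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
    >>> pos_tagger.tag('What is the airspeed of an unladen swallow ?'.split())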


    EDITED

    Note: The following answer will only work on:

    • NLTK version ==3.2.5
    • Stanford Tools compiled since 2016-10-31
    • Python 2.7, 3.5 and 3.6

    Both tools change rather quickly, and the API might look very different 3-6 months later. Please treat the following answer as a temporary fix, not an eternal one.

    Always refer to https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software for the latest instructions on how to interface the Stanford NLP tools with NLTK!

    TL;DR

    The following code comes from https://github.com/nltk/nltk/pull/1735#issuecomment-306091826

    In terminal:

    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
    unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
    
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -preload tokenize,ssplit,pos,lemma,parse,depparse \
    -status_port 9000 -port 9000 -timeout 15000
    

    In Python:

    >>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger
    >>> from nltk.parse.corenlp import CoreNLPParser
    
    >>> stpos, stner = CoreNLPPOSTagger(), CoreNLPNERTagger()
    
    >>> stpos.tag('What is the airspeed of an unladen swallow ?'.split())
    [(u'What', u'WP'), (u'is', u'VBZ'), (u'the', u'DT'), (u'airspeed', u'NN'), (u'of', u'IN'), (u'an', u'DT'), (u'unladen', u'JJ'), (u'swallow', u'VB'), (u'?', u'.')]
    
    >>> stner.tag('Rami Eid is studying at Stony Brook University in NY'.split())
    [(u'Rami', u'PERSON'), (u'Eid', u'PERSON'), (u'is', u'O'), (u'studying', u'O'), (u'at', u'O'), (u'Stony', u'ORGANIZATION'), (u'Brook', u'ORGANIZATION'), (u'University', u'ORGANIZATION'), (u'in', u'O'), (u'NY', u'O')]
    
    
    >>> parser = CoreNLPParser(url='http://localhost:9000')
    
    >>> next(
    ...     parser.raw_parse('The quick brown fox jumps over the lazy dog.')
    ... ).pretty_print()  # doctest: +NORMALIZE_WHITESPACE
                         ROOT
                          |
                          S
           _______________|__________________________
          |                         VP               |
          |                _________|___             |
          |               |             PP           |
          |               |     ________|___         |
          NP              |    |            NP       |
      ____|__________     |    |     _______|____    |
     DT   JJ    JJ   NN  VBZ   IN   DT      JJ   NN  .
     |    |     |    |    |    |    |       |    |   |
    The quick brown fox jumps over the     lazy dog  .
    
    >>> (parse_fox, ), (parse_wolf, ) = parser.raw_parse_sents(
    ...     [
    ...         'The quick brown fox jumps over the lazy dog.',
    ...         'The quick grey wolf jumps over the lazy fox.',
    ...     ]
    ... )
    
    >>> parse_fox.pretty_print()  # doctest: +NORMALIZE_WHITESPACE
                         ROOT
                          |
                          S
           _______________|__________________________
          |                         VP               |
          |                _________|___             |
          |               |             PP           |
          |               |     ________|___         |
          NP              |    |            NP       |
      ____|__________     |    |     _______|____    |
     DT   JJ    JJ   NN  VBZ   IN   DT      JJ   NN  .
     |    |     |    |    |    |    |       |    |   |
    The quick brown fox jumps over the     lazy dog  .
    
    >>> parse_wolf.pretty_print()  # doctest: +NORMALIZE_WHITESPACE
                         ROOT
                          |
                          S
           _______________|__________________________
          |                         VP               |
          |                _________|___             |
          |               |             PP           |
          |               |     ________|___         |
          NP              |    |            NP       |
      ____|_________      |    |     _______|____    |
     DT   JJ   JJ   NN   VBZ   IN   DT      JJ   NN  .
     |    |    |    |     |    |    |       |    |   |
    The quick grey wolf jumps over the     lazy fox  .
    
    >>> (parse_dog, ), (parse_friends, ) = parser.parse_sents(
    ...     [
    ...         "I 'm a dog".split(),
    ...         "This is my friends ' cat ( the tabby )".split(),
    ...     ]
    ... )
    
    >>> parse_dog.pretty_print()  # doctest: +NORMALIZE_WHITESPACE
            ROOT
             |
             S
      _______|____
     |            VP
     |    ________|___
     NP  |            NP
     |   |         ___|___
    PRP VBP       DT      NN
     |   |        |       |
     I   'm       a      dog
    

    Please take a look at http://www.nltk.org/_modules/nltk/parse/corenlp.html for more information on the Stanford API, and take a look at the docstrings!

  • 2020-11-22 02:13

    Deprecated Answer

    The answer below is deprecated; please use the solution at https://stackoverflow.com/a/51981566/610569 for NLTK v3.3 and above.


    Edited

    As of the current Stanford parser (2015-04-20), the default output of lexparser.sh has changed, so the script below will not work.

    This answer is kept for legacy's sake; it will still work with http://nlp.stanford.edu/software/stanford-parser-2012-11-12.zip.


    Original Answer

    I suggest you don't mess with Jython or JPype. Let Python do Python stuff and let Java do Java stuff: get the Stanford Parser output through the console.

    After you've installed the Stanford Parser in your home directory ~/, just use this Python recipe to get the flat bracketed parse:

    import os
    sentence = "this is a foo bar i want to parse."
    
    # Write the sentence to a temp file, then run lexparser.sh over it
    os.popen("echo '" + sentence + "' > ~/stanfordtemp.txt")
    parser_out = os.popen("~/stanford-parser-2012-11-12/lexparser.sh ~/stanfordtemp.txt").readlines()
    
    # Keep only the bracketed parse lines (they start with "(");
    # startswith() also avoids an IndexError on blank output lines
    bracketed_parse = " ".join(i.strip() for i in parser_out if i.strip().startswith("("))
    print(bracketed_parse)
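
    On Python 3.7+ you might prefer subprocess over the legacy os.popen; here is a rough equivalent of the same recipe (a sketch only, assuming the same stanford-parser-2012-11-12 layout in your home directory):

    import subprocess
    from pathlib import Path
    
    sentence = "this is a foo bar i want to parse."
    
    # Write the sentence to a temp file, then run lexparser.sh over it
    tmp = Path.home() / "stanfordtemp.txt"
    tmp.write_text(sentence + "\n")
    result = subprocess.run(
        [str(Path.home() / "stanford-parser-2012-11-12" / "lexparser.sh"), str(tmp)],
        capture_output=True, text=True,
    )
    
    # Keep only the bracketed parse lines (they start with "(")
    bracketed_parse = " ".join(
        line.strip() for line in result.stdout.splitlines()
        if line.strip().startswith("(")
    )
    print(bracketed_parse)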
    
  • 2020-11-22 02:13

    If I remember correctly, the Stanford parser is a Java library, so you must have a Java interpreter running on your server/computer.

    I used it once on a server, combined with a PHP script. The script used PHP's exec() function to make a command-line call to the parser, like so:

    <?php
    
    exec( "java -cp /pathTo/stanford-parser.jar -mx100m edu.stanford.nlp.process.DocumentPreprocessor /pathTo/fileToParse > /pathTo/resultFile 2>/dev/null" );
    
    ?>
    

    I don't remember all the details of this command; it basically opened fileToParse, parsed it, and wrote the output to resultFile. PHP would then open the result file for further use.

    The end of the command redirects the parser's verbose output to /dev/null, to prevent unnecessary command-line information from disturbing the script.

    I don't know much about Python, but there might be a way to make command-line calls.
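
    There is. For illustration, here is a rough Python equivalent of that PHP call, using the subprocess module (a sketch only; the /pathTo/ paths are placeholders, as in the PHP version):

    import subprocess
    
    # Same command as the PHP exec() call above: run DocumentPreprocessor over an
    # input file, write the output to a result file, and discard the verbose stderr
    with open("/pathTo/resultFile", "w") as out:
        subprocess.run(
            ["java", "-cp", "/pathTo/stanford-parser.jar", "-mx100m",
             "edu.stanford.nlp.process.DocumentPreprocessor", "/pathTo/fileToParse"],
            stdout=out, stderr=subprocess.DEVNULL,
        )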

    It might not be the exact route you were hoping for, but hopefully it'll give you some inspiration. Best of luck.

  • 2020-11-22 02:15

    Note that this answer applies to NLTK v 3.0, and not to more recent versions.

    Sure, try the following in Python:

    import os
    from nltk.parse import stanford
    
    # Point NLTK at the folder holding stanford-parser.jar and the models jar
    os.environ['STANFORD_PARSER'] = '/path/to/stanford/jars'
    os.environ['STANFORD_MODELS'] = '/path/to/stanford/jars'
    
    parser = stanford.StanfordParser(model_path="/location/of/the/englishPCFG.ser.gz")
    sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?"))
    print(sentences)
    
    # GUI
    for line in sentences:
        for sentence in line:
            sentence.draw()
    

    Output:

    [Tree('ROOT', [Tree('S', [Tree('INTJ', [Tree('UH', ['Hello'])]), Tree(',', [',']), Tree('NP', [Tree('PRP$', ['My']), Tree('NN', ['name'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('ADJP', [Tree('JJ', ['Melroy'])])]), Tree('.', ['.'])])]), Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('PRP$', ['your']), Tree('NN', ['name'])])]), Tree('.', ['?'])])])]

    Note 1: In this example both the parser & model jars are in the same folder.

    Note 2:

    • File name of stanford parser is: stanford-parser.jar
    • File name of stanford models is: stanford-parser-x.x.x-models.jar

    Note 3: The englishPCFG.ser.gz file can be found inside the models.jar file (/edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz). Please use some archive manager to 'unzip' the models.jar file, or see the zipfile sketch after these notes.

    Note 4: Be sure you are using Java JRE (Runtime Environment) 1.8, also known as Oracle JDK 8. Otherwise you will get: Unsupported major.minor version 52.0.
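
    Regarding Note 3, the model can also be pulled out of the jar programmatically rather than with an archive manager. A minimal sketch (a .jar is just a zip archive; the file names and the extraction path are placeholders):

    import zipfile
    
    # A .jar is a zip archive, so the bundled model can be extracted directly
    with zipfile.ZipFile('stanford-parser-3.x.x-models.jar') as jar:
        jar.extract('edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz',
                    path='/location/of/the/')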

    Installation

    1. Download NLTK v3 from https://github.com/nltk/nltk and install it:

      sudo python setup.py install

    2. You can use the NLTK downloader to get the Stanford Parser, using Python:

      import nltk
      nltk.download()
      
    3. Try my example! (Don't forget to change the jar paths and the model path to the ser.gz location.)

    OR:

    1. Download and install NLTK v3, same as above.

    2. Download the latest version (the current filename is stanford-parser-full-2015-01-29.zip) from: http://nlp.stanford.edu/software/lex-parser.shtml#Download

    3. Extract the stanford-parser-full-20xx-xx-xx.zip.

    4. Create a new folder ('jars' in my example). Place the extracted files stanford-parser-3.x.x-models.jar and stanford-parser.jar into this 'jars' folder.

      As shown above, you can use the environment variables (STANFORD_PARSER & STANFORD_MODELS) to point to this 'jars' folder. I'm using Linux, so if you use Windows please use something like: C://folder//jars.

    5. Open the stanford-parser-3.x.x-models.jar using an Archive manager (7zip).

    6. Browse inside the jar file to edu/stanford/nlp/models/lexparser and extract the file called 'englishPCFG.ser.gz'. Remember the location where you extracted this ser.gz file.

    7. When creating a StanfordParser instance, you can provide the model path as a parameter. This is the complete path to the model, in our case /location/of/englishPCFG.ser.gz.

    8. Try my example! (Don't forget to change the jar paths and the model path to the ser.gz location.)

  • 2020-11-22 02:16

    Note that this answer applies to NLTK v 3.0, and not to more recent versions.

    Here is an adaptation of danger98's code that works with nltk 3.0.0 on Windows, and presumably on other platforms as well; adjust directory names as appropriate for your setup:

    import os
    from nltk.parse import stanford
    os.environ['STANFORD_PARSER'] = 'd:/stanford-parser'
    os.environ['STANFORD_MODELS'] = 'd:/stanford-parser'
    os.environ['JAVAHOME'] = 'c:/Program Files/java/jre7/bin'
    
    parser = stanford.StanfordParser(model_path="d:/stanford-grammars/englishPCFG.ser.gz")
    sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?"))
    print(sentences)
    

    Note that the parsing command has changed (see the source code at www.nltk.org/_modules/nltk/parse/stanford.html), and that you need to define the JAVAHOME variable. I tried to get it to read the grammar file in situ in the jar, but have so far failed to do that.

  • 2020-11-22 02:16

    I am using NLTK version 3.2.4, and the following code worked for me.

    from nltk.internals import find_jars_within_path
    from nltk.tag import StanfordPOSTagger
    from nltk import word_tokenize
    
    # Alternatively to setting the CLASSPATH, add the jar and model via their path:
    jar = '/home/ubuntu/stanford-postagger-full-2017-06-09/stanford-postagger.jar'
    model = '/home/ubuntu/stanford-postagger-full-2017-06-09/models/english-left3words-distsim.tagger'
    
    pos_tagger = StanfordPOSTagger(model, jar)
    
    # Add other jars from Stanford directory
    stanford_dir = pos_tagger._stanford_jar.rpartition('/')[0]
    stanford_jars = find_jars_within_path(stanford_dir)
    pos_tagger._stanford_jar = ':'.join(stanford_jars)
    
    text = pos_tagger.tag(word_tokenize("Open app and play movie"))
    print(text)
    

    Output:

    [('Open', 'VB'), ('app', 'NN'), ('and', 'CC'), ('play', 'VB'), ('movie', 'NN')]
    