How do I do dependency parsing in NLTK?

醉梦人生 2020-11-29 20:53

Going through the NLTK book, it's not clear how to generate a dependency tree from a given sentence.

The relevant section of the book: sub-chapter on dependency gra

7 answers
  • 2020-11-29 21:04

    From the Stanford Parser documentation: "the dependencies can be obtained using our software [...] on phrase-structure trees using the EnglishGrammaticalStructure class available in the parser package." http://nlp.stanford.edu/software/stanford-dependencies.shtml

    The dependencies manual also mentions: "Or our conversion tool can convert the output of other constituency parsers to the Stanford Dependencies representation." http://nlp.stanford.edu/software/dependencies_manual.pdf

    Neither functionality seems to be implemented in NLTK currently.

  • 2020-11-29 21:07

    To use the Stanford parser from NLTK:

    1) Run the CoreNLP server on localhost
    Download Stanford CoreNLP (and also the model file for your language), then start the server with the following command:

    # Run the server using all jars in the current directory (e.g., the CoreNLP home directory)
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
    

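    or from Python (set the CORENLP_HOME environment variable first). Note that CoreNLPClient appears to come from the separate Stanford corenlp Python wrapper rather than from NLTK itself:

    import os
    import corenlp  # assumption: the stanford-corenlp Python wrapper, not NLTK

    os.environ["CORENLP_HOME"] = "dir"  # path to the CoreNLP directory
    client = corenlp.CoreNLPClient()
    # do something
    client.stop()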
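
The java command from step 1 can also be launched from a Python script. This is only a minimal sketch using the standard library: the helper names `corenlp_server_command` and `start_corenlp_server` are made up for illustration, and it assumes `java` is on the PATH and that `corenlp_home` points at the directory holding the CoreNLP jars.

```python
import subprocess
from pathlib import Path


def corenlp_server_command(port=9000, timeout_ms=15000, memory="4g"):
    """Build the argument list for the java command shown in step 1."""
    return [
        "java", f"-mx{memory}",
        "-cp", "*",  # java expands the classpath wildcard itself; no shell needed
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
        "-port", str(port),
        "-timeout", str(timeout_ms),
    ]


def start_corenlp_server(corenlp_home, **options):
    """Start the server as a child process; the caller must terminate it."""
    # cwd must be the CoreNLP directory so that "*" matches its jars
    return subprocess.Popen(corenlp_server_command(**options), cwd=Path(corenlp_home))


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe to try anywhere
    print(" ".join(corenlp_server_command()))
```

Calling `start_corenlp_server("path_to/corenlp")` returns a `Popen` object; call its `terminate()` method when you are done with the parser.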
    

    2) Call the dependency parser from NLTK

    >>> from nltk.parse.corenlp import CoreNLPDependencyParser
    >>> dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
    >>> parse, = dep_parser.raw_parse(
    ...     'The quick brown fox jumps over the lazy dog.'
    ... )
    >>> print(parse.to_conll(4))  
    The     DT      4       det
    quick   JJ      4       amod
    brown   JJ      4       amod
    fox     NN      5       nsubj
    jumps   VBZ     0       ROOT
    over    IN      9       case
    the     DT      9       det
    lazy    JJ      9       amod
    dog     NN      5       nmod
    .       .       5       punct
    

    See the detailed NLTK documentation, and also the question "NLTK CoreNLPDependencyParser: Failed to establish connection".
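
To make the four-column output above easier to work with, the rows can be turned into (head, relation, dependent) triples. The following is a rough sketch in plain Python, not part of NLTK: the function name `conll4_to_triples` is hypothetical, and it assumes each row is word, tag, 1-based head index, relation, with head index 0 standing for the artificial root.

```python
def conll4_to_triples(conll_text):
    """Turn 4-column CoNLL rows (word, tag, head index, relation) into triples."""
    rows = [line.split() for line in conll_text.splitlines() if line.strip()]
    words = ["ROOT"] + [row[0] for row in rows]  # index 0 is the artificial root
    return [
        (words[int(head)], relation, word)
        for word, _tag, head, relation in rows
    ]


# The rows printed by parse.to_conll(4) in the example above
sample = """\
The     DT      4       det
quick   JJ      4       amod
brown   JJ      4       amod
fox     NN      5       nsubj
jumps   VBZ     0       ROOT
over    IN      9       case
the     DT      9       det
lazy    JJ      9       amod
dog     NN      5       nmod
.       .       5       punct
"""

for head, relation, dependent in conll4_to_triples(sample):
    print(f"{relation}({head}, {dependent})")
```

Reading the head column this way shows, for instance, that "The" attaches to word 4 ("fox") and that "jumps" is the root of the sentence.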

  • 2020-11-29 21:11

    If you want to be serious about dependency parsing, don't use NLTK: all the algorithms are dated and slow. Try something like this: https://spacy.io/

  • 2020-11-29 21:13

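    If you need better performance, then spaCy (https://spacy.io/) is the best choice. Usage is very simple:

    import spacy

    # spaCy 3 removed the 'en' shortcut, so load a named model instead
    # (install it with: python -m spacy download en_core_web_sm)
    nlp = spacy.load('en_core_web_sm')
    doc = nlp('A woman is walking through the door.')
    

    The returned Doc holds the dependency tree: each token exposes its relation (token.dep_) and its head (token.head), so you can easily extract whatever information you need. You can also define your own custom pipelines. See more on their website.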

    https://spacy.io/docs/usage/

  • 2020-11-29 21:15

    We can use Stanford Parser from NLTK.

    Requirements

    You need to download two things from their website:

    1. The Stanford CoreNLP parser.
    2. The language model for your desired language (e.g. the English language model)

    Warning!

    Make sure that your language model version matches your Stanford CoreNLP parser version!

    The current CoreNLP version as of May 22, 2018 is 3.9.1.

    After downloading the two files, extract the zip file anywhere you like.

    Python Code

    Next, load the model and use it through NLTK:

    # Note: recent NLTK releases deprecate this class in favor of
    # nltk.parse.corenlp.CoreNLPDependencyParser
    from nltk.parse.stanford import StanfordDependencyParser
    
    path_to_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser.jar'
    path_to_models_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser-3.4.1-models.jar'
    
    dependency_parser = StanfordDependencyParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models_jar)
    
    # raw_parse returns an iterator of DependencyGraph objects
    result = dependency_parser.raw_parse('I shot an elephant in my sleep')
    dep = next(result)  # works on Python 2 and 3; result.next() is Python 2 only
    
    list(dep.triples())
    

    Output

    The output of the last line is:

    [((u'shot', u'VBD'), u'nsubj', (u'I', u'PRP')),
     ((u'shot', u'VBD'), u'dobj', (u'elephant', u'NN')),
     ((u'elephant', u'NN'), u'det', (u'an', u'DT')),
     ((u'shot', u'VBD'), u'prep', (u'in', u'IN')),
     ((u'in', u'IN'), u'pobj', (u'sleep', u'NN')),
     ((u'sleep', u'NN'), u'poss', (u'my', u'PRP$'))]
    

    I think this is what you want.
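
The triples listed above are flat, so a tree has to be rebuilt from them if you want one. Here is a small sketch in plain Python that groups the triples by head and prints the result with indentation. The helper names are invented for this example, and it keys nodes by word text only, so it assumes no word occurs twice in the sentence.

```python
from collections import defaultdict

# The triples from the output above, written as Python 3 strings
triples = [
    (("shot", "VBD"), "nsubj", ("I", "PRP")),
    (("shot", "VBD"), "dobj", ("elephant", "NN")),
    (("elephant", "NN"), "det", ("an", "DT")),
    (("shot", "VBD"), "prep", ("in", "IN")),
    (("in", "IN"), "pobj", ("sleep", "NN")),
    (("sleep", "NN"), "poss", ("my", "PRP$")),
]


def children_by_head(triples):
    """Map each head word to its (relation, dependent word) pairs."""
    children = defaultdict(list)
    for (head, _head_tag), relation, (dependent, _dep_tag) in triples:
        children[head].append((relation, dependent))
    return children


def find_root(triples):
    """The root is the only head that never appears as a dependent."""
    heads = {head for (head, _), _, _ in triples}
    dependents = {dep for _, _, (dep, _) in triples}
    (root,) = heads - dependents
    return root


def render(word, children, depth=0):
    """Return the subtree under `word` as indented lines."""
    lines = []
    for relation, dependent in children.get(word, []):
        lines.append("  " * (depth + 1) + f"{relation}: {dependent}")
        lines.extend(render(dependent, children, depth + 1))
    return lines


children = children_by_head(triples)
root = find_root(triples)
print(root)
print("\n".join(render(root, children)))
```

For real sentences with repeated words, the nodes would need a position index as well as the word text.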

  • 2020-11-29 21:15

    A little late to the party, but I wanted to add some example code with spaCy that gets you your desired output:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I shot an elephant in my sleep")
    for token in doc:
        # relation(head-position, token-position), with 1-based positions
        print(f"{token.dep_}({token.head.text}-{token.head.i + 1}, {token.text}-{token.i + 1})")
    

    And here's the output, very similar to your desired output:

    nsubj(shot-2, I-1)
    ROOT(shot-2, shot-2)
    det(elephant-4, an-3)
    dobj(shot-2, elephant-4)
    prep(shot-2, in-5)
    poss(sleep-7, my-6)
    pobj(in-5, sleep-7)
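
If you need to read lines in this relation(head-position, word-position) format back into data, a regular expression is usually enough. This is just an illustrative sketch: `parse_relation` is a made-up helper, and it assumes the exact layout printed above, including the single space after the comma.

```python
import re

# Matches lines such as "nsubj(shot-2, I-1)"
RELATION = re.compile(r"(\w+)\((.+)-(\d+), (.+)-(\d+)\)")


def parse_relation(line):
    """Split 'rel(head-i, dep-j)' into (rel, (head, i), (dep, j))."""
    match = RELATION.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a relation line: {line!r}")
    rel, head, head_i, dep, dep_i = match.groups()
    return rel, (head, int(head_i)), (dep, int(dep_i))


output = """\
nsubj(shot-2, I-1)
ROOT(shot-2, shot-2)
det(elephant-4, an-3)
dobj(shot-2, elephant-4)
prep(shot-2, in-5)
poss(sleep-7, my-6)
pobj(in-5, sleep-7)
"""

relations = [parse_relation(line) for line in output.splitlines()]
# In this output the root is the one token listed as its own head
roots = [dep for _rel, head, dep in relations if head == dep]
print(roots)
```

Because the positions are kept alongside the words, this representation stays unambiguous even when a word appears more than once.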
    

    Hope that helps!
