How to find the shortest dependency path between two words in Python?


I am trying to find the dependency path between two words in Python, given a dependency tree.

For the sentence:

Robots in popular culture are there to remind us of the awesomeness of unbound human agency.

3 Answers
  • 2021-02-07 07:51

    HugoMailhot's answer is great. I'll write something similar for spacy users who want to find the shortest dependency path between two words (whereas HugoMailhot's answer relies on practNLPTools).

    The sentence:

    Robots in popular culture are there to remind us of the awesomeness of unbound human agency.

    has the following dependency tree:

    Here is the code to find the shortest dependency path between two words:

    import networkx as nx
    import spacy
    nlp = spacy.load('en')
    
    # https://spacy.io/docs/usage/processing-text
    document = nlp(u'Robots in popular culture are there to remind us of the awesomeness of unbound human agency.', parse=True)
    
    print('document: {0}'.format(document))
    
    # Load spacy's dependency tree into a networkx graph
    edges = []
    for token in document:
        # FYI https://spacy.io/docs/api/token
        for child in token.children:
            edges.append(('{0}-{1}'.format(token.lower_,token.i),
                          '{0}-{1}'.format(child.lower_,child.i)))
    
    graph = nx.Graph(edges)
    
    # https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.shortest_paths.html
    print(nx.shortest_path_length(graph, source='robots-0', target='awesomeness-11'))
    print(nx.shortest_path(graph, source='robots-0', target='awesomeness-11'))
    print(nx.shortest_path(graph, source='robots-0', target='agency-15'))
    

    Output:

    4
    ['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11']
    ['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11', 'of-12', 'agency-15']
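
    If you also want the dependency labels along the path, a small variation (a sketch, not part of the original answer) is to store each child token's dep_ attribute as edge data and read it back off the shortest path:

    # Same traversal as above, but keep the dependency label on each edge.
    # child.dep_ is the relation between the child and its head (token).
    labeled_edges = []
    for token in document:
        for child in token.children:
            labeled_edges.append(('{0}-{1}'.format(token.lower_, token.i),
                                  '{0}-{1}'.format(child.lower_, child.i),
                                  {'dep': child.dep_}))

    labeled_graph = nx.Graph()
    labeled_graph.add_edges_from(labeled_edges)

    path = nx.shortest_path(labeled_graph, source='robots-0', target='awesomeness-11')
    for source, target in zip(path, path[1:]):
        print('{0} -> {1} ({2})'.format(source, target, labeled_graph[source][target]['dep']))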
    

    To install spacy and networkx:

    sudo pip install networkx 
    sudo pip install spacy
    sudo python -m spacy.en.download parser # will take 0.5 GB
    

    Some benchmarks regarding spacy's dependency parsing: https://spacy.io/docs/api/
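
    Note that the install commands and the 'en' model name above target spacy 1.x. In spacy 2.x/3.x the model handling changed; a minimal sketch of the equivalent setup (assuming the small English model en_core_web_sm) is:

    pip install networkx spacy
    python -m spacy download en_core_web_sm

    and in the script:

    import spacy
    nlp = spacy.load('en_core_web_sm')
    # The dependency parser runs by default; no parse=True argument is needed.
    document = nlp('Robots in popular culture are there to remind us of the awesomeness of unbound human agency.')

    The rest of the code (building the networkx graph from token.children) works unchanged.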

  • 2021-02-07 07:55

    This answer relies on Stanford CoreNLP to obtain the dependency tree of a sentence, and borrows a fair amount of the networkx code from HugoMailhot's answer.

    Before running the code, one needs to:

    1. sudo pip install pycorenlp (python interface for Stanford CoreNLP)
    2. Download Stanford CoreNLP
    3. Start a Stanford CoreNLP server as follows:

      java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 50000
      

    Then one can run the following code to find the shortest dependency path between two words:

    import networkx as nx
    from pycorenlp import StanfordCoreNLP
    from pprint import pprint
    
    nlp = StanfordCoreNLP('http://localhost:{0}'.format(9000))
    def get_stanford_annotations(text, port=9000,
                                 annotators='tokenize,ssplit,pos,lemma,depparse,parse'):
        output = nlp.annotate(text, properties={
            "timeout": "10000",
            "ssplit.newlineIsSentenceBreak": "two",
            'annotators': annotators,
            'outputFormat': 'json'
        })
        return output
    
    # The code expects the document to contain exactly one sentence.
    document = 'Robots in popular culture are there to remind us of the awesomeness of '\
               'unbound human agency.'
    print('document: {0}'.format(document))
    
    # Parse the text
    annotations = get_stanford_annotations(document, port=9000,
                                           annotators='tokenize,ssplit,pos,lemma,depparse')
    tokens = annotations['sentences'][0]['tokens']
    
    # Load Stanford CoreNLP's dependency tree into a networkx graph
    edges = []
    dependencies = {}
    for edge in annotations['sentences'][0]['basic-dependencies']:
        edges.append((edge['governor'], edge['dependent']))
        dependencies[(min(edge['governor'], edge['dependent']),
                      max(edge['governor'], edge['dependent']))] = edge
    
    graph = nx.Graph(edges)
    #pprint(dependencies)
    #print('edges: {0}'.format(edges))
    
    # Find the shortest path
    token1 = 'Robots'
    token2 = 'awesomeness'
    for token in tokens:
        if token1 == token['originalText']:
            token1_index = token['index']
        if token2 == token['originalText']:
            token2_index = token['index']
    
    path = nx.shortest_path(graph, source=token1_index, target=token2_index)
    print('path: {0}'.format(path))
    
    for token_id in path:
        token = tokens[token_id-1]
        token_text = token['originalText']
        print('Node {0}\ttoken_text: {1}'.format(token_id,token_text))
    

    The output is:

    document: Robots in popular culture are there to remind us of the awesomeness of unbound human agency.
    path: [1, 5, 8, 12]
    Node 1  token_text: Robots
    Node 5  token_text: are
    Node 8  token_text: remind
    Node 12 token_text: awesomeness
    

    Note that Stanford CoreNLP can be tested online: http://nlp.stanford.edu:8080/parser/index.jsp

    This answer was tested with Stanford CoreNLP 3.6.0, pycorenlp 0.3.0, and Python 3.5 x64 on Windows 7 SP1 x64 Ultimate.
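
    The dependencies dict built in the code above (so far only used for debugging) can also recover the relation label of each edge on the path. A minimal sketch, assuming the standard CoreNLP JSON fields dep, governorGloss and dependentGloss:

    # Walk consecutive node pairs along the path and look up each edge's relation.
    for source_id, target_id in zip(path, path[1:]):
        edge = dependencies[(min(source_id, target_id), max(source_id, target_id))]
        print('{0} -[{1}]-> {2}'.format(edge['governorGloss'], edge['dep'], edge['dependentGloss']))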

  • 2021-02-07 07:57

    Your problem can easily be framed as a graph problem in which we have to find the shortest path between two nodes.

    To convert your dependency parse into a graph, we first have to deal with the fact that it comes as a string. You want to get this:

    'nsubj(are-5, Robots-1)\nxsubj(remind-8, Robots-1)\namod(culture-4, popular-3)\nprep_in(Robots-1, culture-4)\nroot(ROOT-0, are-5)\nadvmod(are-5, there-6)\naux(remind-8, to-7)\nxcomp(are-5, remind-8)\ndobj(remind-8, us-9)\ndet(awesomeness-12, the-11)\nprep_of(remind-8, awesomeness-12)\namod(agency-16, unbound-14)\namod(agency-16, human-15)\nprep_of(awesomeness-12, agency-16)'
    

    to look like this:

    [('are-5', 'Robots-1'), ('remind-8', 'Robots-1'), ('culture-4', 'popular-3'), ('Robots-1', 'culture-4'), ('ROOT-0', 'are-5'), ('are-5', 'there-6'), ('remind-8', 'to-7'), ('are-5', 'remind-8'), ('remind-8', 'us-9'), ('awesomeness-12', 'the-11'), ('remind-8', 'awesomeness-12'), ('agency-16', 'unbound-14'), ('agency-16', 'human-15'), ('awesomeness-12', 'agency-16')]
    

    This way you can feed the tuple list to the graph constructor from the networkx module, which will build a graph for you and provide a neat method that returns the length of the shortest path between two given nodes.

    Necessary imports

    import re
    import networkx as nx
    from practnlptools.tools import Annotator
    

    How to get your string in the desired tuple list format

    annotator = Annotator()
    text = """Robots in popular culture are there to remind us of the awesomeness of unbound human agency."""
    dep_parse = annotator.getAnnotations(text, dep_parse=True)['dep_parse']
    
    dp_list = dep_parse.split('\n')
    pattern = re.compile(r'.+?\((.+?), (.+?)\)')
    edges = []
    for dep in dp_list:
        m = pattern.search(dep)
        edges.append((m.group(1), m.group(2)))
    

    How to build the graph

    graph = nx.Graph(edges)  # Well that was easy
    

    How to compute shortest path length

    print(nx.shortest_path_length(graph, source='Robots-1', target='awesomeness-12'))
    

    This script will reveal that the shortest path given the dependency parse is actually of length 2, since you can get from Robots-1 to awesomeness-12 by going through remind-8:

    1. xsubj(remind-8, Robots-1) 
    2. prep_of(remind-8, awesomeness-12)
    

    If you don't like this result, you might want to think about filtering some dependencies, in this case not allowing the xsubj dependency to be added to the graph, as sketched below.
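
    For example, here is a sketch of the edge-building loop above with such a filter (it reuses dp_list and captures the relation name with an extra regex group):

    import re

    # Capture the relation name as well, and skip relations we do not want in the graph.
    excluded_relations = {'xsubj'}
    pattern = re.compile(r'(\w+)\((.+?), (.+?)\)')
    edges = []
    for dep in dp_list:
        m = pattern.search(dep)
        relation, head, dependent = m.group(1), m.group(2), m.group(3)
        if relation in excluded_relations:
            continue
        edges.append((head, dependent))

    With xsubj excluded, the shortest path from Robots-1 to awesomeness-12 should instead run through nsubj(are-5, Robots-1), xcomp(are-5, remind-8) and prep_of(remind-8, awesomeness-12), i.e. have length 3.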
