How to find the shortest dependency path between two words in Python?


I am trying to find the dependency path between two words in Python, given a dependency tree.

For the sentence:

Robots in popular culture are there to remind us of the awesomeness of unbound human agency.

3 Answers
  • 2021-02-07 07:51

    HugoMailhot's answer is great. I'll write something similar for spacy users who want to find the shortest dependency path between two words (whereas HugoMailhot's answer relies on practNLPTools).

    The sentence:

    Robots in popular culture are there to remind us of the awesomeness of unbound human agency.

    has the following dependency tree:

    Here is the code to find the shortest dependency path between two words:

    import networkx as nx
    import spacy
    nlp = spacy.load('en')
    
    # https://spacy.io/docs/usage/processing-text
    document = nlp(u'Robots in popular culture are there to remind us of the awesomeness of unbound human agency.', parse=True)
    
    print('document: {0}'.format(document))
    
    # Load spacy's dependency tree into a networkx graph
    edges = []
    for token in document:
        # FYI https://spacy.io/docs/api/token
        for child in token.children:
            edges.append(('{0}-{1}'.format(token.lower_,token.i),
                          '{0}-{1}'.format(child.lower_,child.i)))
    
    graph = nx.Graph(edges)
    
    # https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.shortest_paths.html
    print(nx.shortest_path_length(graph, source='robots-0', target='awesomeness-11'))
    print(nx.shortest_path(graph, source='robots-0', target='awesomeness-11'))
    print(nx.shortest_path(graph, source='robots-0', target='agency-15'))
    

    Output:

    4
    ['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11']
    ['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11', 'of-12', 'agency-15']
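
    If you also want the dependency labels along the path, a small variation (a sketch, not part of the original answer) is to store each child token's dep_ attribute as edge data and read it back off the shortest path:

    # Same traversal as above, but keep the dependency label on each edge.
    # child.dep_ is the relation between the child and its head (token).
    labeled_edges = []
    for token in document:
        for child in token.children:
            labeled_edges.append(('{0}-{1}'.format(token.lower_, token.i),
                                  '{0}-{1}'.format(child.lower_, child.i),
                                  {'dep': child.dep_}))

    labeled_graph = nx.Graph()
    labeled_graph.add_edges_from(labeled_edges)

    path = nx.shortest_path(labeled_graph, source='robots-0', target='awesomeness-11')
    for source, target in zip(path, path[1:]):
        print('{0} -> {1} ({2})'.format(source, target, labeled_graph[source][target]['dep']))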
    

    To install spacy and networkx:

    sudo pip install networkx 
    sudo pip install spacy
    sudo python -m spacy.en.download parser # will take 0.5 GB
    

    Some benchmarks regarding spacy's dependency parsing: https://spacy.io/docs/api/
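
    Note that the install commands and the 'en' model name above target spacy 1.x. In spacy 2.x/3.x the model handling changed; a minimal sketch of the equivalent setup (assuming the small English model en_core_web_sm) is:

    pip install networkx spacy
    python -m spacy download en_core_web_sm

    and in the script:

    import spacy
    nlp = spacy.load('en_core_web_sm')
    # The dependency parser runs by default; no parse=True argument is needed.
    document = nlp('Robots in popular culture are there to remind us of the awesomeness of unbound human agency.')

    The rest of the code (building the networkx graph from token.children) works unchanged.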

  • 2021-02-07 07:55

    This answer relies on Stanford CoreNLP to obtain the dependency tree of a sentence, and borrows a fair amount of the networkx code from HugoMailhot's answer.

    Before running the code, one needs to:

    1. sudo pip install pycorenlp (python interface for Stanford CoreNLP)
    2. Download Stanford CoreNLP
    3. Start a Stanford CoreNLP server as follows:

      java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 50000
      

    Then one can run the following code to find the shortest dependency path between two words:

    import networkx as nx
    from pycorenlp import StanfordCoreNLP
    from pprint import pprint
    
    nlp = StanfordCoreNLP('http://localhost:{0}'.format(9000))
    def get_stanford_annotations(text, port=9000,
                                 annotators='tokenize,ssplit,pos,lemma,depparse,parse'):
        output = nlp.annotate(text, properties={
            "timeout": "10000",
            "ssplit.newlineIsSentenceBreak": "two",
            'annotators': annotators,
            'outputFormat': 'json'
        })
        return output
    
    # The code expects the document to contain exactly one sentence.
    document = 'Robots in popular culture are there to remind us of the awesomeness of '\
               'unbound human agency.'
    print('document: {0}'.format(document))
    
    # Parse the text
    annotations = get_stanford_annotations(document, port=9000,
                                           annotators='tokenize,ssplit,pos,lemma,depparse')
    tokens = annotations['sentences'][0]['tokens']
    
    # Load Stanford CoreNLP's dependency tree into a networkx graph
    edges = []
    dependencies = {}
    for edge in annotations['sentences'][0]['basic-dependencies']:
        edges.append((edge['governor'], edge['dependent']))
        dependencies[(min(edge['governor'], edge['dependent']),
                      max(edge['governor'], edge['dependent']))] = edge
    
    graph = nx.Graph(edges)
    #pprint(dependencies)
    #print('edges: {0}'.format(edges))
    
    # Find the shortest path
    token1 = 'Robots'
    token2 = 'awesomeness'
    for token in tokens:
        if token1 == token['originalText']:
            token1_index = token['index']
        if token2 == token['originalText']:
            token2_index = token['index']
    
    path = nx.shortest_path(graph, source=token1_index, target=token2_index)
    print('path: {0}'.format(path))
    
    for token_id in path:
        token = tokens[token_id-1]
        token_text = token['originalText']
        print('Node {0}\ttoken_text: {1}'.format(token_id,token_text))
    

    The output is:

    document: Robots in popular culture are there to remind us of the awesomeness of unbound human agency.
    path: [1, 5, 8, 12]
    Node 1  token_text: Robots
    Node 5  token_text: are
    Node 8  token_text: remind
    Node 12 token_text: awesomeness
    

    Note that Stanford CoreNLP can be tested online: http://nlp.stanford.edu:8080/parser/index.jsp

    This answer was tested with Stanford CoreNLP 3.6.0, pycorenlp 0.3.0, and Python 3.5 x64 on Windows 7 SP1 x64 Ultimate.
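
    The dependencies dict built in the code above (so far only used for debugging) can also recover the relation label of each edge on the path. A minimal sketch, assuming the standard CoreNLP JSON fields dep, governorGloss and dependentGloss:

    # Walk consecutive node pairs along the path and look up each edge's relation.
    for source_id, target_id in zip(path, path[1:]):
        edge = dependencies[(min(source_id, target_id), max(source_id, target_id))]
        print('{0} -[{1}]-> {2}'.format(edge['governorGloss'], edge['dep'], edge['dependentGloss']))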

  • 2021-02-07 07:57

    Your problem can easily be framed as a graph problem in which we have to find the shortest path between two nodes.

    To convert your dependency parse into a graph, we first have to deal with the fact that it comes as a string. You want to get this:

    'nsubj(are-5, Robots-1)\nxsubj(remind-8, Robots-1)\namod(culture-4, popular-3)\nprep_in(Robots-1, culture-4)\nroot(ROOT-0, are-5)\nadvmod(are-5, there-6)\naux(remind-8, to-7)\nxcomp(are-5, remind-8)\ndobj(remind-8, us-9)\ndet(awesomeness-12, the-11)\nprep_of(remind-8, awesomeness-12)\namod(agency-16, unbound-14)\namod(agency-16, human-15)\nprep_of(awesomeness-12, agency-16)'
    

    to look like this:

    [('are-5', 'Robots-1'), ('remind-8', 'Robots-1'), ('culture-4', 'popular-3'), ('Robots-1', 'culture-4'), ('ROOT-0', 'are-5'), ('are-5', 'there-6'), ('remind-8', 'to-7'), ('are-5', 'remind-8'), ('remind-8', 'us-9'), ('awesomeness-12', 'the-11'), ('remind-8', 'awesomeness-12'), ('agency-16', 'unbound-14'), ('agency-16', 'human-15'), ('awesomeness-12', 'agency-16')]
    

    This way you can feed the tuple list to the graph constructor from the networkx module, which will build a graph for you and provide a neat method that returns the length of the shortest path between two given nodes.

    Necessary imports

    import re
    import networkx as nx
    from practnlptools.tools import Annotator
    

    How to get your string in the desired tuple list format

    annotator = Annotator()
    text = """Robots in popular culture are there to remind us of the awesomeness of unbound human agency."""
    dep_parse = annotator.getAnnotations(text, dep_parse=True)['dep_parse']
    
    dp_list = dep_parse.split('\n')
    pattern = re.compile(r'.+?\((.+?), (.+?)\)')
    edges = []
    for dep in dp_list:
        m = pattern.search(dep)
        edges.append((m.group(1), m.group(2)))
    

    How to build the graph

    graph = nx.Graph(edges)  # Well that was easy
    

    How to compute shortest path length

    print(nx.shortest_path_length(graph, source='Robots-1', target='awesomeness-12'))
    

    This script will reveal that the shortest path given the dependency parse is actually of length 2, since you can get from Robots-1 to awesomeness-12 by going through remind-8:

    1. xsubj(remind-8, Robots-1) 
    2. prep_of(remind-8, awesomeness-12)
    

    If you don't like this result, you might want to think about filtering some dependencies, in this case not allowing the xsubj dependency to be added to the graph, as sketched below.
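
    For example, here is a sketch of the edge-building loop above with such a filter (it reuses dp_list and captures the relation name with an extra regex group):

    import re

    # Capture the relation name as well, and skip relations we do not want in the graph.
    excluded_relations = {'xsubj'}
    pattern = re.compile(r'(\w+)\((.+?), (.+?)\)')
    edges = []
    for dep in dp_list:
        m = pattern.search(dep)
        relation, head, dependent = m.group(1), m.group(2), m.group(3)
        if relation in excluded_relations:
            continue
        edges.append((head, dependent))

    With xsubj excluded, the shortest path from Robots-1 to awesomeness-12 should instead run through nsubj(are-5, Robots-1), xcomp(are-5, remind-8) and prep_of(remind-8, awesomeness-12), i.e. have length 3.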
