I try to find the dependency path between two words in Python given dependency tree.
For sentence
Robots in popular culture are there to remind u
HugoMailhot's answer is great. I'll write something similar for spacy users who want to find the shortest dependency path between two words (whereas HugoMailhot's answer relies on practNLPTools).
The sentence:
Robots in popular culture are there to remind us of the awesomeness of unbound human agency.
has the following dependency tree:
Here is the code to find the shortest dependency path between two words:
import networkx as nx
import spacy
nlp = spacy.load('en')
# https://spacy.io/docs/usage/processing-text
document = nlp(u'Robots in popular culture are there to remind us of the awesomeness of unbound human agency.', parse=True)
print('document: {0}'.format(document))
# Load spacy's dependency tree into a networkx graph
edges = []
for token in document:
# FYI https://spacy.io/docs/api/token
for child in token.children:
edges.append(('{0}-{1}'.format(token.lower_,token.i),
'{0}-{1}'.format(child.lower_,child.i)))
graph = nx.Graph(edges)
# https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.shortest_paths.html
print(nx.shortest_path_length(graph, source='robots-0', target='awesomeness-11'))
print(nx.shortest_path(graph, source='robots-0', target='awesomeness-11'))
print(nx.shortest_path(graph, source='robots-0', target='agency-15'))
Output:
4
['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11']
['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11', 'of-12', 'agency-15']
To install spacy and networkx:
sudo pip install networkx
sudo pip install spacy
sudo python -m spacy.en.download parser # will take 0.5 GB
Some benchmarks regarding spacy's dependency parsing: https://spacy.io/docs/api/
This answer relies on Stanford CoreNLP to obtain the dependency tree of a sentence. It borrows quite some code from HugoMailhot's answer when using networkx.
Before running the code, one needs to:
sudo pip install pycorenlp
(python interface for Stanford CoreNLP)Start a Stanford CoreNLP server as follows:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 50000
Then one can run the following code to find the shortest dependency path between two words:
import networkx as nx
from pycorenlp import StanfordCoreNLP
from pprint import pprint
nlp = StanfordCoreNLP('http://localhost:{0}'.format(9000))
def get_stanford_annotations(text, port=9000,
annotators='tokenize,ssplit,pos,lemma,depparse,parse'):
output = nlp.annotate(text, properties={
"timeout": "10000",
"ssplit.newlineIsSentenceBreak": "two",
'annotators': annotators,
'outputFormat': 'json'
})
return output
# The code expects the document to contains exactly one sentence.
document = 'Robots in popular culture are there to remind us of the awesomeness of'\
'unbound human agency.'
print('document: {0}'.format(document))
# Parse the text
annotations = get_stanford_annotations(document, port=9000,
annotators='tokenize,ssplit,pos,lemma,depparse')
tokens = annotations['sentences'][0]['tokens']
# Load Stanford CoreNLP's dependency tree into a networkx graph
edges = []
dependencies = {}
for edge in annotations['sentences'][0]['basic-dependencies']:
edges.append((edge['governor'], edge['dependent']))
dependencies[(min(edge['governor'], edge['dependent']),
max(edge['governor'], edge['dependent']))] = edge
graph = nx.Graph(edges)
#pprint(dependencies)
#print('edges: {0}'.format(edges))
# Find the shortest path
token1 = 'Robots'
token2 = 'awesomeness'
for token in tokens:
if token1 == token['originalText']:
token1_index = token['index']
if token2 == token['originalText']:
token2_index = token['index']
path = nx.shortest_path(graph, source=token1_index, target=token2_index)
print('path: {0}'.format(path))
for token_id in path:
token = tokens[token_id-1]
token_text = token['originalText']
print('Node {0}\ttoken_text: {1}'.format(token_id,token_text))
The output is:
document: Robots in popular culture are there to remind us of the awesomeness of unbound human agency.
path: [1, 5, 8, 12]
Node 1 token_text: Robots
Node 5 token_text: are
Node 8 token_text: remind
Node 12 token_text: awesomeness
Note that Stanford CoreNLP can be tested online: http://nlp.stanford.edu:8080/parser/index.jsp
This answer was tested with Stanford CoreNLP 3.6.0., pycorenlp 0.3.0 and python 3.5 x64 on Windows 7 SP1 x64 Ultimate.
Your problem can easily be conceived as a graph problem where we have to find the shortest path between two nodes.
To convert your dependency parse in a graph, we first have to deal with the fact that it comes as a string. You want to get this:
'nsubj(are-5, Robots-1)\nxsubj(remind-8, Robots-1)\namod(culture-4, popular-3)\nprep_in(Robots-1, culture-4)\nroot(ROOT-0, are-5)\nadvmod(are-5, there-6)\naux(remind-8, to-7)\nxcomp(are-5, remind-8)\ndobj(remind-8, us-9)\ndet(awesomeness-12, the-11)\nprep_of(remind-8, awesomeness-12)\namod(agency-16, unbound-14)\namod(agency-16, human-15)\nprep_of(awesomeness-12, agency-16)'
to look like this:
[('are-5', 'Robots-1'), ('remind-8', 'Robots-1'), ('culture-4', 'popular-3'), ('Robots-1', 'culture-4'), ('ROOT-0', 'are-5'), ('are-5', 'there-6'), ('remind-8', 'to-7'), ('are-5', 'remind-8'), ('remind-8', 'us-9'), ('awesomeness-12', 'the-11'), ('remind-8', 'awesomeness-12'), ('agency-16', 'unbound-14'), ('agency-16', 'human-15'), ('awesomeness-12', 'agency-16')]
This way you can feed the tuple list to a graph constructor from the networkx module that will analyze the list and build a graph for you, plus give you a neat method that gives you the length of the shortest path between two given nodes.
Necessary imports
import re
import networkx as nx
from practnlptools.tools import Annotator
How to get your string in the desired tuple list format
annotator = Annotator()
text = """Robots in popular culture are there to remind us of the awesomeness of unbound human agency."""
dep_parse = annotator.getAnnotations(text, dep_parse=True)['dep_parse']
dp_list = dep_parse.split('\n')
pattern = re.compile(r'.+?\((.+?), (.+?)\)')
edges = []
for dep in dp_list:
m = pattern.search(dep)
edges.append((m.group(1), m.group(2)))
How to build the graph
graph = nx.Graph(edges) # Well that was easy
How to compute shortest path length
print(nx.shortest_path_length(graph, source='Robots-1', target='awesomeness-12'))
This script will reveal that the shortest path given the dependency parse is actually of length 2, since you can get from Robots-1
to awesomeness-12
by going through remind-8
1. xsubj(remind-8, Robots-1)
2. prep_of(remind-8, awesomeness-12)
If you don't like this result, you might want to think about filtering some dependencies, in this case not allow the xsubj
dependency to be added to the graph.