How to speed up program that finds the shortest path between two wikipedia articles

Submitted by 三世轮回 on 2020-01-03 04:58:30

Question


I recently coded a program that finds the shortest path between two Wikipedia articles. The problem is that getting ALL the links from a page and putting them into the graph takes a long time; finding the path is the easy part. Basically, what I'm doing is this:

import networkx as nx
import pages  # my own module containing import_page (shown below)

startingPage = 'Lisbon'
target = 'Adolf Hitler'
graph = nx.DiGraph()
graph.add_node(startingPage)
found = pages.import_page(graph, startingPage, target)

while not found:
    for node in list(graph):
        if graph.out_degree(node) == 0:  # only expand pages whose links haven't been fetched yet
            found = pages.import_page(graph, node, target)
        if found:
            break

And my import_page function is this one:

import json
import urllib.request as url
from urllib import parse

def import_page(graph, starting, target):
    general_str = 'https://en.wikipedia.org/w/api.php?action=query&prop=links&pllimit=max&format=json&titles='
    data_str = general_str + parse.quote(starting)  # URL-encode the title
    response = url.urlopen(data_str)
    data = json.loads(response.read())
    pageId = list(data['query']['pages'].keys())
    print(starting)
    if pageId[0] == '-1':  # the page doesn't exist on Wikipedia
        return False
    elif 'links' not in data['query']['pages'][pageId[0]]:  # the page has no links in it
        return False

    for jsonObject in data['query']['pages'][pageId[0]]['links']:

        graph.add_node(jsonObject['title'])
        graph.add_edge(starting, jsonObject['title'])

        if jsonObject['title'] == target:
            return True

    while 'batchcomplete' not in data:  # keep following 'plcontinue' until the batch is complete

        continueId = data['continue']['plcontinue']
        continue_str = data_str + '&plcontinue=' + parse.quote(continueId)
        response = url.urlopen(continue_str)
        data = json.loads(response.read())

        for jsonObject in data['query']['pages'][pageId[0]]['links']:
            graph.add_node(jsonObject['title'])
            graph.add_edge(starting, jsonObject['title'])
            if jsonObject['title'] == target:
                return True

    return False

The problem is that for any distance bigger than 2-3 links it takes an immense amount of time. Any ideas on how I can speed it up?


Answer 1:


I used an approach as @Tgr pointed out, exploiting the small-world property. If you use a weighted network, you can limit the search to a sub-graph large enough to encompass the relevant hubs, yet small enough to be served through a RESTful web API.

You may want to check out the iGraph module rather than networkx, for a smaller memory footprint.

With the approach I suggested, I have been able to obtain shortest paths connecting up to 5 queried Wikipedia articles, with a memory footprint of up to 100 MB for the sub-graph created in real time. A shortest path between two topics takes less than 1 s.

I would be happy to share a link to my project, which actually computes a weighted knowledge network for Wikipedia to allow searching for connections between multiple topics - would it break SO policy, or could it be useful for the OP and the discussion of his question?

EDIT


Thank you @Tgr for clarifying the policy.

Nifty.works is a prototype platform to search for connections between inter-disciplinary fields. The knowledge graph is a subset of Wikidata paired with English Wikipedia.

As an example for the OP, this query shows the shortest paths between five Wikipedia articles: a subgraph of the connections between "Shortest Path Problem", "A star search", "networkx", "knowledge graph" and "semantic network".

I computed the knowledge graph of Wikipedia as a weighted network. The network has small-world properties. A query for connections (paths) between articles is made by delimiting a portion of the knowledge graph (a sub-graph).

With this approach it is possible to serve a graph search fast enough to provide insights in knowledge discovery, even on small server machines.
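As a rough illustration of this idea (not the actual Nifty.works code), here is a minimal sketch in networkx, assuming you already have a weighted graph G of article links; the edge weights and the neighbourhood radius below are arbitrary assumptions:

import networkx as nx

def query_subgraph_path(G, source, target, radius=2, weight='weight'):
    # Delimit a portion of the knowledge graph: keep only nodes within
    # 'radius' hops of either endpoint, then search that smaller sub-graph.
    around_source = nx.ego_graph(G, source, radius=radius)
    around_target = nx.ego_graph(G, target, radius=radius)
    sub = G.subgraph(set(around_source) | set(around_target))
    # Raises NetworkXNoPath if the two articles are not connected inside the sub-graph.
    return nx.shortest_path(sub, source, target, weight=weight)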

Here you can find examples of gamification of shortest paths between two articles of English Wikipedia, where each pair has a distance bigger than 3 links - that is, they are not first neighbours: e.g. "Machine Learning" and "Life" (here is a JSON of the queried subgraph).

You might even want to add parameters to adjust the size of your weighted sub-graph, so as to obtain different results. As an example, see the differences between:

  • machine learning - life: query of shortest paths on weighted subgraph of the knowledge graph of English Wikipedia (small-world network) (1)

  • machine learning - life: query of shortest paths on weighted subgraph of the knowledge graph of English Wikipedia (small-world network) (2)

Finally, also look at this question: https://stackoverflow.com/a/16030045/305883




Answer 2:


Finding the shortest path with certainty is practically impossible with a simple algorithm and a web API. If the shortest path has N steps, you need to walk every possible path of length N-1 or less to be sure. With millions of articles and dozens to hundreds of links from each, that is infeasible unless you are really lucky and the shortest path is just 1-2 links. If it is, say, 10 steps away, you would have to make billions of requests, which would take years.
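As a rough back-of-the-envelope estimate (the branching factor of 100 links per article is only an illustrative assumption):

# Assume each article links to roughly 100 others (illustrative figure only).
branching = 100
for depth in range(1, 7):
    # Roughly how many pages you would have to fetch to rule out
    # every possible path of this length.
    print('depth %d: about %d page fetches' % (depth, branching ** depth))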

If you just want to find a reasonably short path most of the time, you can try something like the A* search algorithm with a good heuristic. For example, you could hypothesize some sort of small-world property and try to identify topic hubs which are close to other topic hubs and also close to all articles in that topic. Or you could score candidates on being on the same topic, or in the same historical period, as the target.
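A minimal sketch of such an A* search, assuming a hypothetical get_links(title) helper (for example, a wrapper around the API query in the question) and a placeholder heuristic() that you would replace with a real topic-similarity score:

import heapq

def heuristic(title, target):
    # Placeholder estimate of the remaining distance; a real version could
    # score topic/category overlap or historical-period similarity with the target.
    return 0 if title == target else 1

def a_star_wiki(start, target, get_links, max_expansions=10000):
    # get_links(title) is assumed to return the titles linked from a page.
    frontier = [(heuristic(start, target), 0, start, [start])]
    seen = {start}
    expansions = 0
    while frontier and expansions < max_expansions:
        _, cost, page, path = heapq.heappop(frontier)
        if page == target:
            return path
        expansions += 1
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                priority = cost + 1 + heuristic(link, target)
                heapq.heappush(frontier, (priority, cost + 1, link, path + [link]))
    return None  # no path found within the expansion budget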



Source: https://stackoverflow.com/questions/40877495/how-to-speed-up-program-that-finds-the-shortest-path-between-two-wikipedia-artic
