Find shortest path between two articles in English Wikipedia in Python

天涯浪人 2021-02-14 00:43

The question:

Find the shortest path between two articles in English Wikipedia. A path between articles A and B exists if there are articles C(i) and there is a

2 Answers
  • 2021-02-14 01:29

    We are looking at graph exploration, so why would you consider Dijkstra's algorithm? IMHO, change the approach.

    First, you need a good heuristic function. For every node you expand, you need to guesstimate the distance from that node to the target/goal node. How you compute the heuristic is the real challenge here. You could do a keyword mapping between the current wiki page and your destination page; the percentage of matches may give you the estimate. Or try to guess the relevance of content between the two pages. I have a hunch that a neural network may help you here, but this may not give an optimal estimate either; I'm not sure. Once you figure out a suitable way of doing this, use the A* search algorithm. Keep in mind that A* only guarantees the shortest path when the heuristic never overestimates the remaining distance.

    Search and explore the heuristic function; do not go for breadth-first search, or you'll end up nowhere in the vast wide world of Wikipedia!
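
As a rough illustration of the idea above, here is a minimal sketch of A* over a link graph with a keyword-overlap heuristic. Everything in it is an assumption made for the example: the `LINKS` and `KEYWORDS` dictionaries are hypothetical toy data standing in for real Wikipedia pages, and `keyword_heuristic` (Jaccard overlap scaled by an arbitrary `scale` factor) is just one possible way to turn a "percentage of match" into a distance guess. Nothing guarantees such a heuristic never overestimates on real data, so the returned path may not always be the shortest.

```python
import heapq

# Hypothetical toy data: article -> outgoing links, article -> keyword set.
LINKS = {
    "Python": ["Programming language", "Monty Python"],
    "Programming language": ["Computer science", "Compiler"],
    "Monty Python": ["Comedy"],
    "Computer science": ["Graph theory"],
    "Compiler": ["Computer science"],
    "Comedy": [],
    "Graph theory": [],
}
KEYWORDS = {
    "Python": {"language", "code", "snake"},
    "Programming language": {"language", "code", "computer"},
    "Monty Python": {"comedy", "film"},
    "Computer science": {"computer", "algorithm", "graph"},
    "Compiler": {"code", "computer"},
    "Comedy": {"comedy", "film"},
    "Graph theory": {"graph", "algorithm", "math"},
}


def keyword_heuristic(article, goal, scale=2.0):
    """Guess the remaining distance from keyword overlap (Jaccard similarity)."""
    if article == goal:
        return 0.0
    a, b = KEYWORDS.get(article, set()), KEYWORDS.get(goal, set())
    union = a | b
    similarity = len(a & b) / len(union) if union else 0.0
    return scale * (1.0 - similarity)  # less overlap -> larger guess


def a_star(start, goal, neighbors, heuristic):
    """Return a list of articles from start to goal, or None if unreachable."""
    # Heap entries are (estimated total cost, links followed so far, article).
    frontier = [(heuristic(start, goal), 0, start)]
    best_cost = {start: 0}
    parent = {start: None}
    while frontier:
        _, cost, article = heapq.heappop(frontier)
        if article == goal:
            path = []
            while article is not None:  # walk parent pointers back to start
                path.append(article)
                article = parent[article]
            return path[::-1]
        if cost > best_cost[article]:
            continue  # stale heap entry; a cheaper route was found later
        for nxt in neighbors(article):
            new_cost = cost + 1  # every link counts as one step
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = article
                heapq.heappush(
                    frontier, (new_cost + heuristic(nxt, goal), new_cost, nxt)
                )
    return None


path = a_star("Python", "Graph theory", lambda a: LINKS.get(a, []), keyword_heuristic)
print(path)
```

The `neighbors` argument is a plain callable, so the toy dictionary could be swapped for a function that fetches a page's outgoing links on demand. With a heuristic that always returns 0, the same loop behaves like a uniform-cost search.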

  • 2021-02-14 01:33

    Given the number of articles on Wikipedia, it would take an unaffordable amount of time to compute THE shortest path (my assumption - I haven't tried).

    The real problem is to find an acceptable and efficient short path between two articles.

    Algorithms that deal with this kind of problem are related to the travelling salesman problem. It could be a good place to start.

    IIRC Google or Yahoo bots use Ant Colony Optimization to get the shortest acceptable path in optimized time. You could check this SO question: Where can I learn more about "ant colony" optimizations?

    I'm personally also fond of the genetic algorithms approach to find an acceptable optimum in a certain amount of time.


    I have just looked at that image, and it puts the number of articles at 4,000,000 for en.wikipedia.com in 2013. Far fewer than I thought, indeed.

    EDIT: I first stated it was an NP-hard problem, and commenters explained that it is not.
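
To make the point about NP-hardness concrete: when every link counts as one step, a plain breadth-first search finds a shortest path while visiting each article and each link at most once. The sketch below assumes the link graph is already available in memory as a dictionary; `TOY_LINKS` and `bfs_shortest_path` are hypothetical names invented for this example, not part of any Wikipedia tooling.

```python
from collections import deque

# Hypothetical toy link graph: article -> articles it links to.
TOY_LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}


def bfs_shortest_path(start, goal, links):
    """Return one shortest path (fewest links) from start to goal, or None."""
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        article = queue.popleft()
        if article == goal:
            path = []
            while article is not None:  # walk parent pointers back to start
                path.append(article)
                article = parent[article]
            return path[::-1]
        for nxt in links.get(article, []):
            if nxt not in parent:  # first visit is via a shortest route
                parent[nxt] = article
                queue.append(nxt)
    return None


print(bfs_shortest_path("A", "F", TOY_LINKS))
```

Whether this is fast enough in practice depends on how the links are obtained; the sketch says nothing about the cost of fetching or storing the real link graph.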
