Find shortest path between two articles in English Wikipedia in Python

Posted by 喜夏-厌秋 on 2019-12-03 11:32:59

We are looking at graph exploration, so why consider Dijkstra's algorithm at all? IMHO, you should change the approach.

First, you need a good heuristic function. For every node you expand, you need to guesstimate the distance from that node to the target/goal node. How you compute the heuristic is the real challenge here. You could perhaps do a keyword mapping between the current wiki page and your destination page; the percentage of matching keywords may give you the estimate. Or try to guess how related the content of the two pages is. I have a hunch that a neural network may help you here, but it may not give an optimal estimate either. I'm not sure. Once you figure out a suitable way of doing this, use the A* search algorithm.
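A minimal sketch of that idea, under stated assumptions: the `PAGES` dictionary is a hypothetical toy stand-in for Wikipedia (title mapped to page text and outgoing links), and the helper names `keywords`, `heuristic` and `a_star` are made up for illustration. The heuristic here is one minus the Jaccard overlap of the two pages' keyword sets. Because every link costs 1 and this estimate never exceeds 1, it cannot overestimate the remaining distance, so A* still returns a shortest path; a more aggressive heuristic would prune more but could lose that guarantee.

```python
import heapq

# Hypothetical toy data: title -> (page text, outgoing links).
PAGES = {
    "Python (language)": ("programming language created by guido van rossum",
                          ["Guido van Rossum", "Programming language"]),
    "Guido van Rossum": ("dutch programmer guido van rossum created python",
                         ["Netherlands", "Python (language)"]),
    "Programming language": ("formal language for writing computer programs",
                             ["Computer"]),
    "Netherlands": ("country in europe capital amsterdam",
                    ["Amsterdam", "Europe"]),
    "Computer": ("machine that runs programs", []),
    "Europe": ("continent europe countries", ["Netherlands"]),
    "Amsterdam": ("amsterdam capital city of the netherlands in europe", []),
}


def keywords(text):
    """Very crude keyword extraction: the set of lowercase words."""
    return set(text.lower().split())


def heuristic(pages, node, goal):
    """Estimate the distance to the goal as 1 - Jaccard keyword overlap."""
    if node == goal:
        return 0.0
    a, b = keywords(pages[node][0]), keywords(pages[goal][0])
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def a_star(pages, start, goal):
    """Return a list of titles from start to goal, or None if unreachable."""
    # Frontier entries are (f = g + h, g, node, path so far).
    frontier = [(heuristic(pages, start, goal), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found
        for nxt in pages[node][1]:
            if nxt not in pages:
                continue  # link to a page outside the toy data set
            new_g = g + 1  # every link costs one click
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                f = new_g + heuristic(pages, nxt, goal)
                heapq.heappush(frontier, (f, new_g, nxt, path + [nxt]))
    return None


print(a_star(PAGES, "Python (language)", "Amsterdam"))
```

On real data, the `pages` lookup would have to be backed by a Wikipedia dump or API, which this sketch deliberately leaves out.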

Spend your effort on finding and refining the heuristic function. Do not go for breadth-first search; you'll end up nowhere in the vast, wide world of Wikipedia!

Stephane Rolland

Given the number of articles on Wikipedia, it would take an unaffordable amount of time to compute THE shortest path (my assumption; I haven't tried).

The real problem is to efficiently find an acceptably short path between two articles.

Algorithms that deal with this kind of problem are related to the travelling salesman problem. It could be a good starting point.

IIRC, Google or Yahoo bots use Ant Colony Optimization to find an acceptably short path within a bounded time. You could check this SO question: Where can I learn more about "ant colony" optimizations?

I'm personally also fond of the genetic algorithm approach for finding an acceptable optimum in a limited amount of time.


I have just looked at that image, and it puts the number of articles at 4,000,000 for en.wikipedia.org in 2013. Much less than I thought, indeed.

EDIT: I first stated that it was an NP-hard problem, and commenters explained that it's not.
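To illustrate why it is not NP-hard: when every link counts as one step, a plain breadth-first search finds a shortest path in time proportional to the number of pages plus links it visits. The sketch below assumes a hypothetical in-memory `links` dictionary (title mapped to a list of linked titles); the function name `bfs_shortest_path` and the toy graph are made up for illustration, and nothing here says how it would perform on the full Wikipedia link graph.

```python
from collections import deque


def bfs_shortest_path(links, start, goal):
    """Return a shortest list of titles from start to goal, or None."""
    parents = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent pointers back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in links.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical toy link graph.
links = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(bfs_shortest_path(links, "A", "E"))  # ['A', 'B', 'D', 'E']
```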
