The question:
Find shortest path between two articles in english Wikipedia. Path between article A and B exist if there are articles C(i) and there is a link in article A that leads to article C(1), in article C(1) link that leads to article C(2), ..., in article C(n) is link that leads to article B
I'm using Python. URL to download wikipedia article:
- http://en.wikipedia.org/wiki/Nazwa_artykułu
- http://en.wikipedia.org/w/index.php?title?Nazwa_artykułu&printable=yes
- Wikipedia API
I have edited my source code, but it still does not work when I include those articles in codes can any one tell me what am I messing here?
This is my code:
import urllib2
import re
import xml.etree.ElementTree as ET
text = ET.fromstring(F_D.text.encode('UTF-8'))
text = ET.fromstring(P.text.encode('UTF-8'))
F_D=requests.get('http://en.wikipedia.org/wiki/Ant_colony_optimization_algorithms')
P=requests.get('http://en.wikipedia.org/wiki/Wikipedia:Unusual_articles')
links = text.findall('.//*[@id=”mw-content-text”]/p/a')
links=E_D
E_D = graph_dict
E_D[start] = 0
for vertex in E_D:
F_D[vertex] = E_D[vertex]
if vertex == end: break
for edge in graph[vertex]:
path_distance = F_D[vertex] + graph[vertex][edge]
if edge in F_D:
if path_distance < F_D[edge]:
#raise ValueError,
elif edge not in E_D or path_distance < E_D[edge]:
E_D[edge] = path_distance
[edge] = vertex
return (F_D,P)
def Shortest_Path(graph,start,end):
F_D,P = D_Algorithm(graph,start,end)
path = []
while 1:
path.append(end)
if end == start: break
end = P[end]
path.reverse()
return path
We are looking at graph exploration... why should you be considering Dijkstra's algorithm??? IMHO... change the approach.
First, you need a good heuristic function. For every node you expand, you need to geusstimate the distance of that node from the target/goal node. Now... how you compute the heuristic is the real challenge here. You may perhaps do a keyword mapping between the current wiki page and your destination page. A percentage of match may give you the estimate. Or... try to guess the relevance of content between the two pages. I have a hunch... perhaps a Neural Network may help you here. But, this may not indicate optimal estimate either. I'm not sure. Once you figure out a suitable way of doing this, use A* search algorithm.
Search and explore the heuristic function, do not go for breadth first search, you'll end up no where in the vast wide world of wikipedia!
Given the number of articles on wikipedia, it would take a unaffordable time to compute THE shortest (my assumption - I haven't tried).
The real problem is to find an acceptable and efficent short path between two articles.
Algorithms that deal with this kind problem are related to The travelling salesman problem. It could be a good point to start from.
IIRC google or yahoo bots use Ant Colony Optimization to get the shortest acceptable in optimized time. You could check this SO question: Where can I learn more about "ant colony" optimizations?
I'm personnally also fond of the genetic algorithms approach to find an acceptable optimum in a certain amount of time.
I have just looked at that image and that sets the number of articles to 4.000.000 for en.wikipedia.com in 2013. Much less than I thought indeed.
EDIT: I first stated it was a NP-Hard problem and commenters explain it's not.
来源:https://stackoverflow.com/questions/15990639/find-shortest-path-between-two-articles-in-english-wikipedia-in-python