Question
I'd like to scrape some (not much) info from Wikipedia. Say I have a list of universities and their Wikipedia pages. Can I use an XPath expression to find the website (domain) of each university?
So for instance, if I get the page
curl http://en.wikipedia.org/wiki/Vienna_University_of_Technology
an XPath expression should find the domain:
http://www.tuwien.ac.at
Ideally, this should work with the Linux xgrep command-line tool, or an equivalent.
Answer 1:
With the h prefix bound to the http://www.w3.org/1999/xhtml namespace URI:
/h:html/h:body/h:div[@id='content']
/h:div[@id='bodyContent']
/h:table[@class='infobox vcard']
/h:tr[h:th='Website']
/h:td/h:a/@href
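If you don't have xgrep handy, here is a minimal sketch of the same lookup in Python with lxml; the library choice, the User-Agent string, and the assumption that the markup still matches the 'infobox vcard' structure above are all mine, so treat it as illustrative rather than definitive:

from urllib.request import urlopen, Request
from lxml import etree

URL = "http://en.wikipedia.org/wiki/Vienna_University_of_Technology"

# Wikipedia tends to reject requests that carry no User-Agent header.
req = Request(URL, headers={"User-Agent": "xpath-example/0.1"})
# Parse as XML; this relies on the page being well-formed, as noted below.
tree = etree.parse(urlopen(req))

# Bind the h prefix to the XHTML namespace so the expression above works verbatim.
ns = {"h": "http://www.w3.org/1999/xhtml"}
hrefs = tree.xpath(
    "/h:html/h:body/h:div[@id='content']"
    "/h:div[@id='bodyContent']"
    "/h:table[@class='infobox vcard']"
    "/h:tr[h:th='Website']"
    "/h:td/h:a/@href",
    namespaces=ns,
)
print(hrefs[0] if hrefs else "not found")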
Also, it looks like Wikipedia pages are well-formed XML (despite being served as text/html). So, if you have an XML document with the page URLs, like:
<root>
<url>http://en.wikipedia.org/wiki/Vienna_University_of_Technology</url>
</root>
You could use:
document(/root/url)/h:html/h:body/h:div[@id='content']
/h:div[@id='bodyContent']
/h:table[@class='infobox vcard']
/h:tr[h:th='Website']
/h:td/h:a/@href
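Note that document() is an XSLT function, so this second expression needs an XSLT processor rather than a plain XPath 1.0 tool. As a rough equivalent under the same assumptions as the sketch above, you could loop over the <url> elements in Python (universities.xml is a hypothetical filename for the XML document shown above):

from urllib.request import urlopen, Request
from lxml import etree

ns = {"h": "http://www.w3.org/1999/xhtml"}
WEBSITE_XPATH = (
    "/h:html/h:body/h:div[@id='content']"
    "/h:div[@id='bodyContent']"
    "/h:table[@class='infobox vcard']"
    "/h:tr[h:th='Website']"
    "/h:td/h:a/@href"
)

# Read the list of page URLs from the XML document, then fetch and query each page.
for url in etree.parse("universities.xml").xpath("/root/url/text()"):
    req = Request(url, headers={"User-Agent": "xpath-example/0.1"})
    tree = etree.parse(urlopen(req))
    hrefs = tree.xpath(WEBSITE_XPATH, namespaces=ns)
    print(url, "->", hrefs[0] if hrefs else "not found")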
Source: https://stackoverflow.com/questions/4509191/how-to-use-xpath-or-xgrep-to-find-information-in-wikipedia