How to use XPath or xgrep to find information in Wikipedia?

Submitted by 半腔热情 on 2019-12-24 09:24:22

Question


I'd like to scrape some (not much) info from Wikipedia. Say I have a list of universities and their Wikipedia pages. Can I use an XPath expression to find the website (domain) of each university?

So for instance, if I get the page

curl http://en.wikipedia.org/wiki/Vienna_University_of_Technology 

the XPath expression should find this domain:

http://www.tuwien.ac.at

Ideally, this should work with the Linux xgrep command line tool, or equivalent.


Answer 1:


With the h prefix bound to the http://www.w3.org/1999/xhtml namespace URI:

/h:html/h:body/h:div[@id='content']
               /h:div[@id='bodyContent']
                /h:table[@class='infobox vcard']
                 /h:tr[h:th='Website']
                  /h:td/h:a/@href
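
For example, here is a minimal Python sketch of applying that expression. It is not part of the original answer; it assumes the requests and lxml libraries are installed, and the element ids/classes are simply the ones from the expression above, so they may not match current Wikipedia markup:

import requests
from lxml import etree

NS = {"h": "http://www.w3.org/1999/xhtml"}
XPATH = ("/h:html/h:body/h:div[@id='content']"
         "/h:div[@id='bodyContent']"
         "/h:table[@class='infobox vcard']"
         "/h:tr[h:th='Website']"
         "/h:td/h:a/@href")

resp = requests.get("http://en.wikipedia.org/wiki/Vienna_University_of_Technology")
# Parse as XML, relying on the page being well-formed XHTML as noted below.
doc = etree.fromstring(resp.content)
print(doc.xpath(XPATH, namespaces=NS))  # e.g. ['http://www.tuwien.ac.at']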

Also, it looks like Wikipedia pages are well-formed XML (despite the fact that they are served as text/html). So, if you have an XML document with the page URLs, like:

<root>
   <url>http://en.wikipedia.org/wiki/Vienna_University_of_Technology</url>
</root>

You could use:

document(/root/url)/h:html/h:body/h:div[@id='content']
                                  /h:div[@id='bodyContent']
                                   /h:table[@class='infobox vcard']
                                    /h:tr[h:th='Website']
                                     /h:td/h:a/@href
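
Note that document() comes from XSLT rather than core XPath 1.0, so this variant would typically be evaluated from an XSLT stylesheet (or adapted to doc() in XPath 2.0/XQuery), not from a plain command-line XPath tool. As a rough Python equivalent (again only a sketch, assuming a urls.xml file in the <root><url>…</url></root> format shown above):

import requests
from lxml import etree

NS = {"h": "http://www.w3.org/1999/xhtml"}
XPATH = ("/h:html/h:body/h:div[@id='content']/h:div[@id='bodyContent']"
         "/h:table[@class='infobox vcard']/h:tr[h:th='Website']"
         "/h:td/h:a/@href")

# Loop over every <url> in urls.xml and apply the same expression to each page.
for url in etree.parse("urls.xml").xpath("/root/url/text()"):
    page = etree.fromstring(requests.get(str(url)).content)
    print(url, page.xpath(XPATH, namespaces=NS))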


Source: https://stackoverflow.com/questions/4509191/how-to-use-xpath-or-xgrep-to-find-information-in-wikipedia
