Retrieve information from a URL

倾然丶 夕夏残阳落幕 submitted on 2019-12-10 11:08:50

Question


I want to make a program that will retrieve some information from a URL. For example, I give the URL below, from LibraryThing.

How can I retrieve all the words below the "TAGS" tab, like

Black Library, fantasy, Thanquol & Boneripper, Thanquol and Bone Ripper, Warhammer?

I am thinking of using Java and designing a data mining wrapper, but I am not sure how to start. Can anyone give me some advice?

EDIT: You gave me excellent help, but I want to ask something else. When we press the "number" button, we can see how many times each tag has been used. How can I retrieve that number as well?


Answer 1:


You could use an HTML parser like Jsoup. It allows you to select the HTML elements of interest using simple CSS selectors:

E.g.

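import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch the page, then select every tag link with a CSS selector
Document document = Jsoup.connect("http://www.librarything.com/work/9767358/78536487").get();
Elements tags = document.select(".tags .tag a");

for (Element tag : tags) {
    System.out.println(tag.text());
}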

which prints

Black Library
fantasy
Thanquol & Boneripper
Thanquol and Bone Ripper
Warhammer

Please note that you should read the website's robots.txt and terms of service, if it has them; otherwise your server might be IP-banned sooner or later.
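
As a rough illustration of the robots.txt check, here is a minimal sketch using only the Java standard library. It is deliberately simplistic: it looks only at `Disallow` rules in the `User-agent: *` group and does plain prefix matching, ignoring `Allow` lines, wildcards, and crawl delays. The class name, method names, and the sample robots.txt content are all made up for illustration and are not taken from LibraryThing.

```java
import java.util.ArrayList;
import java.util.List;

class RobotsCheckSketch {

    // Collects the Disallow prefixes that apply to all crawlers ("User-agent: *").
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> disallowed = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasUserAgent = false;
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Drop comments and surrounding whitespace.
            String line = rawLine.replaceFirst("#.*$", "").trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("user-agent")) {
                // Consecutive User-agent lines share one group of rules.
                boolean isWildcard = value.equals("*");
                inWildcardGroup = lastWasUserAgent ? (inWildcardGroup || isWildcard) : isWildcard;
                lastWasUserAgent = true;
            } else {
                lastWasUserAgent = false;
                if (field.equalsIgnoreCase("disallow") && inWildcardGroup && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
        return disallowed;
    }

    // A path counts as allowed if no collected Disallow prefix matches it.
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical robots.txt content, not copied from any real site.
        String sample = "User-agent: *\n"
                + "Disallow: /private/   # keep crawlers out\n"
                + "\n"
                + "User-agent: ExampleBot\n"
                + "Disallow: /\n";
        System.out.println(isAllowed(sample, "/work/9767358"));
        System.out.println(isAllowed(sample, "/private/data"));
    }
}
```

A check like this only covers robots.txt; the terms of service still have to be read by a person.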




Answer 2:


I've done this before in PHP by scraping the page and then parsing the HTML as a string using regular expressions.

Example here

I imagine there's something similar in Java and other languages. The concept would be similar:

  1. Load page data.
  2. Parse the data (e.g. with a regex, or via the DOM using CSS or XPath selectors).
  3. Do what you want with the data :)
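
In Java, those three steps might look roughly like the sketch below, which uses only the standard library and the regex route. Treat it as an illustration under assumptions rather than a working scraper: the HTML fragment is invented (the real LibraryThing markup is not shown in this thread and may differ), and the class name, method name, and pattern are hypothetical. Step 1 is replaced by a fixed string so the example runs offline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagScrapeSketch {

    // Hypothetical markup standing in for the downloaded page (step 1).
    static final String SAMPLE_HTML =
            "<div class=\"tags\">"
            + "<span class=\"tag\"><a href=\"/tag/Black+Library\">Black Library</a></span>"
            + "<span class=\"tag\"><a href=\"/tag/fantasy\">fantasy</a></span>"
            + "<span class=\"tag\"><a href=\"/tag/Warhammer\">Warhammer</a></span>"
            + "</div>";

    // Captures the link text inside each <span class="tag"><a ...>text</a>.
    private static final Pattern TAG_LINK =
            Pattern.compile("<span class=\"tag\">\\s*<a[^>]*>([^<]*)</a>");

    // Step 2: parse the page text and collect every captured tag name.
    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG_LINK.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group(1).trim());
        }
        return tags;
    }

    public static void main(String[] args) {
        List<String> tags = extractTags(SAMPLE_HTML);
        // Step 3: do what you want with the data; here, just print it.
        for (String tag : tags) {
            System.out.println(tag);
        }
    }
}
```

A regex like this breaks as soon as the markup changes slightly (extra attributes, nested elements), which is why a DOM parser with CSS or XPath selectors is usually the more robust choice for step 2.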

It's worth remembering that some people might not appreciate you data mining their site and profiting from or redistributing the data on a large scale.



Source: https://stackoverflow.com/questions/7822420/retrieve-information-from-a-url
