Question
I want to write a program that retrieves some information from a URL. For example, I give it the URL below, from LibraryThing.
How can I retrieve all the words below the "TAGS" tab, such as
Black Library fantasy Thanquol & Boneripper Thanquol and Bone Ripper Warhammer?
I am thinking of using Java and designing a data mining wrapper, but I am not sure how to start. Can anyone give me some advice?
EDIT: You gave me excellent help, but I want to ask something else. For every tag, we can see how many times it has been used when we press the "number" button. How can I retrieve that number as well?
Answer 1:
You could use an HTML parser like Jsoup. It lets you select the HTML elements of interest with simple CSS selectors, e.g.:
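import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// Inside a method that declares "throws IOException":
Document document = Jsoup.connect("http://www.librarything.com/work/9767358/78536487").get();
Elements tags = document.select(".tags .tag a");
for (Element tag : tags) {
    System.out.println(tag.text());
}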
which prints
Black Library
fantasy
Thanquol & Boneripper
Thanquol and Bone Ripper
Warhammer
Please note that you should read the website's robots.txt (if any) and its terms of service (if any), or your server might be IP-banned sooner or later.
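To give a rough idea of what respecting robots.txt could look like in code, here is a minimal sketch using only the Java standard library. The class name `RobotsCheck` and both method names are made up for this illustration. It only looks at `Disallow` prefixes in groups that apply to `User-agent: *` and ignores `Allow` rules, wildcards and crawl delays, so treat it as a starting point rather than a complete robots.txt parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jsoup or any other library.
class RobotsCheck {

    // Collects the Disallow path prefixes listed for "User-agent: *".
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasUserAgent = false;
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceAll("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("user-agent")) {
                // Consecutive User-agent lines belong to the same group.
                if (!lastWasUserAgent) {
                    inWildcardGroup = false;
                }
                if (value.equals("*")) {
                    inWildcardGroup = true;
                }
                lastWasUserAgent = true;
            } else {
                lastWasUserAgent = false;
                if (field.equalsIgnoreCase("disallow") && inWildcardGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // Simplified check: a path is allowed unless it starts with a disallowed prefix.
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

You would download the site's `/robots.txt` once, pass its text to `isAllowed` together with the path you intend to request, and skip the request if it returns false. The terms of service still have to be read by a human.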
Answer 2:
I've done this before in PHP by scraping the page and then parsing the HTML as a string with regular expressions.
Example here
I imagine there's something similar in Java and other languages. The concept would be the same:
- Load page data.
- Parse the data (e.g. with a regex, or via the DOM using CSS or XPath selectors).
- Do what you want with the data :)
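The three steps above could be sketched in Java with only the standard library (Java 11+ for `java.net.http`). This is a rough illustration of the regex route, not a robust scraper: the class name `TagScraper` is made up, and the pattern assumes each tag is marked up as `<span class="tag"><a ...>label</a></span>`, which is a guess and not taken from the actual page. A regex like this breaks as soon as the markup changes, which is why a real HTML parser is usually the safer choice.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; names and markup are assumptions.
class TagScraper {

    // Assumed markup: <span class="tag"><a href="...">label</a></span>
    private static final Pattern TAG_LINK =
            Pattern.compile("<span class=\"tag\">\\s*<a[^>]*>([^<]*)</a>");

    // Step 1: load the page data as a string.
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Step 2: parse the data, here with a regex over the raw HTML.
    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG_LINK.matcher(html);
        while (matcher.find()) {
            // Only the &amp; entity is decoded in this sketch.
            tags.add(matcher.group(1).trim().replace("&amp;", "&"));
        }
        return tags;
    }

    // Step 3: do what you want with the data; here it is just printed.
    public static void main(String[] args) throws Exception {
        String html = args.length > 0
                ? fetch(args[0])
                : "<span class=\"tag\"><a href=\"#\">fantasy</a></span>";
        for (String tag : extractTags(html)) {
            System.out.println(tag);
        }
    }
}
```

Run without arguments it parses the built-in sample string; pass a URL as the first argument to fetch a live page instead.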
It's worth remembering that some people might not appreciate you data mining their site and profiting from or redistributing the data on a large scale.
Source: https://stackoverflow.com/questions/7822420/retrieve-information-from-a-url