Question
I'm working on a project and I need to do a lot of screen scraping to get a lot of data as fast as possible. I'm wondering if anyone knows of any good APIs or resources to help me out.
I'm using Java, by the way.
Here's what my workflow has been so far:
- Connect to a website (using HttpComponents from Apache)
- The website contains a section with a bunch of links that I need to visit (I'm using the built-in Java HTML parser to figure out which links to follow, and the resulting code is annoying and messy)
- Visit all the links that I found
- For each link that I visit, there's more data that I need to extract, spread out over multiple pages, so I may need to visit more links
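The workflow above can be sketched with just the JDK (Java 11+ `java.net.http`); the start URL is a placeholder and the regex-based link extraction is deliberately naive, which is exactly the "annoying and messy" part a real HTML parser would replace:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveCrawler {
    // Naive href extractor; a real parser is far more robust than this.
    private static final Pattern HREF = Pattern.compile("href=[\"']([^\"'#]+)[\"']");

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Deque<String> toVisit = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        toVisit.push("https://example.com/");       // placeholder start page

        // Depth-first: pop the most recently discovered link first; capped for safety.
        while (!toVisit.isEmpty() && seen.size() < 50) {
            String url = toVisit.pop();
            if (!seen.add(url)) continue;           // skip already-visited pages
            HttpResponse<String> resp = client.send(
                HttpRequest.newBuilder(URI.create(url)).build(),
                HttpResponse.BodyHandlers.ofString());
            for (String link : extractLinks(resp.body())) {
                // Resolve relative hrefs against the current page's URL.
                String absolute = URI.create(url).resolve(link).toString();
                if (!seen.contains(absolute)) toVisit.push(absolute);
            }
        }
        System.out.println("Visited " + seen.size() + " pages");
    }
}
```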
Thoughts:
- Does anyone know of any higher-level / more intelligent HTML parsers than the built-in Java one?
- Basically it's a depth-first search. I imagine I would like to make this multithreaded at some point so I can visit some of these links in parallel.
- Maybe what I'm really looking for is a multithreaded web crawling library
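One way to sketch the multithreaded version of that depth-first crawl is an `ExecutorService` pool with a concurrent visited set; the page-fetching step is abstracted as a function so the skeleton works with any HTTP client. This is a minimal sketch, not a full crawler (no politeness delays, error handling, or depth limits):

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class ParallelCrawler {
    private final ExecutorService pool;
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final AtomicInteger pending = new AtomicInteger();
    // Fetches a page and returns the links found on it (plug in any HTTP client here).
    private final Function<String, List<String>> fetchLinks;

    ParallelCrawler(int threads, Function<String, List<String>> fetchLinks) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.fetchLinks = fetchLinks;
    }

    Set<String> crawl(String start) throws InterruptedException {
        submit(start);
        // pending only reaches zero once every task (and its children) has finished,
        // because a task submits its children before decrementing the counter.
        while (pending.get() > 0) Thread.sleep(10);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return visited;
    }

    private void submit(String url) {
        if (!visited.add(url)) return;      // another thread already claimed this URL
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                for (String link : fetchLinks.apply(url)) submit(link);
            } finally {
                pending.decrementAndGet();
            }
        });
    }
}
```

The key design point is that tasks never block waiting on their child tasks, so a fixed-size pool can't deadlock on itself.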
If you haven't figured it out yet, this is my first time messing around with this, so I'm having a difficult time articulating exactly what my needs are. I would greatly appreciate any input from those of you who have done this before.
Answer 1:
I've found JSoup really good for HTML parsing.
For more pointers, check out this article: How to write a multi-threaded webcrawler
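Jsoup replaces the hand-rolled link extraction with CSS selectors. A minimal sketch (the URL is a placeholder):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        // Fetch and parse a page in one call.
        Document doc = Jsoup.connect("https://example.com/").get();

        // Select every anchor with an href attribute; absUrl() resolves
        // relative hrefs against the page's base URL.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.absUrl("href") + " -> " + link.text());
        }
    }
}
```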
Answer 2:
I used Bixo to extract hyperlinks and images with a depth-first search. It's built on top of Hadoop and Cascading, so there's a learning curve, but the provided example is a good enough starting point for configuring your own changes.
Answer 3:
Try using the Web-Harvest project.
Answer 4:
Check out JSR-237 for work management, which is a useful idea when going multithreaded.
As for scraping, there are several alternatives. If ease of use is most important, I'd recommend HtmlUnit. Beyond that, you'll have to roll your own.
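A minimal HtmlUnit sketch of the same fetch-and-extract-links step (the URL is a placeholder, and the `com.gargoylesoftware` package name is the classic 2.x artifact; newer releases moved to `org.htmlunit`):

```java
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitExample {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Plain scraping: skip JavaScript execution for speed.
            webClient.getOptions().setJavaScriptEnabled(false);

            HtmlPage page = webClient.getPage("https://example.com/"); // placeholder URL
            List<HtmlAnchor> anchors = page.getAnchors();
            for (HtmlAnchor a : anchors) {
                // Resolve each href against the page to get an absolute URL.
                System.out.println(page.getFullyQualifiedUrl(a.getHrefAttribute()));
            }
        }
    }
}
```

HtmlUnit's main advantage over a bare parser is that it behaves like a headless browser, so it can also handle forms, cookies, and (if enabled) JavaScript-generated content.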
Source: https://stackoverflow.com/questions/4079784/web-scraping-screen-scraping-data-mining-tips