Crawler in Groovy (JSoup VS Crawler4j)

Front-end · 1 answer · 1641 views

悲哀的现实 · 2021-02-09 01:02

I wish to develop a web crawler in Groovy (using the Grails framework and a MongoDB database) that has the ability to crawl a website, creating a list of the site's URLs and their resource types.

1 Answer

挽巷 (OP) · 2021-02-09 01:32

Crawler4j is a crawler; jsoup is a parser. Actually, you could (and should) use both. Crawler4j provides an easy multithreaded interface for fetching all the URLs and all the pages (content) of the site you want. After that, you can use jsoup to parse the data with its excellent (jQuery-like) CSS selectors and actually do something with it. Of course, you have to consider dynamic (JavaScript-generated) content. If you want that content too, you have to use something that includes a JavaScript engine (a headless browser plus parser), such as HtmlUnit or WebDriver (Selenium), which will execute the JavaScript before the content is parsed.
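A minimal sketch of the combination described above: Crawler4j drives the multithreaded crawl, and jsoup parses each fetched page. This assumes crawler4j 4.x and jsoup on the classpath; the @Grab coordinates, the seed URL, the storage folder, and the SiteCrawler class name are all illustrative choices, not part of the original answer.

```groovy
@Grab('edu.uci.ics:crawler4j:4.4.0')
@Grab('org.jsoup:jsoup:1.15.4')

import edu.uci.ics.crawler4j.crawler.CrawlConfig
import edu.uci.ics.crawler4j.crawler.CrawlController
import edu.uci.ics.crawler4j.crawler.Page
import edu.uci.ics.crawler4j.crawler.WebCrawler
import edu.uci.ics.crawler4j.fetcher.PageFetcher
import edu.uci.ics.crawler4j.parser.HtmlParseData
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
import edu.uci.ics.crawler4j.url.WebURL
import org.jsoup.Jsoup

class SiteCrawler extends WebCrawler {
    // Only follow links that stay on the target site
    @Override
    boolean shouldVisit(Page referringPage, WebURL url) {
        url.getURL().toLowerCase().startsWith('https://example.com/')
    }

    // Crawler4j has already fetched the page; hand the raw HTML to jsoup
    @Override
    void visit(Page page) {
        if (page.parseData instanceof HtmlParseData) {
            String html = ((HtmlParseData) page.parseData).html
            def doc = Jsoup.parse(html, page.webURL.getURL())
            // jQuery-like CSS selector: collect every absolute link on the page
            def links = doc.select('a[href]')*.attr('abs:href')
            println "${page.webURL.getURL()} -> ${links.size()} links"
        }
    }
}

// Standard crawler4j bootstrap: config, fetcher, robots.txt handling, controller
def config = new CrawlConfig(crawlStorageFolder: '/tmp/crawl', maxDepthOfCrawling: 2)
def fetcher = new PageFetcher(config)
def robotsServer = new RobotstxtServer(new RobotstxtConfig(), fetcher)
def controller = new CrawlController(config, fetcher, robotsServer)
controller.addSeed('https://example.com/')
controller.start(SiteCrawler, 4)  // run with 4 crawler threads
```

Inside `visit()` you would persist the URL and parsed data (e.g. into MongoDB via the Grails domain layer) instead of just printing; `shouldVisit()` is the natural place to keep the crawl scoped to one domain.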
