Crawler in Groovy (JSoup VS Crawler4j)

Front-end · 1 answer · 1641 views

悲哀的现实 · 2021-02-09 01:02

I wish to develop a web crawler in Groovy (using the Grails framework and a MongoDB database) that has the ability to crawl a website, creating a list of the site's URLs and their resource types.

1 Answer

挽巷 (OP) · 2021-02-09 01:32

Crawler4j is a crawler; jsoup is a parser. Actually, you could (and should) use both. Crawler4j provides an easy multithreaded interface for fetching all the URLs and all the pages (content) of the site you want. After that, you can use jsoup to parse the data with its excellent (jQuery-like) CSS selectors and actually do something with it. Of course, you have to consider dynamic (JavaScript-generated) content. If you want that content too, you have to use something that includes a JavaScript engine (a headless browser plus parser), such as HtmlUnit or WebDriver (Selenium), which will execute the JavaScript before the content is parsed.
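A minimal sketch of the combination described above: Crawler4j drives the multithreaded crawl, and jsoup parses each fetched page. This assumes crawler4j 4.x and jsoup on the classpath; the @Grab coordinates, the seed URL, the storage folder, and the SiteCrawler class name are all illustrative choices, not part of the original answer.

```groovy
@Grab('edu.uci.ics:crawler4j:4.4.0')
@Grab('org.jsoup:jsoup:1.15.4')

import edu.uci.ics.crawler4j.crawler.CrawlConfig
import edu.uci.ics.crawler4j.crawler.CrawlController
import edu.uci.ics.crawler4j.crawler.Page
import edu.uci.ics.crawler4j.crawler.WebCrawler
import edu.uci.ics.crawler4j.fetcher.PageFetcher
import edu.uci.ics.crawler4j.parser.HtmlParseData
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
import edu.uci.ics.crawler4j.url.WebURL
import org.jsoup.Jsoup

class SiteCrawler extends WebCrawler {
    // Only follow links that stay on the target site
    @Override
    boolean shouldVisit(Page referringPage, WebURL url) {
        url.getURL().toLowerCase().startsWith('https://example.com/')
    }

    // Crawler4j has already fetched the page; hand the raw HTML to jsoup
    @Override
    void visit(Page page) {
        if (page.parseData instanceof HtmlParseData) {
            String html = ((HtmlParseData) page.parseData).html
            def doc = Jsoup.parse(html, page.webURL.getURL())
            // jQuery-like CSS selector: collect every absolute link on the page
            def links = doc.select('a[href]')*.attr('abs:href')
            println "${page.webURL.getURL()} -> ${links.size()} links"
        }
    }
}

// Standard crawler4j bootstrap: config, fetcher, robots.txt handling, controller
def config = new CrawlConfig(crawlStorageFolder: '/tmp/crawl', maxDepthOfCrawling: 2)
def fetcher = new PageFetcher(config)
def robotsServer = new RobotstxtServer(new RobotstxtConfig(), fetcher)
def controller = new CrawlController(config, fetcher, robotsServer)
controller.addSeed('https://example.com/')
controller.start(SiteCrawler, 4)  // run with 4 crawler threads
```

Inside `visit()` you would persist the URL and parsed data (e.g. into MongoDB via the Grails domain layer) instead of just printing; `shouldVisit()` is the natural place to keep the crawl scoped to one domain.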
