I'm trying to crawl about a thousand websites, and I'm only interested in their HTML content.
Then I transform the HTML into XML to be parsed with XPath to ex
I would suggest writing your own crawler in Python, using Scrapy together with either lxml or BeautifulSoup. You should find a few good tutorials for those on Google. I use Scrapy + lxml at work to spider ~600 websites, checking for broken links.
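
In case a concrete starting point helps, here is a minimal sketch of just the lxml part (not my actual work code). It assumes you already have the page's HTML as a string; the function name `extract_links` and the example.com URLs are made up for illustration. lxml's HTML parser copes with most real-world markup and lets you run XPath on the result directly, so a separate HTML-to-XML conversion step may not be needed.

```python
from urllib.parse import urljoin

import lxml.html


def extract_links(html_text, base_url):
    """Return absolute URLs for every <a href="..."> found in an HTML string.

    `extract_links` is a hypothetical helper name, not part of lxml or Scrapy.
    """
    # lxml.html.fromstring builds an element tree from (possibly messy) HTML.
    tree = lxml.html.fromstring(html_text)
    # XPath runs directly on the parsed HTML tree.
    hrefs = tree.xpath("//a/@href")
    # Resolve relative links against the URL the page was fetched from.
    return [urljoin(base_url, str(href)) for href in hrefs]


if __name__ == "__main__":
    # Made-up sample page, standing in for a downloaded response body.
    sample = (
        "<html><body>"
        '<a href="/about">About</a>'
        '<a href="http://example.org/x">External</a>'
        "</body></html>"
    )
    for url in extract_links(sample, "http://example.com/"):
        print(url)
```

In a Scrapy spider the fetching, scheduling and retrying would be handled by Scrapy itself, and you would run the same kind of XPath expression on each response in your parse callback.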