I am conducting research related to distributing the indexing of the internet.
While several such projects exist (IRLbot, Distributed-indexing, Cluster-Scrapy, C
What you are looking for here is Scrapy's CrawlSpider.
CrawlSpider lets you define crawling rules that are followed on every page. It's smart enough to avoid crawling images, documents and other files that are not web resources, and it pretty much does the whole thing for you.
Here's a good example of how your spider might look with CrawlSpider:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'crawlspider'
    start_urls = ['http://scrapy.org']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['title'] = response.meta['link_text']
        # extracting basic body
        item['body'] = '\n'.join(response.xpath('//text()').extract())
        # or better just save whole source
        item['source'] = response.body
        return item
This spider will crawl every webpage it can find on the website and yield an item with the URL, title and full text body.
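If the spider lives in a Scrapy project, the built-in feed exports can save those items for you, e.g. scrapy crawl crawlspider -o pages.jl (pages.jl is just an example filename; the .jl extension writes one JSON object per line, which is convenient for later indexing).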
For the text body you might want to extract it in some smarter way (to exclude JavaScript and other unwanted text nodes), but that's a topic of its own.
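If you do want cleaner text, here is a minimal sketch (the helper name is just an example) that keeps only text nodes sitting outside script and style tags:

def extract_visible_text(response):
    # keep text nodes that are not inside <script> or <style>, drop pure whitespace
    texts = response.xpath(
        '//body//text()[not(ancestor::script) and not(ancestor::style)]'
    ).extract()
    return '\n'.join(t.strip() for t in texts if t.strip())

Then call item['body'] = extract_visible_text(response) inside parse_item.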
Actually, for what you are describing you probably want to save the full HTML source rather than text only, since unstructured text is useless for any sort of analytics or indexing.
There's also a bunch of Scrapy settings that can be adjusted for this type of crawling. They are nicely described on the Broad Crawls docs page.
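For illustration only, a settings.py along the lines of that page might look like this (the values are starting points to tune for your hardware, not recommendations):

# settings.py -- illustrative broad-crawl tuning
CONCURRENT_REQUESTS = 100           # raise global concurrency across many small sites
REACTOR_THREADPOOL_MAXSIZE = 20     # more threads for DNS resolution
LOG_LEVEL = 'INFO'                  # DEBUG logging is too noisy at this scale
COOKIES_ENABLED = False             # broad crawls rarely need session state
RETRY_ENABLED = False               # don't stall on flaky hosts
DOWNLOAD_TIMEOUT = 15               # give up on slow responses quickly
REDIRECT_ENABLED = False            # optionally drop redirects instead of following them
AJAXCRAWL_ENABLED = True            # handle legacy AJAX-crawlable pages

# crawl breadth-first to keep the request queue memory-friendly
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'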