Scraping all text using Scrapy without knowing webpages' structure

长发绾君心 2021-02-09 05:44

I am conducting research on distributing the indexing of the internet.

While several such projects exist (IRLbot, Distributed-indexing, Cluster-Scrapy, C

1 answer
  •  慢半拍i (original poster)
    2021-02-09 05:48

    What you are looking for here is Scrapy's CrawlSpider.

    CrawlSpider lets you define crawling rules that are applied to every page. Its default LinkExtractor already skips links to images, documents and other files that are not web pages, so it pretty much does the whole thing for you.

    Here's an example of how your spider might look with CrawlSpider:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MySpider(CrawlSpider):
        name = 'crawlspider'
        start_urls = ['http://scrapy.org']

        # follow every extracted link and pass each response to parse_item
        rules = (
            Rule(LinkExtractor(), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            item = dict()
            item['url'] = response.url
            # anchor text of the link that led to this page (set by CrawlSpider)
            item['title'] = response.meta.get('link_text')
            # extracting basic body
            item['body'] = '\n'.join(response.xpath('//text()').getall())
            # or better just save whole source
            item['source'] = response.body
            return item
    

    This spider will crawl every webpage it can reach from the start URL and yield the title (taken from the link text), the URL and the whole text body for each one. Since no allowed_domains is set, it will also follow links to other websites.
    For the text body you might want to extract it in some smarter way (to exclude JavaScript and other unwanted text nodes), but that's an issue of its own to discuss. Actually, for what you are describing you probably want to save the full HTML source rather than text only, since unstructured text is of little use for any sort of analytics or indexing.
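
    As a rough illustration of "some smarter way", here is a minimal sketch that uses only Python's standard library to drop text inside script and style tags. The names VisibleTextParser and visible_text are hypothetical helpers, not part of Scrapy, and the list of skipped tags is an assumption you may want to adjust:

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping those nested inside non-visible tags."""

    # Assumed set of tags whose text content is not meant for readers.
    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside every skipped tag.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the reader-visible text of an HTML string, one node per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

    Inside parse_item this could be called as visible_text(response.text) in place of the '//text()' XPath. It is only a starting point: it does not detect elements hidden with CSS or strip navigation and other page chrome.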

    There are also a bunch of Scrapy settings that can be adjusted for this type of crawling. They are described very nicely on the Broad Crawls docs page.
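
    For orientation, the kind of settings that page discusses might look like the sketch below. The dictionary name BROAD_CRAWL_SETTINGS is hypothetical, and the values are starting points to be tuned to your hardware and politeness requirements, not recommendations made in this answer:

```python
# Settings commonly adjusted for broad crawls; tune the numbers to
# your own CPU, memory and bandwidth before relying on them.
BROAD_CRAWL_SETTINGS = {
    # Priority queue that works better when crawling many domains in parallel.
    "SCHEDULER_PRIORITY_QUEUE": "scrapy.pqueues.DownloaderAwarePriorityQueue",
    # Raise global concurrency well above the default.
    "CONCURRENT_REQUESTS": 100,
    # More threads for DNS resolution.
    "REACTOR_THREADPOOL_MAXSIZE": 20,
    # Less verbose logging saves CPU on large crawls.
    "LOG_LEVEL": "INFO",
    # Cookies are rarely needed when indexing public pages.
    "COOKIES_ENABLED": False,
    # Skip retries and give up on slow sites sooner.
    "RETRY_ENABLED": False,
    "DOWNLOAD_TIMEOUT": 15,
    # Do not follow redirects immediately.
    "REDIRECT_ENABLED": False,
}
```

    One way to apply it is to assign the dictionary to the spider's custom_settings class attribute, or to copy the entries into the project's settings.py.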
