Scrapy CrawlSpider doesn't crawl the first landing page

别那么骄傲 2020-11-30 05:55

I am new to Scrapy and am working on a scraping exercise using CrawlSpider. The Scrapy framework works beautifully and follows the relevant links, but the spider does not seem to crawl the first landing page.

2 answers
  • 2020-11-30 06:35

    There are a number of ways of doing this, but one of the simplest is to implement parse_start_url and then modify start_urls:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class DownloadSpider(CrawlSpider):
        name = 'downloader'
        allowed_domains = ['bnt-chemicals.de']
        start_urls = ["http://www.bnt-chemicals.de/tunnel/index.htm"]
        rules = (
            Rule(LinkExtractor(allow='prod'), callback='parse_item', follow=True),
        )
        fname = 1

        def parse_start_url(self, response):
            # CrawlSpider calls this for responses to start_urls;
            # hand them to the same callback the rule uses.
            return self.parse_item(response)

        def parse_item(self, response):
            # Append URL, crawl depth and page source to a numbered file.
            # 'depth' may not be set yet on the start response, so default to 0.
            with open('%s.txt' % self.fname, 'a', encoding='utf-8') as f:
                f.write('%s,%s\n' % (response.url, response.meta.get('depth', 0)))
                f.write(response.text)
                f.write('\n')
            self.fname += 1
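
To see why overriding parse_start_url is enough, here is a toy model of the dispatch, written without Scrapy so it runs on its own. It is only a sketch: ToyCrawlSpider, FixedSpider, SITE and the link graph are made up for illustration and are not Scrapy's real code. The one fact it mirrors is that CrawlSpider sends start-URL responses to parse_start_url (which returns nothing by default), while rule callbacks only see pages reached through extracted links.

```python
from collections import deque


class ToyCrawlSpider:
    """Toy stand-in for CrawlSpider's dispatch; not Scrapy's real code."""

    def __init__(self, site, start_url, allow):
        self.site = site            # hypothetical link graph: url -> linked urls
        self.start_url = start_url
        self.allow = allow          # substring a link must contain to match the rule

    def parse_start_url(self, url):
        # Like CrawlSpider's default: the start response yields no items.
        return []

    def parse_item(self, url):
        # Rule callback: here it only records the URL it was called for.
        return [url]

    def crawl(self):
        items = list(self.parse_start_url(self.start_url))
        seen = {self.start_url}
        queue = deque([self.start_url])
        while queue:
            page = queue.popleft()
            for link in self.site.get(page, []):
                if self.allow in link and link not in seen:
                    seen.add(link)
                    items.extend(self.parse_item(link))  # rule callback
                    queue.append(link)                   # follow=True
        return items


class FixedSpider(ToyCrawlSpider):
    def parse_start_url(self, url):
        # Route the landing page through the same callback as the rule.
        return self.parse_item(url)


SITE = {
    "/index.htm": ["/prod/a.htm", "/about.htm"],
    "/prod/a.htm": ["/prod/b.htm", "/index.htm"],
    "/prod/b.htm": [],
}

default_items = ToyCrawlSpider(SITE, "/index.htm", "prod").crawl()
fixed_items = FixedSpider(SITE, "/index.htm", "prod").crawl()
print(default_items)  # ['/prod/a.htm', '/prod/b.htm']
print(fixed_items)    # ['/index.htm', '/prod/a.htm', '/prod/b.htm']
```

With the default parse_start_url the landing page is used only as a source of links; with the override it is processed like every other page.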
    
  • 2020-11-30 06:41

    Just change your callback to parse_start_url and override it:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class DownloadSpider(CrawlSpider):
        name = 'downloader'
        allowed_domains = ['bnt-chemicals.de']
        start_urls = [
            "http://www.bnt-chemicals.de",
        ]
        rules = (
            Rule(LinkExtractor(allow='prod'), callback='parse_start_url', follow=True),
        )
        fname = 0

        def parse_start_url(self, response):
            # Handles the start URL as well as every link matched by the rule.
            self.fname += 1
            fname = '%s.txt' % self.fname

            with open(fname, 'w', encoding='utf-8') as f:
                f.write('%s, %s\n' % (response.url, response.meta.get('depth', 0)))
                # response.text is the decoded body; response.body is raw bytes.
                f.write('%s\n' % response.text)
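
The file-writing step in both answers can be pulled into a small helper that is easy to test without running a crawl. This is a minimal sketch, not part of Scrapy: save_page and its parameters are hypothetical names, and it assumes the page is UTF-8 (undecodable bytes are replaced). It also covers the case where the body arrives as bytes, which under Python 3 would otherwise be written as a b'...' literal.

```python
import tempfile
from pathlib import Path


def save_page(directory, index, url, depth, body):
    """Write one crawled page to '<index>.txt' and return the file path."""
    # A raw response body is bytes under Python 3; decode it so the file
    # holds the page text rather than a b'...' representation.
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    path = Path(directory) / f"{index}.txt"
    with path.open("w", encoding="utf-8") as f:
        f.write(f"{url}, {depth}\n")   # header line: URL and crawl depth
        f.write(f"{body}\n")           # page source
    return path


# Demo with made-up page content, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_page(tmp, 1, "http://www.bnt-chemicals.de", 0, b"<html></html>")
    demo_name = saved.name
    demo_text = saved.read_text(encoding="utf-8")

print(demo_name)
print(demo_text)
```

Inside a spider callback this could be called as save_page('.', self.fname, response.url, response.meta.get('depth', 0), response.body).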
    