How to crawl an entire website with Scrapy?

长发绾君心 2021-01-31 12:17

I'm unable to crawl a whole website; Scrapy only crawls the surface, and I want it to crawl deeper. I've been googling for the last 5-6 hours with no help. My code is below:



        
2 Answers
  • 2021-01-31 12:42

    Rules short-circuit: the first rule a link satisfies is the one that gets applied, so your second Rule (the one with the callback) will never be called.

    Change your rules to this:

    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]
    
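    The first-match-wins behavior can be sketched in plain Python (a simulation of the rule-matching idea, not Scrapy's actual implementation; the `Rule` and `dispatch` names here are made up for illustration): each link is handled by the first rule whose pattern matches it, so a catch-all first rule without a callback shadows every rule after it.

    ```python
    import re

    # Hypothetical stand-in for Scrapy's Rule: a link pattern plus an optional callback.
    class Rule:
        def __init__(self, pattern, callback=None):
            self.pattern = re.compile(pattern)
            self.callback = callback

    def dispatch(link, rules):
        """Return the callback of the FIRST rule whose pattern matches the link."""
        for rule in rules:
            if rule.pattern.search(link):
                return rule.callback
        return None

    # A catch-all rule first: the second rule (with callback) is never reached.
    bad = [Rule(r"."), Rule(r"/tutorials", callback="parse_item")]
    print(dispatch("/tutorials?page=2", bad))   # None - the callback never fires

    # A single rule that both follows links and has a callback is the fix.
    good = [Rule(r".", callback="parse_item")]
    print(dispatch("/tutorials?page=2", good))  # parse_item
    ```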
  • 2021-01-31 12:51

    When parsing the start_urls, deeper URLs can be extracted from the href attributes of anchor tags, and deeper requests can then be yielded from parse(). Here is a simple example; the most important code is shown below:

    from scrapy.spiders import Spider
    from scrapy.http import Request
    from tutsplus.items import TutsplusItem
    import re

    class MySpider(Spider):
        name            = "tutsplus"
        allowed_domains = ["code.tutsplus.com"]
        start_urls      = ["http://code.tutsplus.com/"]

        # Links crawled so far; a class-level set so it is shared
        # across parse() calls (a local list would be recreated on
        # every response and never deduplicate anything)
        crawled_links = set()

        # Pattern for the links we want to follow
        # (only tutorial listing pages)
        link_pattern = re.compile(r"^/tutorials\?page=\d+")

        def parse(self, response):
            links = response.xpath('//a/@href').extract()

            for link in links:
                # If it is a proper link and has not been seen yet, follow it
                if self.link_pattern.match(link) and link not in self.crawled_links:
                    self.crawled_links.add(link)
                    yield Request(response.urljoin(link), callback=self.parse)

            titles = response.xpath('//a[contains(@class, "posts__post-title")]/h1/text()').extract()
            for title in titles:
                item = TutsplusItem()
                item["title"] = title
                yield item
    