Scrapy needs to crawl all the next links on website and move on to the next page


Question


I need my Scrapy spider to follow the pagination and move on to the next page. How should I write the rule to do that?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from delh.items import DelhItem

class criticspider(CrawlSpider):
    name ="delh"
    allowed_domains =["consumercomplaints.in"]
    #start_urls =["http://www.consumercomplaints.in/?search=delhivery&page=2","http://www.consumercomplaints.in/?search=delhivery&page=3","http://www.consumercomplaints.in/?search=delhivery&page=4","http://www.consumercomplaints.in/?search=delhivery&page=5","http://www.consumercomplaints.in/?search=delhivery&page=6","http://www.consumercomplaints.in/?search=delhivery&page=7","http://www.consumercomplaints.in/?search=delhivery&page=8","http://www.consumercomplaints.in/?search=delhivery&page=9","http://www.consumercomplaints.in/?search=delhivery&page=10","http://www.consumercomplaints.in/?search=delhivery&page=11"]
    start_urls=["http://www.consumercomplaints.in/?search=delhivery"]
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]/a/@href',)),
             callback="parse_gen", follow=True),
    )
    def parse_gen(self,response):
        hxs = Selector(response)
        sites = hxs.select('//table[@width="100%"]')
        items = []

        for site in sites:
            item = DelhItem()
            item['title'] = site.select('.//td[@class="complaint"]/a/span/text()').extract()
            item['content'] = site.select('.//td[@class="compl-text"]/div/text()').extract()
            items.append(item)
        return items
spider=criticspider()

Answer 1:


From my understanding you are trying to scrape two sorts of pages, hence you should use two distinct rules:

  • paginated list pages, containing links to n item pages and to subsequent list pages
  • item pages, from which you scrape your items

Your rules should then look something like:

rules = (
    Rule(LinkExtractor(restrict_xpaths='{{ item selector }}'), callback='parse_gen'),
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]/@href')),
)

Explanations:

  • The first rule matches item links and uses your item parsing method (parse_gen) as its callback. The resulting responses do not go through these rules again.
  • The second rule matches the "pagelinks" and does not specify a callback; the resulting responses will therefore be handled by these rules again. A full-spider sketch combining both rules follows below.
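
Putting the two rules into the original spider, a complete version might look something like the sketch below. The item-link XPath ('//td[@class="complaint"]/a') is an assumption derived from the selectors already used in parse_gen, not something given in the answer, so verify it against the actual markup; likewise, if the item links lead to detail pages with different markup, the selectors inside parse_gen need to be adapted.

# Sketch only: combines the two rules above with the original spider.
# The item-link XPath is an assumption based on the question's selectors.
from scrapy.linkextractors import LinkExtractor  # on Scrapy < 1.0: scrapy.contrib.linkextractors
from scrapy.spiders import CrawlSpider, Rule     # on Scrapy < 1.0: scrapy.contrib.spiders

from delh.items import DelhItem


class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        # Item links: parse them with parse_gen; these responses are not
        # run through the rules again.
        Rule(LinkExtractor(restrict_xpaths='//td[@class="complaint"]/a'),
             callback="parse_gen"),
        # Pagination: follow only the "Next" link (no callback), so each
        # list page is requested exactly once and its response is matched
        # against the rules again.
        Rule(LinkExtractor(restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]')),
    )

    def parse_gen(self, response):
        # Selectors taken from the question; they target the list-page
        # layout and may need adjusting if the item pages differ.
        for site in response.xpath('//table[@width="100%"]'):
            item = DelhItem()
            item["title"] = site.xpath('.//td[@class="complaint"]/a/span/text()').extract()
            item["content"] = site.xpath('.//td[@class="compl-text"]/div/text()').extract()
            yield item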

Notes:

  • SgmlLinkExtractor is obsolete and you should use LxmlLinkExtractor (or its alias LinkExtractor) instead (source); the import swap is sketched just after this list.
  • The order in which you send out your requests does matter and, in this sort of situation (scraping an unknown, potentially large, number of pages/items), you should seek to reduce the number of pages being processed at any given time. To this end I've modified your code in two ways:
    • Scrape the items from the current list page before requesting the next one; this is why the item rule comes before the "pagelinks" rule.
    • Avoid crawling a page several times over; this is why I added the [contains(text(), "Next")] selector to the "pagelinks" rule. This way each "list page" gets requested exactly once.
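
On the first point, the change is essentially just the import; which module path applies depends on the Scrapy version (the scrapy.contrib path matches the question's imports, the scrapy.linkextractors path is the one used from Scrapy 1.0 on):

# Deprecated SGML-based extractor (what the question imports):
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Lxml-based replacement; LinkExtractor is an alias for LxmlLinkExtractor:
from scrapy.contrib.linkextractors import LinkExtractor    # Scrapy 0.24.x
# from scrapy.linkextractors import LinkExtractor          # Scrapy >= 1.0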


Source: https://stackoverflow.com/questions/28102472/scrapy-needs-to-crawl-all-the-next-links-on-website-and-move-on-to-the-next-page
