Scrapy needs to crawl all the next links on website and move on to the next page


Question


I need my Scrapy spider to follow the pagination and move on to the next page. How should I write the rule to do that?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from delh.items import DelhItem

class criticspider(CrawlSpider):
    name ="delh"
    allowed_domains =["consumercomplaints.in"]
    #start_urls =["http://www.consumercomplaints.in/?search=delhivery&page=2","http://www.consumercomplaints.in/?search=delhivery&page=3","http://www.consumercomplaints.in/?search=delhivery&page=4","http://www.consumercomplaints.in/?search=delhivery&page=5","http://www.consumercomplaints.in/?search=delhivery&page=6","http://www.consumercomplaints.in/?search=delhivery&page=7","http://www.consumercomplaints.in/?search=delhivery&page=8","http://www.consumercomplaints.in/?search=delhivery&page=9","http://www.consumercomplaints.in/?search=delhivery&page=10","http://www.consumercomplaints.in/?search=delhivery&page=11"]
    start_urls=["http://www.consumercomplaints.in/?search=delhivery"]
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]/a/@href',)),
             callback="parse_gen", follow=True),
    )
    def parse_gen(self,response):
        hxs = Selector(response)
        sites = hxs.select('//table[@width="100%"]')
        items = []

        for site in sites:
            item = DelhItem()
            item['title'] = site.select('.//td[@class="complaint"]/a/span/text()').extract()
            item['content'] = site.select('.//td[@class="compl-text"]/div/text()').extract()
            items.append(item)
        return items
spider=criticspider()

Answer 1:


From my understanding you are trying to scrape two sorts of pages, hence you should use two distinct rules:

  • paginated list pages, containing links to n item pages and to subsequent list pages
  • item pages, from which you scrape your items

Your rules should then look something like:

rules = (
    Rule(LinkExtractor(restrict_xpaths='{{ item selector }}'), callback='parse_gen'),
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]/@href')),
)

Explanations:

  • The first rule matches item links and uses your item parsing method (parse_gen) as its callback. The resulting responses do not go through these rules again.
  • The second rule matches the "pagelinks" and does not specify a callback; the resulting responses will therefore be handled by these rules again. A full-spider sketch combining both rules follows below.
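
Putting the two rules into the original spider, a complete version might look something like the sketch below. The item-link XPath ('//td[@class="complaint"]/a') is an assumption derived from the selectors already used in parse_gen, not something given in the answer, so verify it against the actual markup; likewise, if the item links lead to detail pages with different markup, the selectors inside parse_gen need to be adapted.

# Sketch only: combines the two rules above with the original spider.
# The item-link XPath is an assumption based on the question's selectors.
from scrapy.linkextractors import LinkExtractor  # on Scrapy < 1.0: scrapy.contrib.linkextractors
from scrapy.spiders import CrawlSpider, Rule     # on Scrapy < 1.0: scrapy.contrib.spiders

from delh.items import DelhItem


class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        # Item links: parse them with parse_gen; these responses are not
        # run through the rules again.
        Rule(LinkExtractor(restrict_xpaths='//td[@class="complaint"]/a'),
             callback="parse_gen"),
        # Pagination: follow only the "Next" link (no callback), so each
        # list page is requested exactly once and its response is matched
        # against the rules again.
        Rule(LinkExtractor(restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]')),
    )

    def parse_gen(self, response):
        # Selectors taken from the question; they target the list-page
        # layout and may need adjusting if the item pages differ.
        for site in response.xpath('//table[@width="100%"]'):
            item = DelhItem()
            item["title"] = site.xpath('.//td[@class="complaint"]/a/span/text()').extract()
            item["content"] = site.xpath('.//td[@class="compl-text"]/div/text()').extract()
            yield item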

Notes:

  • SgmlLinkExtractor is obsolete and you should use LxmlLinkExtractor (or its alias LinkExtractor) instead (source); the import swap is sketched just after this list.
  • The order in which you send out your requests does matter and, in this sort of situation (scraping an unknown, potentially large, number of pages/items), you should seek to reduce the number of pages being processed at any given time. To this end I've modified your code in two ways:
    • Scrape the items from the current list page before requesting the next one; this is why the item rule comes before the "pagelinks" rule.
    • Avoid crawling a page several times over; this is why I added the [contains(text(), "Next")] selector to the "pagelinks" rule. This way each "list page" gets requested exactly once.
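
On the first point, the change is essentially just the import; which module path applies depends on the Scrapy version (the scrapy.contrib path matches the question's imports, the scrapy.linkextractors path is the one used from Scrapy 1.0 on):

# Deprecated SGML-based extractor (what the question imports):
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Lxml-based replacement; LinkExtractor is an alias for LxmlLinkExtractor:
from scrapy.contrib.linkextractors import LinkExtractor    # Scrapy 0.24.x
# from scrapy.linkextractors import LinkExtractor          # Scrapy >= 1.0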


Source: https://stackoverflow.com/questions/28102472/scrapy-needs-to-crawl-all-the-next-links-on-website-and-move-on-to-the-next-page
