Question
I need my Scrapy spider to move on to the next page. How should I write the rule? Here is my current code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from delh.items import DelhItem

class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    #start_urls =["http://www.consumercomplaints.in/?search=delhivery&page=2","http://www.consumercomplaints.in/?search=delhivery&page=3","http://www.consumercomplaints.in/?search=delhivery&page=4","http://www.consumercomplaints.in/?search=delhivery&page=5","http://www.consumercomplaints.in/?search=delhivery&page=6","http://www.consumercomplaints.in/?search=delhivery&page=7","http://www.consumercomplaints.in/?search=delhivery&page=8","http://www.consumercomplaints.in/?search=delhivery&page=9","http://www.consumercomplaints.in/?search=delhivery&page=10","http://www.consumercomplaints.in/?search=delhivery&page=11"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]/a/@href',)),
             callback="parse_gen", follow=True),
    )

    def parse_gen(self, response):
        hxs = Selector(response)
        sites = hxs.select('//table[@width="100%"]')
        items = []
        for site in sites:
            item = DelhItem()
            item['title'] = site.select('.//td[@class="complaint"]/a/span/text()').extract()
            item['content'] = site.select('.//td[@class="compl-text"]/div/text()').extract()
            items.append(item)
        return items
Answer 1:
From my understanding, you are trying to scrape two sorts of pages, hence you should use two distinct rules:
- paginated list pages, containing links to n item pages and to subsequent list pages
- item pages, from which you scrape your items
Your rules should then look something like:
rules = (
    Rule(LinkExtractor(restrict_xpaths='{{ item selector }}'), callback='parse_gen'),
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pagelinks"]/a[contains(text(), "Next")]')),
)
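The pagination XPath in the "pagelinks" rule can be sanity-checked outside Scrapy. A minimal sketch using lxml, on an invented HTML fragment mimicking the site's pagination block (the hrefs and link texts are assumptions for illustration):

```python
from lxml import html

# Hypothetical fragment standing in for the real pagination markup
fragment = """
<div class="pagelinks">
  <a href="/?search=delhivery&amp;page=1">1</a>
  <a href="/?search=delhivery&amp;page=2">2</a>
  <a href="/?search=delhivery&amp;page=2">Next</a>
</div>
"""
tree = html.fromstring(fragment)

# restrict_xpaths should select elements (regions containing links),
# not the @href attribute itself
links = tree.xpath('//div[@class="pagelinks"]/a[contains(text(), "Next")]')
print([a.get("href") for a in links])  # ['/?search=delhivery&page=2']
```

Only the single "Next" anchor is selected, which is what keeps each list page from being requested more than once.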
Explanations:
- The first rule matches item links and uses your item parsing method (parse_gen) as callback. The resulting responses do not go through these rules again.
- The second rule matches the "pagelinks" links and does not specify a callback; the resulting responses will then be handled by these rules.
Notice:
- SgmlLinkExtractor is obsolete and you should use LxmlLinkExtractor (or its alias LinkExtractor) instead (source).
- The order in which you send out your requests does matter, and in this sort of situation (scraping an unknown, potentially large, number of pages/items) you should seek to reduce the number of pages being processed at any given time. To this end I've modified your code in two ways:
  - scrape the items from the current list page before requesting the next one; this is why the item rule comes before the "pagelinks" rule.
  - avoid crawling a page several times over; this is why I added the [contains(text(), "Next")] selector to the "pagelinks" rule. This way each "list page" gets requested exactly once.
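The item XPaths from parse_gen can likewise be checked in isolation. A minimal sketch with lxml, on invented markup that mirrors the structure those selectors expect (the class names come from the question; the text content is made up):

```python
from lxml import html

# Hypothetical listing markup; only the class names are taken from the question
page = """
<table width="100%">
  <tr><td class="complaint"><a><span>Parcel not delivered</span></a></td></tr>
  <tr><td class="compl-text"><div>Ordered two weeks ago, still waiting.</div></td></tr>
</table>
"""
tree = html.fromstring(page)

# Same expressions parse_gen uses, one pair per matched table
for table in tree.xpath('//table[@width="100%"]'):
    title = table.xpath('.//td[@class="complaint"]/a/span/text()')
    content = table.xpath('.//td[@class="compl-text"]/div/text()')
    print(title, content)
```

If a selector returns an empty list here, it will also come back empty in the spider, which makes this a quick way to debug the XPaths before running a full crawl.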
Source: https://stackoverflow.com/questions/28102472/scrapy-needs-to-crawl-all-the-next-links-on-website-and-move-on-to-the-next-page