I'm having a hard time understanding Scrapy CrawlSpider rules. I have an example that doesn't work the way I would like it to, so it can be one of two things:
If you are from China, I have a Chinese blog post about this: 别再滥用scrapy CrawlSpider中的follow=True ("Stop abusing follow=True in Scrapy's CrawlSpider").
Let's check out how the rules work under the hood:
def _requests_to_follow(self, response):
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        for link in links:
            seen.add(link)
            r = Request(url=link.url, callback=self._response_downloaded)
            # remember which rule produced this request;
            # _response_downloaded reads it back from response.meta['rule']
            r.meta.update(rule=n, link_text=link.text)
            yield r
As you can see, when we follow links, the links in the response are extracted by every rule, in order, inside a for loop, and each extracted link is added to the seen set so that later rules skip it. Every resulting response is then handled by self._response_downloaded:
def _response_downloaded(self, response):
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
def _parse_response(self, response, callback, cb_kwargs, follow=True):
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item
    # follow will go back to the rules again
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item
and it goes back to self._requests_to_follow(response) again and again.
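To make that concrete, here is a minimal, framework-free sketch of the loop above (the rule patterns and link URLs are hypothetical, and plain tuples stand in for the real Scrapy objects). It shows that the first rule to extract a link claims it, because later rules only ever see links that are not already in seen:

import re

# (callback name, pattern) pairs standing in for the spider's rules
rules = [
    ('parse_item', re.compile(r'/items')),
    ('parse_electronic_item', re.compile(r'/items/electronics')),
]

# links "extracted" from one response
links_on_page = [
    'http://example.com/items/books/1',
    'http://example.com/items/electronics/tv-42',
]

seen = set()
for callback_name, pattern in rules:
    links = [l for l in links_on_page if pattern.search(l) and l not in seen]
    for link in links:
        seen.add(link)
        print('%s -> %s' % (link, callback_name))

# Both links print with parse_item; parse_electronic_item never receives anything,
# because the electronics link was already added to seen by the first rule.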
In summary:
You are right. According to the source code, before returning each response to the callback function, the crawler loops over the rules, starting from the first. You should keep that in mind when you write the rules. For example, the following rules:
rules = (
    Rule(SgmlLinkExtractor(allow=(r'/items',)), callback='parse_item', follow=True),
    Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True),
)
The second rule will never be applied, since all the links will be extracted by the first rule with the parse_item callback. The matches for the second rule would be filtered out as duplicates by scrapy.dupefilter.RFPDupeFilter. You should use deny for correct matching of links:
rules = (
    Rule(SgmlLinkExtractor(allow=(r'/items',), deny=(r'/items/electronics',)), callback='parse_item', follow=True),
    Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True),
)
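For completeness, a minimal spider built around those corrected rules might look like the sketch below. It uses the current LinkExtractor (SgmlLinkExtractor has since been removed from Scrapy); the spider name, domain, start URL, and yielded fields are all hypothetical:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ItemsSpider(CrawlSpider):
    name = 'items_demo'                        # hypothetical
    allowed_domains = ['example.com']          # hypothetical
    start_urls = ['http://example.com/items']  # hypothetical

    rules = (
        # General item pages, explicitly denying electronics so those
        # links are left for the second rule to claim.
        Rule(LinkExtractor(allow=(r'/items',), deny=(r'/items/electronics',)),
             callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=(r'/items/electronics',)),
             callback='parse_electronic_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'category': 'generic'}

    def parse_electronic_item(self, response):
        yield {'url': response.url, 'category': 'electronics'}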
I would be tempted to use a BaseSpider scraper instead of a crawler. Using a BaseSpider you can have more of a flow of intended request routes, instead of finding ALL hrefs on the page and visiting them based on global rules. Use yield Request() to continue looping through the parent sets of links, and callbacks to pass the output object all the way to the end.
From your description:
I think the crawler should work something like this: the rules crawler is something like a loop. When the first link is matched, the crawler will follow it to the "Step 2" page, then to "Step 3", and after that it will extract data. After doing that it will return to "Step 1" to match the second link, and start the loop again until there are no links left in the first step.
A request callback stack like this would suit you very well, since you know the order of the pages and which pages you need to scrape. It also has the added benefit of being able to collect information over multiple pages before returning the output object to be processed.
import re

# Old-style Scrapy imports, matching the rest of this snippet; in current
# Scrapy you would use scrapy.Spider and response.xpath() instead.
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector


class Basketspider(BaseSpider, errorLog):  # errorLog: your own mixin
    name = "basketsp_test"
    download_delay = 0.5

    def start_requests(self):
        item = WhateverYourOutputItemIs()  # placeholder for your Item class
        yield Request("http://www.euroleague.net/main/results/by-date",
                      callback=self.parseSeasonsLinks, meta={'item': item})

    def parseSeasonsLinks(self, response):
        item = response.meta['item']
        hxs = HtmlXPathSelector(response)
        html = hxs.extract()
        roundLinkList = []
        roundLinkPattern = re.compile(
            r'http://www\.euroleague\.net/main/results/by-date\?gamenumber=\d+&phasetypecode=RS')
        for roundLink in re.findall(roundLinkPattern, html):
            if roundLink not in roundLinkList:
                roundLinkList.append(roundLink)
        for roundLink in roundLinkList:
            # if you want to output this info in the final item
            item['RoundLink'] = roundLink
            # generate a new request for the round page
            yield Request(roundLink, callback=self.parseRoundPage, meta={'item': item})

    def parseRoundPage(self, response):
        item = response.meta['item']
        # Do whatever you need to do in here; call more requests if needed,
        # or return the item here.
        item['Thing'] = 'infoOnPage'
        #....
        return item
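Two details of this pattern are worth noting. The partially filled item travels with each request via meta={'item': item}, so every callback in the chain can read and extend it, and only the last callback returns it so it is emitted exactly once. Because all the per-round requests above share the same item object, if you crawl several rounds concurrently you may want to copy the item before yielding each request, so that callbacks do not overwrite each other's RoundLink field.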