Basically the code below scrapes the first 5 items of a table. One of the fields is another href and clicking on that href provides more info which I want to collect and add to
Oh.. yarr.. change the code into this..
def parse(self, response):
hxs = HtmlXPathSelector(response)
items = []
for x in range (1,6):
item = ScrapyItem()
str_selector = '//tr[@name="row{0}"]'.format(x)
item['thing1'] = hxs.select(str_selector")]/a/text()').extract()
item['thing2'] = hxs.select(str_selector")]/a/@href').extract()
print 'hello'
request = Request("www.nextpage.com", callback=self.parse_next_page,meta={'item':item})
print 'hello2'
yield request
#donot return or yield item here.. only yield request return item in the callback.
def parse_next_page(self, response):
print 'stuff'
hxs = HtmlXPathSelector(response)
item = response.meta['item']
item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract()
return item
I think now its pretty clear...
Sorry about the SSL and Fiddler things.. they were not meant for you. I mixed two answers here.. :p Now come to your code, you said
Running the code below only returns the info collected in parse
that's right because you are returning a list of 5 items populated with the 'thing1' and 'thing2' returning item here will not cause scrapy engine to send the request to the call back 'parse_next_page' as shown below.
for x in range (1,6):
item = ScrapyItem()
str_selector = '//tr[@name="row{0}"]'.format(x)
item['thing1'] = hxs.select(str_selector")]/a/text()').extract()
item['thing2'] = hxs.select(str_selector")]/a/@href').extract()
print 'hello'
request = Request("www.nextpage.com", callback=self.parse_next_page,meta={'item':item})
print 'hello2'
request.meta['item'] = item
items.append(item)
return items
then you said...
If I change the return items to return request I get a completed item with all 3 "things" but I only get 1 of the rows, not all 5.
that's also true because you are using 'return request' outside the loop which executes only last request created in loop and not the first 4. So either make a 'list of requests' and return in outside the loop or use 'yield request' inside the loop.. this should work definitely as I have tested same case myself. Returning items inside the parse will not retrieve the 'thing3'.
simply apply any one solution and your spider should run like missile....
Install pyOpenSSL , sometimes fiddler also creates problem for "https:\*" requests. Close fiddler if running and run spider again. Another problem which is in your code that you are using a generator in parse method and not using 'yeild' to return the request to scrapy scheduler. You should do it like this....
def parse(self, response):
hxs = HtmlXPathSelector(response)
items = []
for x in range (1,6):
item = ScrapyItem()
str_selector = '//tr[@name="row{0}"]'.format(x)
item['thing1'] = hxs.select(str_selector")]/a/text()').extract()
item['thing2'] = hxs.select(str_selector")]/a/@href').extract()
print 'hello'
request = Request("www.nextpage.com",callback=self.parse_next_page,meta{'item':item})
if request:
yield request
else:
yield item