Scrapy Not Returning Additional Info from Scraped Link in Item via Request Callback


Basically the code below scrapes the first 5 items of a table. One of the fields is another href and clicking on that href provides more info which I want to collect and add to


3 Answers

我在风中等你

2021-01-24 03:21

Oh, change the code to this:

# Methods of the spider class; assumes Request and HtmlXPathSelector
# (old Scrapy API) are already imported.
def parse(self, response):
    hxs = HtmlXPathSelector(response)

    for x in range(1, 6):
        item = ScrapyItem()
        str_selector = '//tr[@name="row{0}"]'.format(x)
        item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
        item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
        print 'hello'
        # Request needs an absolute URL including the scheme.
        request = Request("http://www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
        print 'hello2'
        # Do not return or yield the item here; yield only the request
        # and return the item from the callback.
        yield request


def parse_next_page(self, response):
    print 'stuff'
    hxs = HtmlXPathSelector(response)
    # Pick up the partially filled item passed along with the request.
    item = response.meta['item']
    item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract()
    return item

I think it's pretty clear now.
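To make the flow concrete, here is a minimal sketch in plain Python 3 that imitates how a request carries a partial item to its callback through meta. `FakeRequest`, `FakeResponse`, `run` and all field values are hypothetical stand-ins invented for illustration, not Scrapy's API; the real engine schedules and downloads requests asynchronously.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request: URL, callback and meta dict."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a response; it carries its request's meta."""
    url: str
    meta: dict


def parse(response):
    # Build one partial item per row and pass it on via meta.
    for x in range(1, 6):
        item = {"thing1": "name{0}".format(x), "thing2": "/detail/{0}".format(x)}
        yield FakeRequest("http://www.nextpage.com" + item["thing2"],
                          callback=parse_next_page, meta={"item": item})


def parse_next_page(response):
    # Pick up the partial item, add the last field, and only now emit it.
    item = response.meta["item"]
    item["thing3"] = "extra info from " + response.url
    return item


def run(start_callback):
    """Toy engine: follow every yielded request, collect every finished item."""
    queue = deque([FakeRequest("http://start.example", callback=start_callback)])
    items = []
    while queue:
        request = queue.popleft()
        result = request.callback(FakeResponse(request.url, request.meta))
        # A callback may return a single object or an iterable of objects.
        if isinstance(result, (dict, FakeRequest)):
            outputs = [result]
        else:
            outputs = list(result or [])
        for out in outputs:
            if isinstance(out, FakeRequest):
                queue.append(out)
            else:
                items.append(out)
    return items


items = run(parse)
print(len(items))
```

Each of the five items only reaches the output after `parse_next_page` has filled in `thing3`, which is why `parse` itself yields requests and never the items.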


醉梦人生

2021-01-24 03:22
Sorry about the SSL and Fiddler remarks; they were not meant for you. I mixed up two answers here. :p Now, about your code. You said:

Running the code below only returns the info collected in parse

That's right, because you are returning a list of 5 items populated only with 'thing1' and 'thing2'. Returning the items here does not cause the Scrapy engine to send the request to the callback 'parse_next_page', as shown below.
```
# Excerpt from parse(): the requests are built but never handed to Scrapy.
for x in range(1, 6):
    item = ScrapyItem()
    str_selector = '//tr[@name="row{0}"]'.format(x)
    item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
    item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
    print 'hello'
    request = Request("www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
    print 'hello2'
    request.meta['item'] = item
    items.append(item)

return items  # only the partial items are returned; parse_next_page never runs
```
Then you said:
```
If I change the return items to return request I get a completed item with all 3 "things" but I only get 1 of the rows, not all 5. 
```
That's also true, because 'return request' sits outside the loop, so only the last request created in the loop is returned and the first 4 are lost. Either build a list of requests and return it after the loop, or use 'yield request' inside the loop. This should work; I have tested the same case myself. Returning items from parse will not retrieve 'thing3'.

Simply apply either solution and your spider should run like a missile.
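The difference between the three variants can be sketched in plain Python, without Scrapy. `make_request` is a hypothetical helper that returns a dict in place of a real Request, and the URLs are made up for illustration.

```python
def make_request(row):
    # Hypothetical stand-in for building a Request with the item in meta.
    return {
        "url": "http://www.nextpage.com/row{0}".format(row),
        "meta": {"item": {"row": row}},
    }


def parse_return_after_loop(rows):
    request = None
    for row in rows:
        request = make_request(row)  # overwritten on every pass
    return request  # only the last request survives


def parse_return_list(rows):
    requests = []
    for row in rows:
        requests.append(make_request(row))
    return requests  # all requests, returned once after the loop


def parse_yield(rows):
    for row in rows:
        yield make_request(row)  # one request handed over per pass


rows = range(1, 6)
last_only = parse_return_after_loop(rows)
as_list = parse_return_list(rows)
as_generator = list(parse_yield(rows))

print(last_only["meta"]["item"]["row"])  # prints 5
print(len(as_list), len(as_generator))   # prints 5 5
```

The first variant matches the "only 1 of the rows" symptom: the loop runs five times, but only the request for row 5 ever leaves the function.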

情话喂你

2021-01-24 03:30

Install pyOpenSSL. Fiddler sometimes also causes problems for "https:*" requests, so close Fiddler if it is running and run the spider again. Another problem in your code is that the parse method never uses 'yield' to hand the request to the Scrapy scheduler. You should do it like this:

def parse(self, response):
    hxs = HtmlXPathSelector(response)

    for x in range(1, 6):
        item = ScrapyItem()
        str_selector = '//tr[@name="row{0}"]'.format(x)
        item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
        item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
        print 'hello'
        # Request needs an absolute URL including the scheme.
        request = Request("http://www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
        # A freshly built request object is truthy, so this branch normally runs.
        if request:
            yield request
        else:
            yield item
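A note on the if/else above: a small sketch, assuming the request class defines neither `__bool__` nor `__len__` (as ordinary Python classes do not), suggests the `else` branch is only reached when no request object exists at all. `FakeRequest` and `choose` are hypothetical names used for illustration only.

```python
class FakeRequest(object):
    """Hypothetical stand-in for a request class without __bool__ or __len__."""

    def __init__(self, url):
        self.url = url


def choose(request, item):
    # Mirrors the if/else above: fall back to the item only if the request is falsy.
    if request:
        return request
    return item


request = FakeRequest("http://www.nextpage.com")
item = {"thing1": "a", "thing2": "b"}

chosen = choose(request, item)   # an existing request object is truthy
fallback = choose(None, item)    # only a missing request falls through
```

In practice this means the loop behaves like a plain `yield request`, and the item is emitted later by the callback.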
