Scrapy Not Returning Additional Info from Scraped Link in Item via Request Callback


别跟我提以往 2021-01-24 03:09

Basically the code below scrapes the first 5 items of a table. One of the fields is another href and clicking on that href provides more info which I want to collect and add to

3 Answers

醉梦人生 (OP)

2021-01-24 03:22
Sorry about the SSL and Fiddler remarks; they were not meant for you. I mixed up two answers here. :p Now, coming to your code, you said:

Running the code below only returns the info collected in parse

That's right, because you are returning a list of 5 items populated only with 'thing1' and 'thing2'. Returning the items here does not cause the Scrapy engine to send the requests to the callback 'parse_next_page': the requests are created but never handed to the engine, as shown below.
```
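for x in range(1, 6):
    item = ScrapyItem()
    str_selector = '//tr[@name="row{0}"]'.format(x)
    item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
    item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
    print('hello')
    # The request is built but never returned or yielded, so the engine never sees it.
    request = Request("www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
    print('hello2')
    request.meta['item'] = item  # redundant: meta already carries the item
    items.append(item)

# Only the half-filled items reach the engine; parse_next_page is never called.
return items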
```
Then you said:

If I change the return items to return request I get a completed item with all 3 "things" but I only get 1 of the rows, not all 5.

That's also true, because 'return request' sits outside the loop, so only the last request created in the loop is returned and the first 4 are discarded. So either build a list of requests and return it after the loop, or use 'yield request' inside the loop. This should definitely work, as I have tested the same case myself. Returning the items from parse will not retrieve 'thing3'.
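
To make the difference concrete, here is a rough sketch of the two variants. It does not use Scrapy itself: `StubRequest`, `StubResponse`, and `run` are hypothetical stand-ins for `Request`, the response object, and the engine, and the per-row URL and the 'thing3' value are made up for illustration. In a real spider you would yield `Request(url, callback=self.parse_next_page, meta={'item': item})` in the same place.

```python
# Stand-ins so the control flow runs without Scrapy installed (assumption:
# these only mimic the url/callback/meta attributes used in the answer).
class StubRequest:
    def __init__(self, url, callback, meta):
        self.url = url
        self.callback = callback
        self.meta = meta


class StubResponse:
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta


def parse_next_page(response):
    # Second callback: pick up the half-filled item and complete it.
    item = response.meta['item']
    item['thing3'] = 'detail from ' + response.url  # placeholder extraction
    return item


def parse_return_after_loop():
    # Problem case: 'request' is overwritten on every pass, so only the
    # request from the last iteration is returned.
    for x in range(1, 6):
        item = {'thing1': 'row{0}'.format(x)}
        request = StubRequest('http://www.nextpage.com/{0}'.format(x),
                              callback=parse_next_page, meta={'item': item})
    return [request]


def parse_yield_in_loop():
    # Fix: hand over every request as soon as it is built.
    for x in range(1, 6):
        item = {'thing1': 'row{0}'.format(x)}
        yield StubRequest('http://www.nextpage.com/{0}'.format(x),
                          callback=parse_next_page, meta={'item': item})


def run(requests):
    # Toy engine: "download" each request and call its callback.
    return [req.callback(StubResponse(req)) for req in requests]


broken = run(parse_return_after_loop())
fixed = run(parse_yield_in_loop())
print(len(broken), len(fixed))
```

The first variant ends up with a single completed item (the last row), the second with one completed item per row, which matches what you observed.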

Simply apply either solution and your spider should run like a missile.