Scrapy Not Returning Additional Info from Scraped Link in Item via Request Callback

别跟我提以往 2021-01-24 03:09

Basically the code below scrapes the first 5 items of a table. One of the fields is another href and clicking on that href provides more info which I want to collect and add to

3 answers
  •  醉梦人生
    2021-01-24 03:22

    Sorry about the SSL and Fiddler stuff; that was not meant for you. I mixed up two answers here. :p Now, coming to your code, you said:

    Running the code below only returns the info collected in parse

    That's right, because you are returning a list of 5 items populated with 'thing1' and 'thing2'. Returning the items here does not cause the Scrapy engine to send the request to the callback 'parse_next_page', as shown below.

    for x in range(1, 6):
        item = ScrapyItem()
        str_selector = '//tr[@name="row{0}"]'.format(x)
        item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
        item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
        print 'hello'
        # This request is created but never returned or yielded,
        # so the engine never schedules it.
        request = Request("www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
        print 'hello2'
        request.meta['item'] = item
        items.append(item)

    # Only the items are handed back, with 'thing1' and 'thing2' set.
    return items
    

    then you said...

    If I change the return items to return request I get a completed item with all 3 "things" but I only get 1 of the rows, not all 5. 
    

    That's also true, because 'return request' sits outside the loop, so only the last request created in the loop is returned and the first 4 are dropped. So either build a list of requests and return it after the loop, or use 'yield request' inside the loop. This should definitely work, as I have tested the same case myself. Returning items from parse will not retrieve 'thing3'; the completed item has to come back from 'parse_next_page' instead.

    Simply apply either solution and your spider should run like a missile.
