Scrapy Not Returning Additional Info from Scraped Link in Item via Request Callback

别跟我提以往 2021-01-24 03:09

Basically the code below scrapes the first 5 items of a table. One of the fields is another href and clicking on that href provides more info which I want to collect and add to

3 Answers
  • 2021-01-24 03:21

    Oh, change the code to this:

    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector  # legacy selector API, as used in the question

    # Both functions below are methods of the spider class.
    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        for x in range(1, 6):
            item = ScrapyItem()
            str_selector = '//tr[@name="row{0}"]'.format(x)
            item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
            item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
            print('hello')
            # Request URLs need a scheme, so "http://" is added to the placeholder URL.
            request = Request("http://www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
            print('hello2')
            # Do not return or yield the item here. Only yield the request,
            # and return the item from the callback.
            yield request

    def parse_next_page(self, response):
        print('stuff')
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']  # the half-filled item passed along by parse
        item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract()
        return item
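
    To see why the item has to come out of the callback rather than out of parse, here is a minimal, library-free sketch of the same flow. FakeRequest, FakeResponse, crawl and the canned PAGES data are hypothetical stand-ins for Scrapy's Request, Response and engine, not Scrapy's actual API, and the example.com URLs are made up; only the meta hand-off mirrors the code above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a Scrapy Request (url, callback, meta)."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a response; it carries the request's meta."""
    url: str
    body: str
    meta: dict


# Canned "detail pages" so the sketch needs no network access.
PAGES = {"http://example.com/detail/%d" % n: "detail %d" % n for n in range(1, 6)}


def parse(response):
    for n in range(1, 6):
        item = {"thing1": "row %d" % n, "thing2": "http://example.com/detail/%d" % n}
        # Yield only the request; the half-filled item travels in meta.
        yield FakeRequest(item["thing2"], callback=parse_next_page, meta={"item": item})


def parse_next_page(response):
    item = response.meta["item"]
    item["thing3"] = response.body  # the extra field from the linked page
    return item


def crawl(start_callback):
    """Toy engine: follow every FakeRequest, collect anything else as an item."""
    items = []
    queue = deque(start_callback(FakeResponse("http://example.com/", "", {})))
    while queue:
        result = queue.popleft()
        if isinstance(result, FakeRequest):
            response = FakeResponse(result.url, PAGES[result.url], result.meta)
            output = result.callback(response)
            # A callback may return a single item or an iterable of results.
            queue.extend([output] if isinstance(output, dict) else output)
        else:
            items.append(result)
    return items


items = crawl(parse)
```

    Every one of the five items reaches the output with all three fields, because each one is only emitted after its own request has been answered.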
    

    I think it's pretty clear now.

  • 2021-01-24 03:22

    Sorry about the SSL and Fiddler remarks; they were not meant for you. I mixed up two answers here. :p Now, coming to your code, you said:

    Running the code below only returns the info collected in parse

    That's right, because you are returning a list of 5 items populated only with 'thing1' and 'thing2'. Each request is created but never returned or yielded, so the Scrapy engine never sends it and the callback 'parse_next_page' is never called. This is the code in question:

    for x in range (1,6):
        item = ScrapyItem()
        str_selector = '//tr[@name="row{0}"]'.format(x)
        item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
        item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
        print 'hello'
        request = Request("www.nextpage.com", callback=self.parse_next_page,meta={'item':item})
        print 'hello2'
        request.meta['item'] = item
        items.append(item)      
    
    return items
    

    Then you said:

    If I change the return items to return request I get a completed item with all 3 "things" but I only get 1 of the rows, not all 5. 
    

    That's also true, because 'return request' sits outside the loop. By that point the variable holds only the last request created in the loop, so that one is executed and the first 4 are lost. Either build a list of requests and return it after the loop, or use 'yield request' inside the loop. This should definitely work, as I have tested the same case myself. Returning the items from parse will not retrieve 'thing3'.

    Simply apply either solution and your spider should run like a missile.
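
    The difference between the three variants can be shown without Scrapy at all. In this sketch a request is just a string, and make_request and the three parse_* names are hypothetical; the point is only which requests reach the caller.

```python
def make_request(x):
    """Hypothetical stand-in for building the Request for row x."""
    return "request for row{0}".format(x)


def parse_return_after_loop():
    for x in range(1, 6):
        request = make_request(x)
    # `request` was overwritten on every pass, so only the row5 request is left.
    return request


def parse_return_list():
    requests = []
    for x in range(1, 6):
        requests.append(make_request(x))
    return requests  # all five, returned once the loop has finished


def parse_yield():
    for x in range(1, 6):
        yield make_request(x)  # handed over one at a time, inside the loop


last_only = parse_return_after_loop()
as_list = parse_return_list()
as_generator = list(parse_yield())
```

    Here last_only holds a single request, while as_list and as_generator both hold all five, which matches the "1 row instead of 5" symptom described in the question.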

  • 2021-01-24 03:30

    Install pyOpenSSL. Fiddler sometimes also causes problems for "https://*" requests, so close Fiddler if it is running and run the spider again. Another problem in your code is that the parse method builds requests but never uses 'yield' to hand them to the Scrapy scheduler. You should do it like this:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        for x in range(1, 6):
            item = ScrapyItem()
            str_selector = '//tr[@name="row{0}"]'.format(x)
            item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
            item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
            print('hello')
            # Request URLs need a scheme, so "http://" is added to the placeholder URL.
            request = Request("http://www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
            # request is always set at this point, so in practice only the first branch runs.
            if request:
                yield request
            else:
                yield item
    