selenium with scrapy for dynamic page

清酒与你 2020-11-22 06:04

I'm trying to scrape product information from a webpage, using scrapy. My to-be-scraped webpage looks like this:

  • starts with a product_list page with 10 products
2 Answers
  •  名媛妹妹
    2020-11-22 06:49

    If the URL doesn't change between the two pages, you should add dont_filter=True to your scrapy.Request(), or Scrapy will treat the URL as a duplicate after processing the first page.
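    Scrapy's duplicate filter works, in essence, like a set of already-seen request fingerprints. Here is a minimal stand-alone sketch (plain Python, no Scrapy; `should_schedule` is a hypothetical helper, not a Scrapy API) of why a repeated URL needs dont_filter=True:

```python
# Sketch of Scrapy's dupefilter idea: a set of URLs already scheduled.
seen = set()

def should_schedule(url, dont_filter=False):
    """Return True if the request would be scheduled (hypothetical helper)."""
    if dont_filter:
        return True          # dont_filter=True bypasses the seen-set entirely
    if url in seen:
        return False         # duplicate: silently dropped by the filter
    seen.add(url)
    return True

url = "http://example.com/products"
first = should_schedule(url)                     # scheduled, URL recorded
second = should_schedule(url)                    # dropped as a duplicate
third = should_schedule(url, dont_filter=True)   # forced through the filter
print(first, second, third)  # True False True
```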

    If you need to render pages with JavaScript, you should use scrapy-splash; you can also check this scrapy middleware, which can handle JavaScript pages using Selenium, or you can do it by launching any headless browser.

    But a faster and more effective solution is to inspect your browser's network traffic and see what requests are made when you submit a form or trigger a certain event. Try to simulate the same requests your browser sends. If you can replicate the request(s) correctly, you will get the data you need.

    Here is an example :

    import json

    from scrapy import Request, Spider

    from myproject.items import QuoteItem  # item with author, quote and tags fields


    class ScrollScraper(Spider):
        name = "scrollingscraper"

        quote_url = "http://quotes.toscrape.com/api/quotes?page="
        start_urls = [quote_url + "1"]

        def parse(self, response):
            data = json.loads(response.body)
            for item in data.get('quotes', []):
                quote_item = QuoteItem()  # create a fresh item for each quote
                quote_item['author'] = item.get('author', {}).get('name')
                quote_item['quote'] = item.get('text')
                quote_item['tags'] = item.get('tags')
                yield quote_item

            # follow the JSON pagination until the API reports no next page
            if data.get('has_next'):
                next_page = data['page'] + 1
                yield Request(self.quote_url + str(next_page))
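    The pagination logic in that spider can be exercised without running a crawl; here is a stdlib-only sketch (the sample payload below is invented, mirroring the shape of the quotes.toscrape.com API):

```python
import json

# Invented sample payload mirroring the quotes.toscrape.com API shape.
body = '''{
  "has_next": true,
  "page": 1,
  "quotes": [
    {"author": {"name": "Albert Einstein"},
     "text": "It's not that I'm so smart...",
     "tags": ["ability", "genius"]}
  ]
}'''

quote_url = "http://quotes.toscrape.com/api/quotes?page="
data = json.loads(body)

# Same per-quote extraction the spider's parse() performs.
items = [
    {"author": q.get("author", {}).get("name"),
     "quote": q.get("text"),
     "tags": q.get("tags")}
    for q in data.get("quotes", [])
]

# Same check parse() uses to decide whether to request the next page.
next_url = quote_url + str(data["page"] + 1) if data.get("has_next") else None
print(next_url)  # http://quotes.toscrape.com/api/quotes?page=2
```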
    

    When the pagination URL is the same for every page and is requested via POST, you can use scrapy.FormRequest() instead of scrapy.Request(). Both are the same, but FormRequest adds a new argument (formdata=) to the constructor.
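    Under the hood, FormRequest URL-encodes the formdata dict into the POST body (sent as application/x-www-form-urlencoded); a stdlib sketch of that encoding:

```python
from urllib.parse import urlencode

# The same kind of dict you would pass as formdata= to FormRequest.
formdata = {
    'action': 'sort',
    'view': 'grid',
    'paginated': '2',
}

# FormRequest encodes this into the POST body, roughly like so:
body = urlencode(formdata)
print(body)  # action=sort&view=grid&paginated=2
```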

    Here is another spider example from this post:

    import json

    import scrapy
    from scrapy import FormRequest, Selector


    class SpiderClass(scrapy.Spider):
        # spider name and all
        name = 'ajax'
        page_incr = 1
        start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1']
        pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

        def parse(self, response):
            sel = Selector(response)

            # after the first page, the endpoint returns JSON wrapping an HTML fragment
            if self.page_incr > 1:
                json_data = json.loads(response.body)
                sel = Selector(text=json_data.get('content', ''))

            # your code here

            # pagination code starts here
            if sel.xpath('//div[@class="panel-wrapper"]'):
                self.page_incr += 1
                formdata = {
                    'sorter': 'recent',
                    'location': 'main loop',
                    'loop': 'main loop',
                    'action': 'sort',
                    'view': 'grid',
                    'columns': '3',
                    'paginated': str(self.page_incr),
                    'currentquery[category_name]': 'reviews'
                }
                yield FormRequest(url=self.pagination_url, formdata=formdata,
                                  callback=self.parse)
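    The page_incr > 1 branch matters because, after the first page, the AJAX endpoint returns JSON whose content field wraps an HTML fragment rather than a full page; a stdlib sketch of that unwrapping (the sample response below is invented):

```python
import json

# Invented sample of what the ajax.php endpoint returns after page 1:
# JSON wrapping an HTML fragment in its "content" field.
ajax_body = json.dumps({"content": '<div class="panel-wrapper">...</div>'})

page_incr = 2
if page_incr > 1:
    # later pages: unwrap the HTML fragment from the JSON envelope
    html = json.loads(ajax_body).get("content", "")
else:
    # first page: the response body is already plain HTML
    html = ajax_body

print(html)  # <div class="panel-wrapper">...</div>
```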
    
