Issue with scraping JS rendered page with Scrapy and Splash

Question

I'm trying to scrape this page, which, according to Chrome, includes the following HTML:

<p class="title">
    Orange Paired
</p>

This is my spider:

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "splash"
    allowed_domains = ["phillips.com"]
    start_urls = ["https://www.phillips.com/detail/BRIDGET-RILEY/UK010417/19"]

    def start_requests(self):
        # Send every start URL through Splash; render.json returns HTML and HAR data.
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint='render.json',
                args={'har': 1, 'html': 1}
            )

    def parse(self, response):
        # Debug output: URLs, <title> elements, HAR page info, content type, target paragraph.
        print("1. PARSED", response.real_url, response.url)
        print("2. ", response.css("title").extract())
        print("3. ", response.data["har"]["log"]["pages"])
        print("4. ", response.headers.get('Content-Type'))
        print("5. ", response.xpath('//p[@class="title"]/text()').extract())

This is the output of scrapy runspider spiders/splash_spider.py:

2017-08-31 09:48:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
1. PARSED http://localhost:8050/render.json https://www.phillips.com/detail/BRIDGET-RILEY/UK010417/19
2.  ['<title>PHILLIPS : Bridget Riley, Orange Paired</title>', '<title>Page 1</title>']
3.  [{'title': 'PHILLIPS : Bridget Riley, Orange Paired', 'pageTimings': {'onContentLoad': 3832, '_onStarted': 1, '_onIframesRendered': 4667, 'onLoad': 4664, '_onPrepareStart': 4664}, 'id': '1', 'startedDateTime': '2017-08-31T07:48:18.986240Z'}]
4.  b'text/html; charset=utf-8'
5.  []
2017-08-31 09:48:23 [scrapy.core.engine] INFO: Closing spider (finished)

Why am I getting an empty output for 5?

I also don't understand why Splash doesn't seem to render the page linked above, even though it renders the top-level homepage.

Answer 1:

A good starting point in such cases is the FAQ section of the Splash documentation. It turns out that in your case you need to disable Private mode for Splash, either via the --disable-private-mode startup option for Docker, or by setting splash.private_mode_enabled = false in your Lua script.

Once you disable Private mode, the page renders correctly.
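
For the Lua route, a minimal sketch might look like the following. The helper name build_execute_args and the wait value are hypothetical and not part of the original spider. The sketch also assumes a Splash version that accepts main(splash, args) and that you switch from the render.json endpoint to execute.

```python
# Lua script run by Splash's "execute" endpoint (a sketch, not a tested recipe).
LUA_SCRIPT = """
function main(splash, args)
    -- Turn off Private mode before loading the page
    splash.private_mode_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return {html = splash:html()}
end
"""


def build_execute_args(wait_seconds=2.0):
    """Hypothetical helper: build the args dict for a SplashRequest."""
    return {"lua_source": LUA_SCRIPT, "wait": wait_seconds}


# Inside the spider's start_requests(), the request would then become:
#     yield SplashRequest(url, self.parse, endpoint="execute",
#                         args=build_execute_args())
```

Because the script returns a table with an html key, scrapy-splash should expose that HTML through response.css and response.xpath as before. Note that response.data["har"] would no longer be present unless the script also returns splash:har().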

Source: https://stackoverflow.com/questions/45976331/issue-with-scraping-js-rendered-page-with-scrapy-and-splash

Tags

javascript

python-3.x

web-scraping

scrapy

splash-screen