How to send JavaScript and Cookies Enabled in Scrapy?

深忆病人 2021-02-08 19:57

I am scraping a website with Scrapy, and the site requires cookies and JavaScript to be enabled. I don't think I will have to actually process JavaScript. All I need is to pretend as

3 answers
  • 2021-02-08 20:03

    AFAIK, there is no universal solution. You have to debug the site to see how it determines that JavaScript is not supported or not enabled in your client.

    I don't think the server looks at an X-JAVASCRIPT-ENABLED header. Maybe a cookie is set by JavaScript when the page loads in a real JavaScript-enabled browser? Maybe the server looks at the User-Agent header?

    See also this response.
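
One way to test the "cookie set by JavaScript" guess is to compare the cookies the browser ends up with against the cookies the server sent in Set-Cookie headers. The following is a minimal sketch using only the standard library; the helper names and the sample data are hypothetical, and the values would have to be copied by hand from the browser's developer tools.

```python
from http.cookies import SimpleCookie


def cookies_set_by_server(set_cookie_headers):
    """Return the cookie names found in raw Set-Cookie header values."""
    names = set()
    for header in set_cookie_headers:
        jar = SimpleCookie()
        jar.load(header)
        names.update(jar.keys())
    return names


def likely_js_cookies(browser_cookie_names, set_cookie_headers):
    """Return cookies the browser holds that no Set-Cookie header explains."""
    server_names = cookies_set_by_server(set_cookie_headers)
    return sorted(set(browser_cookie_names) - server_names)


# Hypothetical data: one Set-Cookie header and the cookie names seen in the browser.
server_headers = ["sessionid=abc123; Path=/"]
browser_cookies = ["sessionid", "js_enabled"]

print(likely_js_cookies(browser_cookies, server_headers))
```

Any cookie this reports is a candidate to send manually from the spider, for example through the `cookies` argument of `scrapy.Request`, together with a browser-like User-Agent passed in `headers`.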

  • 2021-02-08 20:09

    Scrapy doesn't support JavaScript.

    However, you can use another library alongside Scrapy to execute JS, such as WebKit or Selenium.

    You also don't need to enable cookies (COOKIES_ENABLED = True) or add anything to DOWNLOADER_MIDDLEWARES in your settings.py, because both are already part of Scrapy's default settings.
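
For illustration, a settings.py fragment along these lines makes the defaults explicit and turns on cookie logging while debugging. This is only a sketch: the User-Agent string is a placeholder chosen for the example, not a value taken from the question.

```python
# settings.py (sketch)

# Already the Scrapy default; listed only to make the behaviour explicit.
COOKIES_ENABLED = True

# Off by default. When on, Scrapy logs the cookies sent and received,
# which helps when checking what the site expects.
COOKIES_DEBUG = True

# Placeholder browser-like User-Agent (an assumption for this example).
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0 Safari/537.36"
```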

  • 2021-02-08 20:10

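    You should try the Splash JS engine with scrapyjs (since renamed scrapy-splash). Here is an example of how to set it up in your spider project:

    # Address of the running Splash instance
    SPLASH_URL = 'http://192.168.59.103:8050'
    DOWNLOADER_MIDDLEWARES = {
        # Module is scrapy_splash, matching the import used in the spider
        'scrapy_splash.SplashMiddleware': 725,
    }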
    

    Scrapinghub, the company behind Scrapy, offers special instances for running your spiders with Splash enabled.

    Then yield SplashRequest instead of Request in your spider like this:

    import scrapy
    from scrapy_splash import SplashRequest
    
    class MySpider(scrapy.Spider):
        name = "myspider"  # placeholder; Scrapy requires every spider to have a name
        start_urls = ["http://example.com", "http://example.com/foo"]
    
        def start_requests(self):
            # Route each start URL through Splash so the page is rendered first
            for url in self.start_urls:
                yield SplashRequest(url, self.parse,
                    endpoint='render.html',
                    args={'wait': 0.5},
                )
    
        def parse(self, response):
            # response.body is a result of render.html call; it
            # contains HTML processed by a browser.
            # …
            pass
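
Splash also exposes `render.html` directly over HTTP, so the same rendering can be checked in a browser or with curl before wiring it into a spider. The sketch below only builds such a URL; `render_html_url` is a hypothetical helper, and scrapy-splash constructs its own request to Splash, which may pass the arguments differently.

```python
from urllib.parse import urlencode

# Same Splash address as in the settings example above.
SPLASH_URL = "http://192.168.59.103:8050"


def render_html_url(page_url, wait=0.5, splash_url=SPLASH_URL):
    """Build a render.html URL asking Splash to load page_url and wait."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


print(render_html_url("http://example.com"))
```

Opening the printed URL should return the page's HTML after Splash has run its JavaScript, which is a quick way to confirm that the Splash instance is reachable.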
    