Scrapy-Splash Session Handling

Posted by 不羁岁月 on 2020-01-03 01:23:10

Question


I have been trying to log in to a website and then crawl some URLs that are only accessible after signing in.

def start_requests(self):
    script = """
        function main(splash)
            splash:init_cookies(splash.args.cookies)
            assert(splash:go(splash.args.url))
            splash:set_viewport_full()

            local search_input = splash:select('input[name=username]')
            search_input:send_text("MY_USERNAME")

            splash:evaljs("document.getElementById('password').value = 'MY_PASSWORD';")

            local submit_button = splash:select('input[name=signin]')
            submit_button:click()

            local entries = splash:history()
            local last_response = entries[#entries].response

            return {
                cookies = splash:get_cookies(),
                headers = last_response.headers,
                html = splash:html()
            }
          end
    """

    yield scrapy_splash.SplashRequest(
        url='https://www.website.com/login',
        callback=self.after_login,
        endpoint='execute',
        cache_args=['lua_source'],
        args={'lua_source': script}
    )

def after_login(self, response):
    with open('after_login.html', 'w') as out:
        out.write(response.body.decode('utf-8'))

    script = """
        function main(splash)
            splash:init_cookies(splash.args.cookies)
            assert(splash:go(splash.args.url))
            splash:set_viewport_full()
            assert(splash:wait(10))

            return {
                cookies = splash:get_cookies(),
                html = splash:html()
            }
          end
    """
    yield scrapy_splash.SplashRequest(
        url='https://www.website.com/search?tools',
        callback=self.parse,
        endpoint='execute',
        cookies=response.data['cookies'],
        headers=response.data['headers'],
        args={'lua_source': script},
    )

def parse(self, response):
    with open('search_result.html', 'w+') as out:
        out.write(response.body.decode('utf-8'))

I am following the instructions in Session Handling. First, I log in and am being redirected to the home page; this is correctly saved in after_login.html (the login is working). Then I take the cookies and set them in the second SplashRequest for the search, but the response saved to search_result.html shows that the user is not logged in. What am I missing or doing wrong when trying to persist the session across different SplashRequests?

Regards,


Answer 1:


I'll answer this since it popped up in a Google search.

Try setting session_id on the SplashRequest instead of the cookies parameter, like this:

yield scrapy_splash.SplashRequest(
    url='https://www.website.com/login',
    callback=self.after_login,
    endpoint='execute',
    cache_args=['lua_source'],
    args={'lua_source': script},
    session_id="foo"
)

And this:

yield scrapy_splash.SplashRequest(
    url='https://www.website.com/search?tools',
    callback=self.parse,
    endpoint='execute',
    session_id="foo",
    headers=response.data['headers'],
    args={'lua_source': script},
)
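Note that session_id-based cookie persistence only works when the scrapy-splash middlewares are enabled in the project settings — in particular SplashCookiesMiddleware, which merges the cookies returned by the Lua script back into the session and feeds them into splash.args.cookies on the next request. A minimal settings sketch, using the middleware priorities from the scrapy-splash README (SPLASH_URL assumes a locally running Splash instance):

```python
# settings.py -- sketch of the scrapy-splash setup that session_id
# cookie handling relies on (priorities follow the scrapy-splash README;
# SPLASH_URL is an assumption for a local Splash instance)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    # keeps per-session cookies and exposes them as splash.args.cookies
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

With this in place, every SplashRequest carrying the same session_id (here "foo") shares one cookie jar, as long as the Lua script both calls splash:init_cookies(splash.args.cookies) and returns splash:get_cookies(), which your scripts already do.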


Source: https://stackoverflow.com/questions/44975670/scrapy-splash-session-handling
