Question
I have been trying to log in to a website and then crawl some URLs that are only accessible after signing in.
def start_requests(self):
    script = """
    function main(splash)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go(splash.args.url))
        splash:set_viewport_full()

        local search_input = splash:select('input[name=username]')
        search_input:send_text("MY_USERNAME")
        splash:evaljs("document.getElementById('password').value = 'MY_PASSWORD';")
        local submit_button = splash:select('input[name=signin]')
        submit_button:click()

        local entries = splash:history()
        local last_response = entries[#entries].response
        return {
            cookies = splash:get_cookies(),
            headers = last_response.headers,
            html = splash:html()
        }
    end
    """
    yield scrapy_splash.SplashRequest(
        url='https://www.website.com/login',
        callback=self.after_login,
        endpoint='execute',
        cache_args=['lua_source'],
        args={'lua_source': script}
    )
def after_login(self, response):
    with open('after_login.html', 'w') as out:
        out.write(response.body.decode('utf-8'))
    script = """
    function main(splash)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go(splash.args.url))
        splash:set_viewport_full()
        assert(splash:wait(10))
        return {
            cookies = splash:get_cookies(),
            html = splash:html()
        }
    end
    """
    yield scrapy_splash.SplashRequest(
        url='https://www.website.com/search?tools',
        callback=self.parse,
        endpoint='execute',
        cookies=response.data['cookies'],
        headers=response.data['headers'],
        args={'lua_source': script},
    )
def parse(self, response):
    with open('search_result.html', 'w+') as out:
        out.write(response.body.decode('utf-8'))
I am following the instructions in Session Handling. First I log in and am redirected to the home page; this page is correctly saved to after_login.html (so the login works). Then I take the cookies and set them on the second SplashRequest to search, but the response saved in search_result.html shows that the user is not logged in. What am I missing or doing wrong to persist the session across different SplashRequests?
Regards,
Answer 1:
I'll answer this since it popped up in a Google search.
Try setting session_id on the SplashRequest instead of the cookies parameter, like this:
yield scrapy_splash.SplashRequest(
    url='https://www.website.com/login',
    callback=self.after_login,
    endpoint='execute',
    cache_args=['lua_source'],
    args={'lua_source': script},
    session_id="foo"
)
And this:
yield scrapy_splash.SplashRequest(
    url='https://www.website.com/search?tools',
    callback=self.parse,
    endpoint='execute',
    session_id="foo",
    headers=response.data['headers'],
    args={'lua_source': script},
)
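The idea behind session_id is that requests sharing the same value share one cookie jar, so the cookies set during login are automatically sent with the search request. A toy model of that behaviour (an illustration only, not the actual scrapy-splash implementation; the names cookie_jars and splash_request are made up for this sketch):

```python
# Toy model: each session_id keys a shared cookie jar, so two
# requests with the same session_id see the same cookies.
cookie_jars = {}  # session_id -> {cookie name: value}

def splash_request(session_id, set_cookies=None):
    """Simulate one request through a given session."""
    jar = cookie_jars.setdefault(session_id, {})
    if set_cookies:
        jar.update(set_cookies)  # e.g. cookies set by the login response
    return dict(jar)  # cookies visible to this request

# The login request stores a session cookie in jar "foo":
splash_request("foo", set_cookies={"sessionid": "abc123"})
# The search request with the same session_id carries that cookie:
assert splash_request("foo")["sessionid"] == "abc123"
# A different session_id starts with an empty jar:
assert splash_request("bar") == {}
```

This is why passing cookies=... manually is unnecessary once both requests use the same session_id.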
Source: https://stackoverflow.com/questions/44975670/scrapy-splash-session-handling