R: use rvest (or httr) to log in to a site requiring cookies

Posted by 喜你入骨 on 2019-12-04 01:49:14

Question


I'm trying to automate the Shibboleth-based login process for the UK Data Service in R. One can sign up for an account to log in here. A previous attempt to automate this process is described in this question: automating the login to the uk data service website in R with RCurl or httr.

I thought the excellent answers to this question, how to authenticate a shibboleth multi-hostname website with httr in R, were going to get me there, but I've run into a wall.

And, yes, RSelenium provides an alternative, which I've actually tried, but in my experience RSelenium is flaky (not to mention hard to get working across platforms), while rvest/httr/RCurl solutions don't break until the website changes and are easy to get working on other people's machines.

Anyway, the site requires you to click through an initial sign-in page (and get a cookie), then enter your organization (click through and get cookies), then enter your username and password (cookies), and then (because rvest doesn't run JavaScript) click through one more cookie-modifying page, before landing on the "your account" page. It looks to me as though the cookies from every step are necessary: the one that eventually signifies that you've logged in (ASPSESSIONIDSQAQSSQA) is the one created by the initial sign-in page.

So here's what I have so far. First, get to the organization page and enter the organization, saving the cookies from the initial signin page (using the trick from here, Submit form with no submit button in rvest, to cope with the fact that the submit button doesn't activate until an organization is entered).

library(tidyverse)
library(rvest)
library(stringr)
library(httr)  # handle_reset(), cookies(), and set_cookies() are httr functions

org <- "your_organization"
user <- "your_username"
password <- "your_password"

signin <- "http://esds.ac.uk/newRegistration/newLogin.asp"
handle_reset(signin)

# get to org page and enter org
p0 <- html_session(signin) %>% 
    follow_link("Login")
org_link <- html_nodes(p0, "option") %>% 
    str_subset(org) %>% 
    str_match('(?<=\\")[^"]*') %>%
    as.character()

f0 <- html_form(p0) %>%
    first() %>%
    set_values(origin = org_link)
fake_submit_button <- list(name = "submit-btn",
                           type = "submit",
                           value = "Continue",
                           checked = NULL,
                           disabled = NULL,
                           readonly = NULL,
                           required = FALSE)
attr(fake_submit_button, "class") <- "btn-enabled"
f0[["fields"]][["submit"]] <- fake_submit_button

c0 <- cookies(p0)$value
names(c0) <- cookies(p0)$name
p1 <- submit_form(session = p0, form = f0, config = set_cookies(.cookies = c0))

Then, enter the username and password:

# enter user and password
f1 <- html_form(p1) %>%
    first() %>%
    set_values("j_username" = user,
               "j_password" = password)
c1 <- cookies(p1)$value
names(c1) <- cookies(p1)$name
p2 <- submit_form(session = p1, form = f1, config = set_cookies(.cookies = c1))

p2$response says "Since your browser does not support JavaScript, you must press the Continue button once to proceed", so:

# click through
f2 <- p2 %>%
    html_form() %>%
    first()
c2 <- cookies(p2)$value
names(c2) <- cookies(p2)$name

p3 <- submit_form(p2, f2, config = set_cookies(.cookies = c2))

Sadly, instead of finally reaching "your account", p3 lands us back at the organization-entry page p0.

One potentially important issue is that c2 contains two JSESSIONID cookies that, according to cookies(p2), belong to different domains. I don't know what to do about that: I've tried dropping first one and then the other from c2, with no luck. Any suggestions? Thanks!
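
To make the duplicate-JSESSIONID issue concrete, here is a minimal base-R sketch. It assumes a cookie table with the domain, name, and value columns that httr's cookies() returns; the hosts and values are invented, and cookies_for_host is a hypothetical helper, not part of rvest or httr. The domain matching is deliberately simplified (exact host or subdomain) and ignores paths, expiry, and the secure flag.

```r
# Invented cookie table with the domain/name/value columns that
# httr's cookies() returns; hosts and values are placeholders.
jar <- data.frame(
  domain = c("esds.ac.uk", "idp.example.org", "sp.example.org"),
  name   = c("ASPSESSIONIDSQAQSSQA", "JSESSIONID", "JSESSIONID"),
  value  = c("aaa", "bbb", "ccc"),
  stringsAsFactors = FALSE
)

# Flattening by name (the way c2 is built) keeps both values but loses
# the domain, so the two JSESSIONID entries can no longer be told apart.
flat <- setNames(jar$value, jar$name)
dup_names <- unique(names(flat)[duplicated(names(flat))])

# Hypothetical helper: keep only the cookies whose domain covers `host`.
# Simplified rule: exact match, or `host` is a subdomain of the cookie domain.
cookies_for_host <- function(jar, host) {
  dom   <- sub("^\\.", "", jar$domain)   # drop a leading dot, if any
  hosts <- rep(host, length(dom))
  keep  <- hosts == dom | endsWith(hosts, paste0(".", dom))
  setNames(jar$value[keep], jar$name[keep])
}

idp_cookies <- cookies_for_host(jar, "idp.example.org")
```

Here dup_names lists the cookie names that collide once the domain is dropped, and idp_cookies holds only the cookie scoped to the placeholder identity-provider host. A per-host vector like this could be passed to set_cookies(.cookies = ...) for the request to that host, although whether that is enough to complete this particular login is untested.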

Source: https://stackoverflow.com/questions/42701409/r-use-rvest-or-httr-to-log-in-to-a-site-requiring-cookies
