Set cookies with rvest

假装没事ソ submitted on 2021-01-27 23:26:25

Question


I would like to programmatically export the records available at this website. To do this manually, I would navigate to the page, click export, and choose the csv.

I tried copying the link from the export button, which (I believe) works only as long as I have a cookie. A plain wget or httr request therefore returns the HTML page instead of the file.

I found some help in an issue on the rvest GitHub repo, but like the person who opened that issue, I can't figure out how to use objects to save the cookie and reuse it in a request.

Here is where I'm at:

library(httr)
library(rvest)

apoc <- html_session("https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx")
headers <- headers(apoc)

GET(url = "https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=False&exportFormat=CSV&isExport=True", 
    add_headers(headers)) # how can I take the output from headers in httr and use it as an argument in GET from httr?

I have checked the robots.txt and this is permissible.


Answer 1:


You can read __VIEWSTATE and __VIEWSTATEGENERATOR from the hidden input fields of the page returned when you GET https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx, then send those values back in a subsequent POST request before requesting the CSV with GET.

options(stringsAsFactors=FALSE)
library(httr)
library(curl)
library(xml2)

url <- 'https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx'

# fetch the page and read the hidden ASP.NET state fields from its HTML
req <- GET(url)
req_html <- read_html(rawToChar(req$content))
fields <- c("__VIEWSTATE","__VIEWSTATEGENERATOR")
viewheaders <- lapply(fields, function(x) {
    xml_attr(xml_find_first(req_html, paste0(".//input[@id='",x,"']")), "value")
})
names(viewheaders) <- fields

# POST the export form; the list of form fields can be found with a tool such as Fiddler
params <- c(viewheaders,
    list(
        "M$ctl19"="M$UpdatePanel|M$C$csfFilter$btnExport",
        "M$C$csfFilter$ddlNameType"="Any",
        "M$C$csfFilter$ddlField"="Elections",
        "M$C$csfFilter$ddlReportYear"="2017",
        "M$C$csfFilter$ddlStatus"="Default",
        "M$C$csfFilter$ddlValue"=-1,
        "M$C$csfFilter$btnExport"="Export"))
resp <- POST(url, body=params, encode="form")
print(resp$status_code)
resptext <- rawToChar(resp$content)
#writeLines(resptext, "apoc.html")

# download the CSV; httr reuses the cookies from the requests above for the same host
url <- "https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=True&exportFormat=CSV&isExport=True"
req <- GET(url)
read.csv(text=rawToChar(req$content))
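
The requests above never set a cookie by hand because httr keeps cookies for a host and sends them again on later requests to that host. If cookies do need to be passed explicitly, as the question asks, a minimal sketch could look like the following. It assumes the data frame layout returned by httr's cookies() (columns name and value); the cookie name and value shown are made-up placeholders, not taken from the APOC site, and export_url is a hypothetical variable name.

```r
library(httr)

# Shape of the data frame returned by httr::cookies(); the row below is a
# made-up placeholder, not a real cookie from the APOC site.
ck <- data.frame(
  name  = "ASP.NET_SessionId",
  value = "abc123",
  stringsAsFactors = FALSE
)

# Turn the name/value columns into a named character vector ...
cookie_vec <- setNames(ck$value, ck$name)

# ... and wrap it in a request config that can be passed to GET() or POST().
cookie_cfg <- set_cookies(.cookies = cookie_vec)

# With a live response `req` from GET(url), the same steps would read:
#   ck <- cookies(req)
#   GET(export_url, set_cookies(.cookies = setNames(ck$value, ck$name)))
```

This only covers the cookie part; the __VIEWSTATE fields still have to be posted as form data as shown earlier.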

You might need to play around with the inputs/code to get what you want precisely.
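
While experimenting, it can help to check whether the server sent back the CSV or just the HTML page again, which is the failure described in the question. The sketch below uses two hypothetical helpers written in base R (they are not part of httr or rvest) and made-up sample text; it only inspects the start of the body, so treat it as a rough heuristic.

```r
# Hypothetical helper: guess whether a response body is an HTML page rather
# than CSV by looking at the first non-blank characters of the text.
looks_like_html <- function(text) {
  start <- tolower(trimws(substr(text, 1, 200)))
  startsWith(start, "<!doctype html") || startsWith(start, "<html")
}

# Hypothetical helper: parse the body as CSV only if it is not an HTML page.
read_csv_or_stop <- function(text) {
  if (looks_like_html(text)) {
    stop("Received an HTML page instead of CSV; the session state was probably not accepted.")
  }
  read.csv(text = text, stringsAsFactors = FALSE)
}

# Illustrative inputs (made up, not real APOC data)
csv_text  <- "Name,Office\nJane Doe,Governor\n"
html_text <- "<!DOCTYPE html>\n<html><body>Candidate Registration</body></html>"

df <- read_csv_or_stop(csv_text)
```

With a live response, the text to check would be rawToChar(req$content), the same value passed to read.csv in the answer.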

Here is another similar solution using RCurl: how-to-login-and-then-download-a-file-from-aspx-web-pages-with-r



Source: https://stackoverflow.com/questions/48389847/set-cookies-with-rvest
