问题
I'm trying to scrape data from a password-protected website in R. Reading around, it seems that the httr and RCurl packages are the best options for scraping with password authentication (I've also looked into the XML package).
The website I'm trying to scrape is below (you need a free account in order to access the full page): http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2
Here are my two attempts (replacing "username" with my username and "password" with my password):
#This returns "Status: 200" without the data from the page:
library(httr)
GET("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", authenticate("username", "password"))
#This returns the non-password protected preview (i.e., not the full page):
library(XML)
library(RCurl)
readHTMLTable(getURL("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", userpwd = "username:password"))
I have looked at other relevant posts (links below), but can't figure out how to apply their answers to my case.
How to use R to download a zipped file from a SSL page that requires cookies
How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?
Reading information from a password protected site
R - RCurl scrape data from a password-protected site
http://www.inside-r.org/questions/how-scrape-data-password-protected-https-website-using-r-hold
回答1:
I don't have an account to test with, but maybe this will work:
library(httr)
library(XML)
handle <- handle("http://subscribers.footballguys.com")
path <- "amember/login.php"
# fields found in the login form.
login <- list(
amember_login = "username"
,amember_pass = "password"
,amember_redirect_url =
"http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2"
)
response <- POST(handle = handle, path = path, body = login)
Now, the response object might hold what you need (or maybe you can directly query the page of interest after the login request; I am not sure the redirect will work, but it is a field in the web form), and handle
might be re-used for subsequent requests. Can't test it; but this works for me in many situations.
You can output the table using XML
> readHTMLTable(content(response))[[1]][1:5,]
Rank Name Tm/Bye Age Exp Cmp Att Cm% PYd Y/Att PTD Int Rsh Yd TD FantPt
1 1 Peyton Manning DEN/4 38 17 415 620 66.9 4929 7.95 43 12 24 7 0 407.15
2 2 Drew Brees NO/6 35 14 404 615 65.7 4859 7.90 37 16 22 44 1 385.35
3 3 Aaron Rodgers GB/9 31 10 364 560 65.0 4446 7.94 33 13 52 224 3 381.70
4 4 Andrew Luck IND/10 25 3 366 610 60.0 4423 7.25 27 13 62 338 2 361.95
5 5 Matthew Stafford DET/9 26 6 377 643 58.6 4668 7.26 32 19 34 102 1 358.60
回答2:
You can use RSelenium. I have used the dev version as you can run phantomjs
without a Selenium Server.
# Install RSelenium if required. You will need phantomjs in your path or follow instructions
# in package vignettes
# devtools::install_github("ropensci/RSelenium")
# login first
appURL <- 'http://subscribers.footballguys.com/amember/login.php'
library(RSelenium)
pJS <- phantom() # start phantomjs
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(appURL)
remDr$findElement("id", "login")$sendKeysToElement(list("myusername"))
remDr$findElement("id", "pass")$sendKeysToElement(list("mypass"))
remDr$findElement("css", ".am-login-form input[type='submit']")$clickElement()
appURL <- 'http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2'
remDr$navigate(appURL)
tableElem<- remDr$findElement("css", "table.datamedium")
res <- readHTMLTable(header = TRUE, tableElem$getElementAttribute("outerHTML")[[1]])
> res[[1]][1:5, ]
Rank Name Tm/Bye Age Exp Cmp Att Cm% PYd Y/Att PTD Int Rsh Yd TD FantPt
1 1 Peyton Manning DEN/4 38 17 415 620 66.9 4929 7.95 43 12 24 7 0 407.15
2 2 Drew Brees NO/6 35 14 404 615 65.7 4859 7.90 37 16 22 44 1 385.35
3 3 Aaron Rodgers GB/9 31 10 364 560 65.0 4446 7.94 33 13 52 224 3 381.70
4 4 Andrew Luck IND/10 25 3 366 610 60.0 4423 7.25 27 13 62 338 2 361.95
5 5 Matthew Stafford DET/9 26 6 377 643 58.6 4668 7.26 32 19 34 102 1 358.60
Finally when you are finished close phantomjs
pJS$stop()
If you want to use a traditional browser like firefox for example (if you wanted to stick to the version on CRAN) you would use:
RSelenium::startServer()
remDr <- remoteDriver()
........
........
remDr$closeServer()
in place of the related phantomjs
calls.
来源:https://stackoverflow.com/questions/24723606/scrape-password-protected-website-in-r