Question
i originally asked this question about performing this task with the httr package, but i don't think it's possible using httr. so i've re-written my code to use RCurl instead, but i'm still tripping up on something probably related to the writefunction, and i really don't understand why.
you should be able to reproduce my work by using the 32-bit version of R, so that you hit memory limits if you read anything into RAM. i need a solution that downloads directly to the hard disk.
to start, this code works: the zipped file is correctly saved to the disk.
library(RCurl)
filename <- tempfile()
f <- CFILE(filename, "wb")
url <- "http://www2.census.gov/acs2011_5yr/pums/csv_pus.zip"
curlPerform(url = url, writedata = f@ref)
close(f)
# 2.1 GB file successfully written to disk
now here's some RCurl code that does not work. as stated in the previous question, reproducing this exactly will require creating an extract on ipums.
your.email <- "email@address.com"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"
library(RCurl)
values <-
    list(
        "login[email]" = your.email ,
        "login[password]" = your.password ,
        "login[is_for_login]" = 1
    )
curl = getCurlHandle()
curlSetOpt(
    cookiejar = 'cookies.txt',
    followlocation = TRUE,
    autoreferer = TRUE,
    ssl.verifypeer = FALSE,
    curl = curl
)
params <-
    list(
        "login[email]" = your.email ,
        "login[password]" = your.password ,
        "login[is_for_login]" = 1
    )
html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL("https://usa.ipums.org/usa-action/extract_requests/download", curl = curl)
and now that i'm logged in, try the same commands as above, but with the curl object to keep the cookies.
filename <- tempfile()
f <- CFILE(filename, mode = "wb")
this line breaks--
curlPerform(url = extract.path, writedata = f@ref, curl = curl)
close(f)
# the error is:
Error in curlPerform(url = extract.path, writedata = f@ref, curl = curl) :
embedded nul in string: [[binary gibberish here]]
the answer to my previous post referred me to this c-level writefunction answer, but i'm clueless about how to re-create that curl_writer C program (on windows?)..
dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
curlPerform(URL=url, writefunction=writer)
..or why it's even necessary, given that the five lines of code at the top of this question work without anything crazy like getNativeSymbolInfo. i just don't understand why passing in that extra curl object, which stores the authentication/cookies and tells it not to verify SSL, would cause code that otherwise works.. to break?
Answer 1:
From this link, create a file named curl_writer.c and save it to C:\<folder where you save your R files>
#include <stdio.h>

/**
 * Original code just sent some message to stderr
 */
size_t writer(void *buffer, size_t size, size_t nmemb, void *stream) {
    fwrite(buffer, size, nmemb, (FILE *)stream);
    return size * nmemb;
}
Open a command window, go to the folder where you saved curl_writer.c and run the R compiler
c:> cd "C:\<folder where you save your R files>"
c:> R CMD SHLIB -o curl_writer.dll curl_writer.c
Open R and run your script
C:> R
your.email <- "email@address.com"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"
library(RCurl)
values <-
    list(
        "login[email]" = your.email ,
        "login[password]" = your.password ,
        "login[is_for_login]" = 1
    )
curl = getCurlHandle()
curlSetOpt(
    cookiejar = 'cookies.txt',
    followlocation = TRUE,
    autoreferer = TRUE,
    ssl.verifypeer = FALSE,
    curl = curl
)
params <-
    list(
        "login[email]" = your.email ,
        "login[password]" = your.password ,
        "login[is_for_login]" = 1
    )
html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL("https://usa.ipums.org/usa-action/extract_requests/download", curl = curl)
# Load the DLL you created
# "writer" is the name of the function
# "curl_writer" is the name of the dll
dyn.load("curl_writer.dll")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
# Note that the "URL" parameter is upper case here; in your code it is lowercase
# I'm not sure whether that makes a difference
# "writer" is the symbol defined above
f <- CFILE(filename <- tempfile(), "wb")
curlPerform(URL=extract.path, writedata=f@ref, writefunction=writer, curl=curl)
close(f)
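The point of the C callback is that it copies each downloaded chunk to the file as raw bytes, so NUL bytes in a gzip stream never have to fit into an R character string (which cannot hold them). As a rough sanity check outside R, the sketch below pushes a small buffer with embedded NULs through the same writer function and reads it back. The helper name roundtrip_ok and the sample bytes are made up for illustration; they are not part of RCurl or of the original answer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same callback as in curl_writer.c: copy the chunk to the FILE* as raw bytes. */
size_t writer(void *buffer, size_t size, size_t nmemb, void *stream) {
    fwrite(buffer, size, nmemb, (FILE *)stream);
    return size * nmemb;
}

/* Hypothetical helper (illustration only): sends a chunk containing NUL bytes
 * through writer() and returns 1 if the file holds exactly the same bytes. */
int roundtrip_ok(void) {
    /* Made-up sample data: gzip magic bytes followed by embedded NULs. */
    unsigned char chunk[] = {0x1f, 0x8b, 0x00, 0x00, 'a', 0x00, 'b'};
    unsigned char back[sizeof chunk];

    FILE *fp = tmpfile();   /* temporary file opened in binary update mode */
    assert(fp != NULL);     /* tmpfile() can fail on locked-down systems */

    size_t reported = writer(chunk, 1, sizeof chunk, fp);

    rewind(fp);
    size_t got = fread(back, 1, sizeof back, fp);
    fclose(fp);

    return reported == sizeof chunk
        && got == sizeof chunk
        && memcmp(chunk, back, sizeof chunk) == 0;
}
```

Calling roundtrip_ok() from a small main of your own should return 1 if the callback preserves every byte, NULs included.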
Answer 2:
this is now possible with the httr package. thanks hadley!
https://github.com/hadley/httr/issues/44
Source: https://stackoverflow.com/questions/17329288/how-to-download-a-large-binary-file-with-rcurl-after-server-authentication