Question
When a connection is created with open="r", it allows line-by-line reading, which is useful for batch processing large data streams. For example, this script parses a sizable gzipped JSON HTTP stream by reading 100 lines at a time. Unfortunately, R's url() connection does not support SSL:
> readLines(url("https://api.github.com/repos/jeroenooms/opencpu"))
Error in readLines(url("https://api.github.com/repos/jeroenooms/opencpu")) :
cannot open the connection: unsupported URL scheme
The RCurl and httr packages do support HTTPS, but I don't think they are capable of creating a connection object similar to url(). Is there some other way to do line-by-line reading of an HTTPS connection, similar to the example in the script above?
Answer 1:
Yes, RCurl can "do line-by-line reading". In fact, it always does, but the higher-level functions hide this from you for convenience. You use the writefunction option (and headerfunction for the header) to specify a function that is called each time libcurl has received enough bytes from the body of the result. That function can do anything it wants. There are several examples of this in the RCurl package itself. Here is a simple one:
library(RCurl)

# Print each chunk of the response body as libcurl delivers it.
curlPerform(url = "http://www.omegahat.org/index.html",
            writefunction = function(txt, ...) {
              cat("*", txt, "\n")
              TRUE
            })
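Note that the chunks passed to writefunction are not guaranteed to end on a line boundary, so a callback that needs whole lines has to keep the unfinished tail of one chunk and prepend it to the next. The following is a minimal sketch of that buffering in base R. The names make_line_buffer, on_lines, and flush are hypothetical and not part of RCurl, and the demo feeds simulated chunks rather than a real HTTPS response.

```r
# Hypothetical helper (not part of RCurl): returns a chunk handler that
# forwards only complete lines to on_lines() and buffers any partial line.
make_line_buffer <- function(on_lines) {
  pending <- ""
  list(
    write = function(txt, ...) {
      buf <- paste0(pending, txt)
      parts <- strsplit(buf, "\n", fixed = TRUE)[[1]]
      complete <- endsWith(buf, "\n")
      # Without a trailing newline, the last piece is an unfinished line.
      pending <<- if (complete || length(parts) == 0) "" else parts[length(parts)]
      if (!complete && length(parts) > 0) {
        parts <- parts[-length(parts)]
      }
      if (length(parts) > 0) {
        on_lines(parts)
      }
      TRUE  # same return value as the example above
    },
    # Return whatever is left after the last chunk and reset the buffer.
    flush = function() {
      rest <- pending
      pending <<- ""
      rest
    }
  )
}

# Demo with simulated chunks that split a line in the middle.
seen <- character()
lb <- make_line_buffer(function(lines) seen <<- c(seen, lines))
for (chunk in c('{"a":1}\n{"a"', ':2}\n{"a":3}')) {
  lb$write(chunk)
}
last_line <- lb$flush()

# Assumed usage with RCurl (untested here, requires network access):
# curlPerform(url = "https://api.github.com/repos/jeroenooms/opencpu",
#             writefunction = lb$write)
```

After the demo, seen holds the two complete lines and last_line holds the final record, which arrived without a trailing newline.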
Answer 2:
One solution is to call the curl executable manually via pipe(). The following seems to work.
library(jsonlite)

# Stream the gzipped file through the external curl executable.
gzstream <- gzcon(pipe("curl https://jeroenooms.github.io/files/hourly_14.json.gz", open="r"))
batches <- list(); i <- 1

# Read 100 lines at a time and parse each batch as a JSON array.
while(length(records <- readLines(gzstream, n = 100))){
  message("Batch ", i, ": found ", length(records), " lines of json...")
  json <- paste0("[", paste0(records, collapse=","), "]")
  batches[[i]] <- fromJSON(json, validate=TRUE)
  i <- i+1
}
weather <- rbind.pages(batches)
rm(batches); close(gzstream)
However, this is suboptimal because the curl executable might not be available for various reasons. It would be much nicer to open this stream directly via RCurl/libcurl.
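The batch loop itself does not depend on how the connection was created, so it can be factored into a helper that accepts any readable connection. The following is a minimal sketch under that assumption. read_batches and handler are hypothetical names, and the demo reads a local gzipped temporary file so that it runs without network access or the curl executable.

```r
# Hypothetical helper: read a connection in batches of n lines and pass each
# batch to handler(). Returns the total number of lines read.
read_batches <- function(con, n = 100, handler) {
  if (!isOpen(con)) {
    open(con, "r")
  }
  on.exit(close(con))
  total <- 0
  while (length(records <- readLines(con, n = n)) > 0) {
    handler(records)
    total <- total + length(records)
  }
  total
}

# Network-free demo: write 250 one-line JSON records to a gzipped temp file.
tmp <- tempfile(fileext = ".json.gz")
out <- gzfile(tmp, "w")
writeLines(sprintf('{"id":%d}', 1:250), out)
close(out)

# Read the file back through gzcon() in batches of 100 lines.
batch_sizes <- integer()
n_lines <- read_batches(gzcon(file(tmp, "rb")), n = 100,
                        handler = function(records) {
                          batch_sizes <<- c(batch_sizes, length(records))
                        })
unlink(tmp)
```

If the curl package is installed, passing gzcon(curl::curl(url)) as con should provide an HTTPS-capable connection backed by libcurl without shelling out to the executable. That combination is an assumption here and was not exercised in the demo.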
Source: https://stackoverflow.com/questions/25700909/line-by-line-reading-from-https-connection-in-r