Line by line reading from HTTPS connection in R

社会主义新天地 submitted on 2019-12-23 12:50:57

Question


When a connection is created with open="r", it can be read line by line, which is useful for batch processing large data streams. For example, this script parses a sizable gzipped JSON HTTP stream by reading 100 lines at a time. Unfortunately, R's built-in url() connection does not support SSL (HTTPS):

> readLines(url("https://api.github.com/repos/jeroenooms/opencpu"))
Error in readLines(url("https://api.github.com/repos/jeroenooms/opencpu")) : 
  cannot open the connection: unsupported URL scheme

The RCurl and httr packages do support HTTPS, but I don't think they are capable of creating a connection object similar to url(). Is there some other way to do line-by-line reading of an HTTPS connection similar to the example in the script above?


Answer 1:


Yes, RCurl can "do line-by-line reading". In fact, it always works this way, but the higher-level functions hide it from you for convenience. You use the writefunction option (and headerfunction for the header) to specify a function that is called each time libcurl has received enough bytes from the body of the result. That function can do anything it wants. There are several examples of this in the RCurl package itself, but here is a simple one:

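library(RCurl)

# The callback is invoked once per chunk of body text that libcurl delivers
curlPerform(url = "http://www.omegahat.org/index.html",
            writefunction = function(txt, ...) {
              cat("*", txt, "\n")   # mark the start of each chunk
              TRUE
            })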
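
Note that the callback receives chunks of arbitrary size, not lines, so a chunk can end in the middle of a line. A minimal sketch of how one might turn chunks into complete lines is shown below. The helper name make_line_buffer and its on_line argument are hypothetical (they are not part of RCurl), and the sketch assumes "\n" line endings and that no multi-byte character is split across chunks.

```r
# Hypothetical helper (not part of RCurl): returns a chunk handler that
# calls on_line() once for every complete line seen so far.
make_line_buffer <- function(on_line) {
  pending <- ""  # text received after the most recent newline
  handler <- function(txt, ...) {
    buf <- paste0(pending, txt)
    parts <- strsplit(buf, "\n", fixed = TRUE)[[1]]
    if (!endsWith(buf, "\n") && length(parts) > 0) {
      # The last piece is an unfinished line: keep it for the next chunk
      pending <<- parts[length(parts)]
      parts <- parts[-length(parts)]
    } else {
      pending <<- ""
    }
    for (line in parts) on_line(line)
    TRUE  # same return value as the callback in the example above
  }
  flush <- function() {
    # Emit a final line that was not terminated by a newline
    if (nzchar(pending)) {
      on_line(pending)
      pending <<- ""
    }
    invisible(NULL)
  }
  list(handler = handler, flush = flush)
}

# Feed the handler by hand to show how chunk boundaries are handled
lines_seen <- character(0)
lb <- make_line_buffer(function(line) lines_seen <<- c(lines_seen, line))
lb$handler("first li")
lb$handler("ne\nsecond line\nthi")
lb$handler("rd line")
lb$flush()
print(lines_seen)

# Assumed usage with RCurl (not run here):
#   curlPerform(url = "...", writefunction = lb$handler); lb$flush()
```

The closure keeps the unfinished tail between calls, so the per-line function never sees a partial line regardless of where libcurl happens to cut the stream.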



Answer 2:


One solution is to manually call the curl executable via pipe. The following seems to work.

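library(jsonlite)

# Stream the gzipped JSON through the system curl executable
gzstream <- gzcon(pipe("curl https://jeroenooms.github.io/files/hourly_14.json.gz", open="r"))
batches <- list(); i <- 1
# Read 100 lines at a time until the stream is exhausted
while(length(records <- readLines(gzstream, n = 100))){
  message("Batch ", i, ": found ", length(records), " lines of json...")
  # Wrap the newline-delimited records in [ ] so they parse as one JSON array
  json <- paste0("[", paste0(records, collapse=","), "]")
  batches[[i]] <- fromJSON(json, validate=TRUE)
  i <- i+1
}
# rbind_pages() was named rbind.pages() in older jsonlite releases
weather <- rbind_pages(batches)
rm(batches); close(gzstream)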
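
The batching pattern in that loop can be tried without any network access. The sketch below uses a textConnection with five made-up JSON records (illustrative data only) and a batch size of 2 instead of 100. It relies on readLines() returning a zero-length vector once the connection is exhausted, which is what ends the while loop.

```r
# Five made-up newline-delimited JSON records, for illustration only
con <- textConnection(c('{"id":1}', '{"id":2}', '{"id":3}', '{"id":4}', '{"id":5}'))

batch_sizes <- integer(0)
json_batches <- character(0)
# readLines() returns character(0) at end of input, so length() becomes 0
while (length(records <- readLines(con, n = 2))) {
  batch_sizes <- c(batch_sizes, length(records))
  # Join the records with commas and wrap them in [ ] to form a JSON array
  json_batches <- c(json_batches, paste0("[", paste0(records, collapse = ","), "]"))
}
close(con)

print(batch_sizes)
print(json_batches)
```

The last batch may be shorter than the requested size, so downstream code should not assume every batch holds exactly n records.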

However, this is suboptimal because the curl executable might not be available for various reasons. It would be much nicer to create this connection directly via RCurl/libcurl.
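
If the pipe approach is used anyway, one option is to check for the executable up front and fail with a clear message. The sketch below is only an illustration: the helper names curl_available and curl_command are hypothetical, and -s is curl's flag for suppressing the progress meter.

```r
# Hypothetical helper: TRUE if a curl executable is found on the PATH
curl_available <- function() {
  nzchar(unname(Sys.which("curl")))
}

# Hypothetical helper: build the shell command, quoting the URL safely
curl_command <- function(url) {
  paste("curl -s", shQuote(url))
}

has_curl <- curl_available()
cmd <- curl_command("https://jeroenooms.github.io/files/hourly_14.json.gz")
print(has_curl)
print(cmd)

# Assumed usage (not run here):
#   if (!has_curl) stop("curl executable not found on PATH")
#   gzstream <- gzcon(pipe(cmd, open = "r"))
```

Quoting the URL with shQuote() also guards against characters such as & or ? being interpreted by the shell.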



Source: https://stackoverflow.com/questions/25700909/line-by-line-reading-from-https-connection-in-r
