Question:
I'm trying to get the list of files in a directory on a website. Is there a way to do this similar to the dir() or list.files() commands for a local directory listing? I can connect to the website using RCurl (which I need because the connection is SSL over HTTPS):
library(RCurl)
text=getURL(*some https website*
,ssl.verifypeer = FALSE
,dirlistonly = TRUE)
But this returns the HTML page for the listing, with images, hyperlinks, etc., whereas I just need an R vector of filenames as you would obtain with dir(). Is this possible, or would I have to parse the HTML to extract the filenames? That sounds like a complicated approach for a simple problem.
Thanks,
EDIT: If you can get it to work with http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencodeV7/ then you'll see what I mean.
Answer 1:
This is the last example in the help file for getURL (with an updated URL):
url <- 'ftp://speedtest.tele2.net/'
filenames = getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)
# Deal with newlines as \n or \r\n. (BDR)
# Or alternatively, instruct libcurl to change \n's to \r\n's for us with crlf = TRUE
# filenames = getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE, crlf = TRUE)
filenames = paste(url, strsplit(filenames, "\r*\n")[[1]], sep = "")
Does that solve your problem?
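Note, though, that dirlistonly maps to a libcurl option that only applies to listing-capable protocols such as FTP; it has no effect on the HTTP(S) URL in the question's EDIT, where the server sends back an HTML index page. A minimal sketch of the HTML-parsing route with base-R regexes (my own, assuming an Apache-style auto-generated listing like the UCSC one):

library(RCurl)

url <- "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencodeV7/"
html <- getURL(url, ssl.verifypeer = FALSE)

# Pull the href targets out of the anchor tags; an auto-generated index
# page lists each file as <a href="filename">.
hrefs <- regmatches(html, gregexpr('href="[^"]*"', html))[[1]]
filenames <- gsub('^href="|"$', "", hrefs)

# Drop navigation links (sort-order links start with "?", the parent
# directory link starts with "/").
filenames <- filenames[!grepl("^[?/]", filenames)]

A real HTML parser (e.g. the XML package's htmlParse()) would be more robust than regexes, but for a plain index page this is enough to get the dir()-style vector.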
Answer 2:
Try this:
library(RCurl)
dir_list <-
  read.table(
    textConnection(
      getURLContent("ftp://[...]/")
    ),
    sep = "",
    strip.white = TRUE)
The resulting table splits the date across three text fields, but it is a good start, and you can get the filenames from it.
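If you only want the names, the filename is the last field of a Unix-style LIST line, so (assuming no filenames contain spaces, which read.table would have split across columns) the vector the question asks for would be:

# Last column of the parsed listing holds the filename; this assumes
# no embedded spaces in the names.
filenames <- as.character(dir_list[[ncol(dir_list)]])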
Answer 3:
I was reading an RCurl document and came across the following piece of code:
stockReader =
function()
{
   values <- numeric()  # to which the data is appended when received

   # Function that appends the values to the centrally stored vector
   read = function(chunk) {
      con = textConnection(chunk)
      on.exit(close(con))
      tmp = scan(con)
      values <<- c(values, tmp)
   }

   list(read = read,
        values = function() values  # accessor to get result on completion
       )
}
followed by
reader = stockReader()
getURL("http://www.omegahat.org/RCurl/stockExample.dat",
       write = reader$read)
reader$values()
It says 'numeric' in the sample, but surely the code can be adapted to collect text instead (a sketch of such an adaptation follows the quote below). Read the linked document; I'm sure you will find what you're looking for.
It also says
The basic use of getURL(), getForm() and postForm() returns the contents of the requested document as a single block of text. It is accumulated by the libcurl facilities and combined into a single string. We then typically traverse the contents of the document to extract the information into regular data, e.g. vectors and data frames. For example, suppose the document we requested is a simple stream of numbers such as prices of a particular stock at different time points. We would download the contents of the file, and then read it into a vector in R so that we could analyze the values. Unfortunately, this results in essentially two copies of the data residing in memory simultaneously. This can be prohibitive or at least undesirable for large datasets.

An alternative approach is to process the data in chunks as it is received by libcurl. If we can be notified each time libcurl receives data from the reply and do something meaningful with the data, then we need not accumulate the chunks. The largest extra piece of information we will need to have is the largest chunk. In our example, we could take each chunk and pass it to the scan() function to turn the values into a vector. Then we can concatenate this with the vector from the previously processed chunks.
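A hedged sketch of that adaptation (my own, untested; not from the RCurl document): accumulate lines of text rather than numbers, which is closer to collecting filenames. One caveat: a line split across two chunks would come through broken, so a robust version would need a small carry-over buffer.

lineReader =
function()
{
   lines <- character()  # accumulated lines of text

   # Called by getURL() for each chunk libcurl hands over
   read = function(chunk) {
      con = textConnection(chunk)
      on.exit(close(con))
      lines <<- c(lines, readLines(con))
   }

   list(read = read,
        lines = function() lines)  # accessor for the collected lines
}

reader = lineReader()
getURL("ftp://speedtest.tele2.net/", ftp.use.epsv = FALSE,
       dirlistonly = TRUE, write = reader$read)
reader$lines()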
Source: https://stackoverflow.com/questions/16699856/get-website-directory-listing-in-an-r-vector-using-rcurl