I have a lot of files I need to download.
I am using the download.file() function and furrr::map to download in parallel, with plan(strategy = "multicore").
How can I load more jobs into each future?
Running on Ubuntu 18.04 with 8 cores. R version 3.5.3.
The files can be txt, zip or any other format. Sizes range from 5 MB to 40 MB each.
Using furrr works just fine. I think what you mean is furrr::future_map. Using multicore substantially increases the downloading speed. (Note: on Windows, multicore is not available, only multisession. Use multiprocess if you are unsure what platform your code will be run on; newer releases of future deprecate multiprocess in favor of multisession.)
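The platform note above can be sketched in a few lines. choose_strategy() is a hypothetical helper, not part of future or furrr, and it assumes that every Unix-like platform can fork, which ignores cases such as RStudio where forking is discouraged.

```r
# Hypothetical helper (not part of future or furrr): pick a strategy name
# from the OS type reported by R. Assumes every Unix-like platform can fork.
choose_strategy <- function(os_type = .Platform$OS.type) {
  if (identical(os_type, "unix")) "multicore" else "multisession"
}

strategy <- choose_strategy()

# Apply the plan only if future is installed; plan() accepts a strategy name.
if (requireNamespace("future", quietly = TRUE)) {
  future::plan(strategy)
}
```

On the Ubuntu machine from the question this would select multicore; on Windows it would select multisession.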
library(furrr)
#> Loading required package: future

csv_file <- "https://raw.githubusercontent.com/UofTCoders/rcourse/master/data/iris.csv"

# Download the CSV to a uniquely named temporary file
download_template <- function(.x) {
    temp_file <- tempfile(pattern = paste0("dl-", .x, "-"), fileext = ".csv")
    download.file(url = csv_file, destfile = temp_file)
}

# Sequential baseline
download_normal <- function() {
    for (i in 1:5) {
        download_template(i)
    }
}

# Parallel download with forked processes
download_future_core <- function() {
    plan(multicore)
    future_map(1:5, download_template)
}

# Parallel download with separate background R sessions
download_future_session <- function() {
    plan(multisession)
    future_map(1:5, download_template)
}

microbenchmark::microbenchmark(
    download_normal(), download_future_core(), download_future_session(),
    times = 3
)
#> Unit: milliseconds
#>                       expr       min        lq      mean    median        uq       max neval
#>          download_normal()  931.2587  935.0187  937.2114  938.7787  940.1877  941.5968     3
#>     download_future_core()  433.0860  435.1674  488.5806  437.2489  516.3279  595.4069     3
#>  download_future_session() 1894.1569 1903.4256 1919.1105 1912.6942 1931.5873 1950.4803     3
Created on 2019-03-25 by the reprex package (v0.2.1)
Keep in mind, I am using Ubuntu, so using Windows will likely change things, since as far as I understand future doesn't allow multicore on Windows.
I am just guessing here, but multisession could be slower because it has to start several R sessions before running the download.file function. I was only downloading a very small dataset (iris.csv), so on larger files that take longer to download, the cost of starting the R sessions should matter less relative to the download time.
Minor update:
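You can pass a vector of URLs to future_map so that each file is downloaded by whichever worker the future package assigns it to. Note that download.file has no default for destfile, so it has to be supplied for each URL:
data_urls <- c("https:.../data.csv", "https:.../data2.csv")
future_map(data_urls, ~ download.file(.x, destfile = basename(.x)))
# Or use future_walk, since only the side effect (the saved file) matters
# future_walk(data_urls, ~ download.file(.x, destfile = basename(.x)))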
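On the original question of loading more jobs into each future: by default furrr already splits the input into one chunk per worker, and the chunking can be tuned. Below is a minimal sketch, assuming furrr 0.2.0 or later, where furrr_options() has a chunk_size argument (older releases used future_options() with scheduling instead). download_one(), dest_dir, the worker count and the chunk size are illustrative choices. The inputs are small local files exposed as file:// URLs so the sketch runs offline on a Unix-like system; they stand in for the real URLs.

```r
library(furrr)  # attaches future as well

# Stand-in inputs: small local files exposed as file:// URLs (Unix-like paths
# assumed). Replace data_urls with the real https:// URLs.
src_dir <- tempfile("src-")
dir.create(src_dir)
src_files <- file.path(src_dir, sprintf("data%d.txt", 1:6))
for (f in src_files) writeLines("example", f)
data_urls <- paste0("file://", normalizePath(src_files))

# Hypothetical destination directory, one target file per URL
dest_dir <- tempfile("downloads-")
dir.create(dest_dir)
dest_files <- file.path(dest_dir, basename(data_urls))

# Hypothetical wrapper: TRUE on success, FALSE if download.file() errors
# or returns a non-zero status. mode = "wb" keeps binary files (zip) intact.
download_one <- function(url, dest) {
  tryCatch(
    download.file(url, destfile = dest, mode = "wb", quiet = TRUE) == 0,
    error = function(e) FALSE
  )
}

plan(multicore, workers = 2)

# chunk_size = 3: each future processes three URLs instead of one
ok <- future_map2_lgl(
  data_urls, dest_files, download_one,
  .options = furrr_options(chunk_size = 3)
)
```

The logical vector ok lines up with data_urls, so data_urls[!ok] lists any downloads that failed and could be retried.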