Question
I am new to R and I am trying to build a frequency/severity simulation. Everything is working fine except that it takes about 10 minutes to do 10,000 simulations for each of 700 locations.

For the simulation of one individual location, I get a list of vectors with varying lengths, and I would like to efficiently rbind these vectors, filling in NAs for all non-existing values, so that R returns a data.frame to me. So far, I have used rbind.fill.matrix after converting the vectors in the list to matrices of one row each. However, I am hoping that I could use something like bind_rows (dplyr) or rbind.fill, but I don't know how to transform the vectors into something that I could use for these functions. Thank you in advance for your help!
set.seed(1223)
library(data.table)
numsim = 10
rN.D <- function(numsim) rpois(numsim, 4)                 # frequency: number of events per simulation
rX.D <- function(numsim) rnorm(numsim, mean = 5, sd = 4)  # severity: one draw per event
freqs <- rN.D(numsim)
obs <- lapply(freqs, function(x) rX.D(x))
#obs is the list that I would like to rbind (efficiently!) and have a data.frame returned to me
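For reference, a minimal sketch of the current approach described above (each vector converted to a one-row matrix and then combined with plyr's rbind.fill.matrix); the object names out_plyr and df_plyr are just for illustration:

library(plyr)
# convert each simulated vector to a 1-row matrix, then rbind with NA fill
out_plyr <- do.call(rbind.fill.matrix, lapply(obs, matrix, nrow = 1))
df_plyr <- as.data.frame(out_plyr)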
Answer 1:
We can append NAs at the end to make the length the same for each of the list elements and then do the rbind:
out <- do.call(rbind, lapply(obs, `length<-`, max(lengths(obs))))
as.data.frame(out) # if we need a data.frame as output
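The padding works because assigning a longer length to a vector extends it with NAs, for example:

x <- 1:3
length(x) <- 5
x
# [1]  1  2  3 NA NA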
Or, using tidyverse:

library(tidyverse)
obs %>%
  set_names(seq_along(.)) %>%
  stack %>%
  group_by(ind) %>%
  mutate(Col = paste0("Col", row_number())) %>%
  spread(Col, values)
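As an aside, on tidyr 1.0 or later (an assumption about your setup), spread is superseded by pivot_wider, so the same reshaping could be written as:

obs %>%
  set_names(seq_along(.)) %>%
  stack %>%
  group_by(ind) %>%
  mutate(Col = paste0("Col", row_number())) %>%
  pivot_wider(names_from = Col, values_from = values)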
Answer 2:
"Everything is working fine except that it takes [too long] to do [numsim] simulations"
If your real application uses rnorm or similar, you can make a single call to it:
set.seed(1223)
numsim = 3e5
freqs = rN.D(numsim)                              # number of events per simulation
maxlen = max(freqs)
m = matrix(, maxlen, numsim)                      # one column per simulation, padded to the longest
m[row(m) <= freqs[col(m)]] <- rX.D(sum(freqs))    # draw all severities in a single call
res = as.data.table(t(m))                         # transpose so each simulation becomes a row
I am filling the data the "wrong way" (with each simulation in a column instead of a row) and then transposing, since R fills matrix values in "column-major" order.
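For illustration, matrix() fills down the columns first:

matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6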
If you need to use lapply, here's a benchmark for the final step:
set.seed(1223)
library(dplyr); library(tidyr); library(purrr)
library(data.table)
numsim = 3e5
rN.D <- function(numsim) rpois(numsim, 4)
rX.D <- function(numsim) rnorm(numsim, mean = 5, sd = 4)
freqs <- rN.D(numsim)
obs <- lapply(freqs, function(x) rX.D(x))
system.time({
  tidyres = obs %>%
    set_names(seq_along(.)) %>%
    stack %>%
    group_by(ind) %>%
    mutate(Col = paste0("Col", row_number())) %>%
    spread(Col, values)
})
# user system elapsed
# 16.56 0.31 16.88
system.time({
  out <- do.call(rbind, lapply(obs, `length<-`, max(lengths(obs))))
  bres = as.data.frame(out)
})
# user system elapsed
# 0.50 0.05 0.55
system.time(
  dtres <- setDT(transpose(obs))
)
# user system elapsed
# 0.03 0.01 0.05
The last approach is by far the fastest of the three (the first two are from @akrun's answer).
Comment. I would recommend using only data.table or tidyverse. Mixing and matching will get messy very quickly. When I was setting this example up, I saw that purrr has its own transpose function, so if you loaded the packages in a different order, code like this could give different results without warning.
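One way to guard against that kind of masking (a small addition, not part of the original answer) is to call the function with an explicit namespace, so the result does not depend on package load order:

dtres <- data.table::setDT(data.table::transpose(obs))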
Source: https://stackoverflow.com/questions/51486317/rbind-list-of-vectors-with-differing-lengths