R bootstrap statistics by group for big data

Posted by 烂漫一生 on 2019-12-21 02:36:22

Question


I want to bootstrap a data set that has groups in it. A simple scenario would be bootstrapping simple means:

library(data.table)
library(boot)
data <- as.data.table(list(x1 = runif(200), x2 = runif(200), group = runif(200)>0.5))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2)), by = "group"]}
boot(data, stat, R = 10)

This gives me the error "incorrect number of subscripts on matrix", caused by the by = "group" part. I managed to work around it by subsetting, but I don't like that solution. Is there a simpler way to make this kind of task work?

In particular, I'd like to add an argument to the statistic function, as in stat(x, i, groupvar), and pass it through the boot call, something like boot(data, stat(groupvar = group), R = 100).


Answer 1:


This should do it:

data[, list(list(boot(.SD, stat, R = 10))), by = group]$V1



Answer 2:


Using these package versions:

 boot       * 1.3-18  2016-02-23 CRAN (R 3.2.3)                        
 data.table * 1.9.7   2015-10-05 Github (Rdatatable/data.table@d607425)

I received an error using the OP's code with the answer supplied by @eddi:

data <- as.data.table(list(x1 = runif(200), x2 = runif(200), group = runif(200)>0.5))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2)), by = "group"]}
data[, list(list(boot(.SD, stat, R = 10))), by = group]$V1

Produces the error message:

Error in eval(expr, envir, enclos) : object 'group' not found 

The error occurs because .SD does not contain the grouping column when by = group is used, so stat cannot find group. Removing by = "group" from stat fixes it:

set.seed(1000)
data <- as.data.table(list(x1 = runif(200), x2 = runif(200), group = runif(200)>0.5))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2))]}
data[, list(list(boot(.SD, stat, R = 10))), by = group]$V1

This produces the following bootstrap statistics:

[[1]]

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = .SD, statistic = stat, R = 10)


Bootstrap Statistics :
     original       bias    std. error
t1* 0.5158232  0.004930451  0.01576641
t2* 0.5240713 -0.001851889  0.02851483

[[2]]

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = .SD, statistic = stat, R = 10)


Bootstrap Statistics :
     original        bias    std. error
t1* 0.5142383 -0.0072475030  0.02568692
t2* 0.5291694 -0.0001509404  0.02378447

Below, I modify the sample dataset to show which bootstrap statistic goes with which group and column. Group 1 has a mean of about 10 for x1 and about 10000 for x2; group 2 has a mean of about 2000 for x1 and about 8000 for x2:

data2 <- as.data.table(list(x1 = c(runif(100, 9,11),runif(100, 1999,2001)), x2 = c(runif(100, 9999,10001),runif(100, 7999,8001)), group = rep(c(1,2), each=100)))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2))]}
data2[, list(list(boot(.SD, stat, R = 10))), by = group]$V1

Which gives:

[[1]]

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = .SD, statistic = stat, R = 10)


Bootstrap Statistics :
      original       bias    std. error
t1*   10.00907  0.007115938  0.04349184
t2* 9999.90176 -0.019569568  0.06160653

[[2]]

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = .SD, statistic = stat, R = 10)


Bootstrap Statistics :
    original       bias    std. error
t1* 1999.965  0.031694179  0.06561209
t2* 8000.110 -0.006569872  0.03992401
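
The list returned by $V1 carries no group labels, so the output above has to be matched to groups by position. A minimal sketch of one way to keep the labels, assuming the same data2 and stat as above; the column name b and the table summary_dt are illustrative names, not part of the original answer:

```r
library(boot)
library(data.table)

set.seed(1000)
data2 <- data.table(
  x1 = c(runif(100, 9, 11), runif(100, 1999, 2001)),
  x2 = c(runif(100, 9999, 10001), runif(100, 7999, 8001)),
  group = rep(c(1, 2), each = 100)
)

# Statistic for one group's rows; x is .SD, which has no group column
stat <- function(x, i) x[i, c(m1 = mean(x1), m2 = mean(x2))]

# Keep the group key next to each boot object instead of dropping it with $V1
res <- data2[, list(b = list(boot(.SD, stat, R = 10))), by = group]

# One row per group and statistic: point estimate and bootstrap standard error
summary_dt <- res[, list(
  statistic = c("m1", "m2"),
  original = as.numeric(b[[1]]$t0),
  std_error = apply(b[[1]]$t, 2, sd)
), by = group]
print(summary_dt)
```

Here t0 holds the statistic on the original rows and t holds the R bootstrap replicates, one column per statistic, so the column-wise standard deviation of t should correspond to the "std. error" column that print shows for a boot object.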



Answer 3:


Lots of problems in your code before you even get to the by group part.

Did you mean something like this?

data <- as.data.frame(list(x1 = runif(200), x2 = runif(200), group = factor(sample(letters[1:2]))))
stat <- function(x, i)  c(m1 = mean(x$x1[i]), m2 = mean(x$x2[i]))

> stat(data, 1:10)
       m1        m2 
0.4465738 0.5522221 

Then from there you can worry about doing it by group however you choose to.

For instance:

library(plyr)
dlply( data, .(group), function( dat ) boot(dat, stat, R=10) )
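
The question also asked about passing an extra argument such as groupvar to the statistic. A minimal sketch of one way this might be done in a single boot call, relying on two documented features of boot: named arguments in ... are forwarded to the statistic, and strata restricts resampling to within each group. The function stat_by_group and its groupvar argument are hypothetical names chosen for illustration:

```r
library(boot)

set.seed(1)
data <- data.frame(
  x1 = runif(200),
  x2 = runif(200),
  group = factor(sample(letters[1:2], 200, replace = TRUE))
)

# groupvar is a hypothetical extra argument naming the grouping column
stat_by_group <- function(x, i, groupvar) {
  d <- x[i, ]
  # Column means of x1 and x2 within each group, flattened to one named vector
  unlist(lapply(split(d[c("x1", "x2")], d[[groupvar]]), colMeans))
}

# strata resamples within each group; groupvar is forwarded to stat_by_group
b <- boot(data, stat_by_group, R = 100, strata = data$group, groupvar = "group")
print(b$t0)
```

Because the resampling is stratified, every replicate keeps the original group sizes, and the result is one boot object with four statistics rather than one object per group.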

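For bigger datasets, try data.table. Here data is assumed to be a data.table, and myboot is a wrapper around the boot call that this answer does not define: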

by( seq(nrow(data)), data$group, function(idx) myboot(data[idx,]))

I went with by() rather than data.table's by= argument because you want the output to be a list. There may be some functionality I don't know about for doing that, but I couldn't find it (see the edit history for the problem it was causing).

The subsetting is still done via the data.table's [] method, so it should be plenty fast.
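
A self-contained sketch of the by() approach, with the missing pieces filled in as assumptions: the table is named DT here, and myboot is guessed to be a thin wrapper around boot using the stat function from earlier in this answer.

```r
library(boot)
library(data.table)

set.seed(1)
DT <- data.table(
  x1 = runif(200),
  x2 = runif(200),
  group = sample(letters[1:2], 200, replace = TRUE)
)

stat <- function(x, i) c(m1 = mean(x$x1[i]), m2 = mean(x$x2[i]))

# myboot is not defined in the answer; this wrapper is an assumption
myboot <- function(dat) boot(dat, stat, R = 10)

# by() splits the row indices by group and returns a list-like "by" object
res <- by(seq_len(nrow(DT)), DT$group, function(idx) myboot(DT[idx, ]))
print(res[[1]])
```

Each element of res is an ordinary boot object for one group, so the usual accessors such as $t0 and $t apply to it.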



Source: https://stackoverflow.com/questions/18913000/r-bootstrap-statistics-by-group-for-big-data
