How to randomly draw from subsets of data and bootstrap a statistic test in R

会有一股神秘感。 提交于 2019-12-12 20:55:30

问题


I have a dataset containing two variables and I wish to statistically test whether they are related in a bootstrap loop (i.e. using Spearman’s rank correction with cor.test(...)).

Most of the measurements in my dataset are from independent sample units (let’s call the units plants), although some measurements come from the same plant. To deal with issues of pseudoreplication, I wish to bootstrap the statistic test a number of times, using only one measurement from each plant in each run of the test. I therefore need to write a bootstrap loop that will randomly draw one measurement for each plant, before performing the correlation test (and then repeat this process 99 times).

I wish to end up with a csv file containing the p-value, rho and S statistic for each of the 99 tests.

Example data:

dput(df)

structure(list(Plant = c(1L, 2L, 3L, 4L, 5L, 6L, 6L, 7L, 8L, 
9L, 10L, 10L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 18L, 
19L, 20L, 21L), Length = c(170L, 232L, 123L, 190L, 112L, 207L, 
93L, 291L, 178L, 206L, 141L, 257L, 304L, 222L, 279L, 192L, 101L, 
253L, 176L, 278L, 311L, 129L, 191L, 205L, 226L), Count = c(7L, 
9L, 5L, 7L, 5L, 6L, 2L, 10L, 6L, 7L, 4L, 8L, 11L, 7L, 8L, 5L, 
5L, 9L, 7L, 6L, 9L, 4L, 5L, 7L, 6L)), .Names = c("Plant", "Length", 
"Count"), class = "data.frame", row.names = c(NA, -25L))


   Plant Length Count
1      1    170     7
2      2    232     9
3      3    123     5
4      4    190     7
5      5    112     5
6      6    207     6
7      6     93     2   
8      7    291    10  etc....

So far, I have put together the below code, which begins by randomly drawing a single row for each plant represented by multiple rows, and combines these values with the rest of the data before running the stats test. However, I am now struggling to incorporate a bootstrapping function (i.e. boot() or bootstrap()) to run the stats test and perform the cycle multiple times:

# 1. create dataframe without plants with >1 measurement/row (in this example plant 6,10 & 18 have multiple rows)
df_uniq = df[ ! df$Plant %in% c(6,10,18), ]

# 2. create data subsets for each plant with >1 measurement/row
dup1 = df[6:7,]
dup2 = df[11:13,] 
dup3 = df[21:22,]

# 3. randomly draw one row for each plant with multiple measurements
d1_draw = dup1[sample(nrow(dup1), 1), ]
d2_draw = dup2[sample(nrow(dup2), 1), ]
d3_draw = dup3[sample(nrow(dup3), 1), ]

# 4. merge df_uniq with randomly drawn rows for each plant with multiple measurements
df_merge = rbind(df_uniq, d1_draw, d2_draw, d3_draw)

# 5. Test whether the two variables (length & Count) are related and write results to file
cor_res <- cor.test(df_merge$Length, df_merge$Count, method= "spearman")
write.csv(matrix(c(cor_res$statistic, cor_res$p.value, cor_res$estimate)), row.names=c("statistic", "p.value", "rho"), "test_output.csv")

I am sure that there is a quick and elegant way to solve the problem. Any assistance would be greatly appreciated! Many thanks.


回答1:


why extract the unique rows in the first place? If there is only one row, then sampling that plant once will result in maintaining that row but still sampling randomly from those with more than one row.

So you can just do:

set.seed(123)
library(plyr)
ddply(df, .(Plant), function(x) { y <- x[sample(nrow(x), 1) ,] })

#   Plant Length Count howmany
#1      1    170     7       1
#2      2    232     9       1
#3      3    123     5       1
#4      4    190     7       1
#5      5    112     5       1
#6      6    207     6       2
#7      7    291    10       1
#8      8    178     6       1
#9      9    206     7       1
#10    10    257     8       3
#11    11    222     7       1
#12    12    279     8       1
#13    13    192     5       1
#14    14    101     5       1
#15    15    253     9       1
#16    16    176     7       1
#17    17    278     6       1
#18    18    311     9       2
#19    19    191     5       1
#20    20    205     7       1
#21    21    226     6       1

and with your cor.test

# first create your own function:
myrandomcors <- function(P){
ss <- ddply(P, .(Plant), function(x) { y <- x[sample(nrow(x), 1) ,] })
cor_res <- cor.test(ss$Length, ss$Count, method= "spearman")
return(c(stat = cor_res$statistic, p = cor_res$p.value, est = cor_res$estimate))
}

# then repeat it 5 times...
answer <- do.call( rbind, replicate(5, myrandomcors(df), simplify=FALSE ) )

#    > answer
#       stat.S            p   est.rho
#[1,] 352.4557 4.275291e-05 0.7711327
#[2,] 461.2733 4.060286e-04 0.7004719
#[3,] 340.2024 3.159626e-05 0.7790893
#[4,] 368.3967 6.227648e-05 0.7607814
#[5,] 342.4391 3.341956e-05 0.7776369


来源:https://stackoverflow.com/questions/20125858/how-to-randomly-draw-from-subsets-of-data-and-bootstrap-a-statistic-test-in-r

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!