Question
I have a table with 40+ columns and 200,000+ rows. Something like this:
ID GROUP-A1 GROUP-A2 GROUP A3...A20 GROUP-B1 GROUP-B2 GROUP-B3...B20
1 5 6 3 5....3 10 21 9 15
2 3 4 6 2....13 23 42 34 23
3 5 3 1 0....12 10 12 43 15
4 0 0 2 5....3 10 21 23 15
I would like to run a t-test comparing the two groups A (1..20) and B (1..20) for every measurement I have (each row); the rows are independent. Ideally, the resulting statistics would sit in the table next to each row, or in a separate table, so that I can easily select the significant ones.
I looked at a few R packages, but they would mostly require reformatting my table to put measurements and groups in columns, and I would need 200,000+ separate tables in that case.
Any idea?
Answer 1:
Something like this?
apply(df,1,function(x){t.test(x[2:21],x[22:41])})
To save the test statistic in a new column, you could do
df$st <- apply(df, 1, function(x) t.test(x[2:21], x[22:41])$statistic)
or use $p.value in place of $statistic to store the p-value.
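A self-contained sketch of this approach, using simulated data because the original table is not available. The column layout (ID in column 1, group A in columns 2 to 21, group B in columns 22 to 41) follows the question; the names vals, t_stat and p_value are made up for this example. Converting the measurement columns to a numeric matrix first keeps apply() from coercing every row to character if df also holds non-numeric columns.

```r
set.seed(1)
n_rows <- 100                      # stand-in for the 200,000+ real rows

# Simulated table: ID, then 20 group-A columns, then 20 group-B columns
df <- data.frame(ID = seq_len(n_rows),
                 matrix(rpois(n_rows * 40, lambda = 10), nrow = n_rows))
names(df)[-1] <- c(paste0("GROUP.A", 1:20), paste0("GROUP.B", 1:20))

# Work on the measurement columns only, as a numeric matrix
vals <- as.matrix(df[, 2:41])

# One Welch t-test per row; keep the statistic and the p-value
res <- t(apply(vals, 1, function(x) {
  tt <- t.test(x[1:20], x[21:40])
  c(t_stat = unname(tt$statistic), p_value = tt$p.value)
}))

df$t_stat  <- res[, "t_stat"]
df$p_value <- res[, "p_value"]

# Rows with an unadjusted p-value below 0.05
significant <- df[df$p_value < 0.05, c("ID", "t_stat", "p_value")]
head(significant)
```

With this many tests, some rows will fall below 0.05 by chance alone; p.adjust(df$p_value, method = "BH") is one standard way to adjust the p-values before selecting rows.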
Answer 2:
You can run all tests with the following code.
i_group_a <- grep("GROUP.A", names(df1), ignore.case = TRUE)
i_group_b <- grep("GROUP.B", names(df1), ignore.case = TRUE)
ttest_list <- lapply(seq_along(i_group_a), function(k){
  i <- i_group_a[k]
  j <- i_group_b[k]
  t.test(df1[[i]], df1[[j]])
})
ttest_list[[1]]
#
# Welch Two Sample t-test
#
#data: df1[[i]] and df1[[j]]
#t = -2.8918, df = 3.7793, p-value = 0.04763
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -19.826402 -0.173598
#sample estimates:
#mean of x mean of y
# 3.25 13.25
To extract, for instance, the p-values:
pval <- sapply(ttest_list, `[[`, 'p.value')
pval
#[1] 0.04762593 0.04449075 0.04390115 0.00192454
Data.
df1 <- read.table(text = "
ID GROUP-A1 GROUP-A2 GROUP-A3 GROUP-A20 GROUP-B1 GROUP-B2 GROUP-B3 GROUP-B20
1 5 6 3 5 10 21 9 15
2 3 4 6 2 23 42 34 23
3 5 3 1 0 10 12 43 15
4 0 0 2 5 10 21 23 15
", header = TRUE)
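To line the results up for filtering, the test objects in ttest_list can be flattened into one data frame. This is only a sketch; it repeats the code from this answer so that it runs on its own, and res and its column names are invented for this example.

```r
df1 <- read.table(text = "
ID GROUP-A1 GROUP-A2 GROUP-A3 GROUP-A20 GROUP-B1 GROUP-B2 GROUP-B3 GROUP-B20
1 5 6 3 5 10 21 9 15
2 3 4 6 2 23 42 34 23
3 5 3 1 0 10 12 43 15
4 0 0 2 5 10 21 23 15
", header = TRUE)

# read.table() turns "GROUP-A1" into "GROUP.A1", which the patterns match
i_group_a <- grep("GROUP.A", names(df1), ignore.case = TRUE)
i_group_b <- grep("GROUP.B", names(df1), ignore.case = TRUE)

# Same tests as in the answer: k-th A column against k-th B column
ttest_list <- lapply(seq_along(i_group_a), function(k) {
  t.test(df1[[i_group_a[k]]], df1[[i_group_b[k]]])
})

# One row per tested column pair
res <- data.frame(
  col_a   = names(df1)[i_group_a],
  col_b   = names(df1)[i_group_b],
  t_stat  = sapply(ttest_list, function(tt) unname(tt$statistic)),
  p_value = sapply(ttest_list, `[[`, "p.value")
)

# Keep only the pairs with p < 0.05
res[res$p_value < 0.05, ]
```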
Answer 3:
You can do this with the tidyverse, using purrr. It does, however, require formatting your data differently. Here is an example:
library(tidyverse)
set.seed(314)
# simulate your data
df <- data.frame(ID = rep(1:5, each = 20),
                 participant = rep(rep(1:10, 2), 5),
                 group = rep(rep(c('A', 'B'), each = 10), 5),
                 answer = sample(1:10, 100, replace = TRUE))
# reshape to the wide layout from the question: one row per ID
dfflat <- df %>%
  unite(column, group, participant) %>%
  spread(column, answer)
dfflat:
ID A_1 A_10 A_2 A_3 A_4 A_5 A_6 A_7 A_8 A_9 B_1 B_10 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9
1 1 1 8 3 8 3 3 4 3 4 6 4 4 2 3 3 6 4 8 6 1
2 2 7 6 5 6 3 1 6 4 1 3 3 6 7 1 5 5 2 10 10 6
3 3 4 3 8 5 9 7 9 7 3 1 8 2 7 6 8 3 5 6 9 4
4 4 5 4 8 2 4 1 4 6 2 2 1 1 7 10 6 9 7 7 10 1
5 5 4 1 5 10 3 5 3 10 8 3 7 3 4 6 6 9 10 7 4 5
The equivalent in long format:
dfflat %>%
  gather(participant, answer, -ID) %>%
  separate(participant, c('group', 'number'))
ID group number answer
1 1 A 1 1
2 2 A 1 7
3 3 A 1 4
4 4 A 1 5
5 5 A 1 4
6 1 A 10 8
7 2 A 10 6
8 3 A 10 3
9 4 A 10 4
10 5 A 10 1
11 1 A 2 3
12 2 A 2 5
13 3 A 2 8
14 4 A 2 8
15 5 A 2 5
16 1 A 3 8
17 2 A 3 6
18 3 A 3 5
19 4 A 3 2
20 5 A 3 10
...
Test the hypothesis with t.test per ID and extract the p.value:
dfflat %>%
  gather(participant, answer, -ID) %>%
  separate(participant, c('group', 'number')) %>%
  group_by(ID) %>%
  nest() %>%
  mutate(test = map(data, ~ with(.x, t.test(answer[group == 'A'], answer[group == 'B']))),
         p.value = map_dbl(test, pluck, 'p.value'))
results in:
# A tibble: 5 x 4
ID data test p.value
<int> <list> <list> <dbl>
1 1 <tibble [20 x 3]> <S3: htest> 0.841
2 2 <tibble [20 x 3]> <S3: htest> 0.284
3 3 <tibble [20 x 3]> <S3: htest> 0.863
4 4 <tibble [20 x 3]> <S3: htest> 0.137
5 5 <tibble [20 x 3]> <S3: htest> 0.469
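gather() and spread() are superseded in current tidyr releases. A possible equivalent with pivot_longer() is sketched below, assuming tidyr 1.0.0 or later and dplyr are installed. The data frame wide is a hypothetical stand-in for dfflat, built directly so the example runs on its own, which means its values differ from the table shown in this answer.

```r
library(dplyr)
library(tidyr)

set.seed(314)
# Wide table: one row per ID, columns A_1..A_10 and B_1..B_10
wide <- data.frame(ID = 1:5,
                   matrix(sample(1:10, 100, replace = TRUE), nrow = 5))
names(wide)[-1] <- c(paste0("A_", 1:10), paste0("B_", 1:10))

result <- wide %>%
  # Long format: one row per ID, group and participant number
  pivot_longer(-ID,
               names_to = c("group", "number"),
               names_sep = "_",
               values_to = "answer") %>%
  group_by(ID) %>%
  # One Welch t-test per ID, keeping only the p-value
  summarise(p.value = t.test(answer[group == "A"],
                             answer[group == "B"])$p.value)

result
```

summarise() keeps just the p-value here; the nest() and map() version in the answer is the better fit when the full test object is needed later.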
Source: https://stackoverflow.com/questions/57990378/t-test-for-multiple-rows-in-r