T-test for multiple rows in R

Question

I have a table with 40+ columns and 200,000+ rows. Something like this:

ID GROUP-A1 GROUP-A2 GROUP A3...A20   GROUP-B1 GROUP-B2 GROUP-B3...B20
1  5        6        3     5....3     10       21       9          15
2  3        4        6     2....13    23       42       34         23
3  5        3        1     0....12    10       12       43         15 
4  0        0        2     5....3     10       21       23         15

I would like to run a t-test comparing the two groups, A (1..20) and B (1..20), for every measurement I have (each row); the measurements are independent. Ideally, the resulting statistics would go in the table next to each row or in a separate table, so I can easily select the significant ones.

I looked at a few R packages, but most of them would require reformatting the table to put measurements and groups in columns, and I would need 200,000+ separate tables in that case.

Any idea?

Answer 1:

Something like this?

apply(df,1,function(x){t.test(x[2:21],x[22:41])})

To save the test statistic in a new column, you could do

df$st=apply(df,1,function(x){t.test(x[2:21],x[22:41])$statistic})

or use $p.value instead of $statistic to store the p-value.
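
A slightly fuller sketch of the same idea, assuming the layout from the question (column 1 is the ID, the next 20 columns are group A, the last 20 are group B). The simulated data frame `df` and the helper name `row_ttest` are made up for illustration. Collecting both values per row in a separate table makes it easy to filter afterwards:

```r
set.seed(1)

# Hypothetical stand-in for the real table: 6 rows, ID + 20 A + 20 B columns
df <- data.frame(ID = 1:6, matrix(rpois(6 * 40, lambda = 8), nrow = 6))
names(df)[-1] <- c(paste0("GROUP.A", 1:20), paste0("GROUP.B", 1:20))

# Run Welch's t-test on one row and keep only the numbers of interest
row_ttest <- function(x) {
  res <- t.test(x[1:20], x[21:40])
  c(statistic = unname(res$statistic), p.value = res$p.value)
}

# Drop the ID column first so apply() works on a purely numeric matrix
stats <- t(apply(as.matrix(df[, -1]), 1, row_ttest))
results <- data.frame(ID = df$ID, stats)

# Rows where the two groups differ at the 5% level
significant <- results[results$p.value < 0.05, ]
```

Removing the ID column before calling `apply()` keeps the index arithmetic simple and avoids the ID being coerced along with the measurements.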

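Answer 2:

You can run all tests with the following code. It pairs each GROUP-A column with the matching GROUP-B column (A1 with B1, A2 with B2, and so on) and runs one t-test per pair.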

i_group_a <- grep("GROUP.A", names(df1), ignore.case = TRUE)
i_group_b <- grep("GROUP.B", names(df1), ignore.case = TRUE)

ttest_list <- lapply(seq_along(i_group_a), function(k){
  i <- i_group_a[k]
  j <- i_group_b[k]
  t.test(df1[[i]], df1[[j]])
})

ttest_list[[1]]
#
#   Welch Two Sample t-test
#
#data:  df1[[i]] and df1[[j]]
#t = -2.8918, df = 3.7793, p-value = 0.04763
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -19.826402  -0.173598
#sample estimates:
#mean of x mean of y 
#     3.25     13.25

To extract, for instance, the p-values:

pval <- sapply(ttest_list, `[[`, 'p.value')
pval
#[1] 0.04762593 0.04449075 0.04390115 0.00192454

Data.

df1 <- read.table(text = "
ID GROUP-A1 GROUP-A2 GROUP-A3 GROUP-A20   GROUP-B1 GROUP-B2 GROUP-B3   GROUP-B20
1  5        6        3        5           10       21       9          15
2  3        4        6        2           23       42       34         23
3  5        3        1        0           10       12       43         15 
4  0        0        2        5           10       21       23         15
", header = TRUE)
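
The code above pairs each GROUP-A column with its GROUP-B counterpart and tests down the rows. If the goal is instead one test per row, as in the question, the same `grep()` indices can be reused. A minimal sketch, with the sample data repeated so the block runs on its own; `pval_by_row` is an illustrative name:

```r
df1 <- read.table(text = "
ID GROUP-A1 GROUP-A2 GROUP-A3 GROUP-A20 GROUP-B1 GROUP-B2 GROUP-B3 GROUP-B20
1  5        6        3        5         10       21       9        15
2  3        4        6        2         23       42       34       23
3  5        3        1        0         10       12       43       15
4  0        0        2        5         10       21       23       15
", header = TRUE)

# read.table() turns "GROUP-A1" into "GROUP.A1", so these patterns match
i_group_a <- grep("GROUP.A", names(df1), ignore.case = TRUE)
i_group_b <- grep("GROUP.B", names(df1), ignore.case = TRUE)

# One Welch t-test per row: all A values of the row against all B values
pval_by_row <- sapply(seq_len(nrow(df1)), function(r) {
  a <- unlist(df1[r, i_group_a])
  b <- unlist(df1[r, i_group_b])
  t.test(a, b)$p.value
})

# Keep the p-values next to the data
df1$p.value <- pval_by_row
```

For a table with 200,000+ rows, converting the group columns to matrices once before looping would likely be faster than subsetting the data frame on every iteration.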

Answer 3:

You can do this with the tidyverse using purrr. It does, however, require formatting your data differently. Here is an example:

library(tidyverse)
set.seed(314)

Simulate your data:

df <- data.frame(ID = rep(1:5,each = 20),
                 participant = rep(rep(1:10,2),5),
                 group = rep(rep(c('A','B'),each = 10),5),
                 answer = sample(1:10,100, replace = TRUE))

dfflat <- df %>% 
  unite(column, group,participant) %>%
  spread(column,answer)

dfflat:

  ID A_1 A_10 A_2 A_3 A_4 A_5 A_6 A_7 A_8 A_9 B_1 B_10 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9
1  1   1    8   3   8   3   3   4   3   4   6   4    4   2   3   3   6   4   8   6   1
2  2   7    6   5   6   3   1   6   4   1   3   3    6   7   1   5   5   2  10  10   6
3  3   4    3   8   5   9   7   9   7   3   1   8    2   7   6   8   3   5   6   9   4
4  4   5    4   8   2   4   1   4   6   2   2   1    1   7  10   6   9   7   7  10   1
5  5   4    1   5  10   3   5   3  10   8   3   7    3   4   6   6   9  10   7   4   5

The equivalent in long format:

dfflat %>%
  gather(participant,answer,-ID) %>%
  separate(participant,c('group','number'))

    ID group number answer
1    1     A      1      1
2    2     A      1      7
3    3     A      1      4
4    4     A      1      5
5    5     A      1      4
6    1     A     10      8
7    2     A     10      6
8    3     A     10      3
9    4     A     10      4
10   5     A     10      1
11   1     A      2      3
12   2     A      2      5
13   3     A      2      8
14   4     A      2      8
15   5     A      2      5
16   1     A      3      8
17   2     A      3      6
18   3     A      3      5
19   4     A      3      2
20   5     A      3     10
...

Test the hypothesis with t.test per ID and extract the p-value:

dfflat %>%
  gather(participant,answer,-ID) %>%
  separate(participant,c('group','number')) %>%
  group_by(ID) %>%
  nest() %>%
  mutate(test = map(data, ~ with(.x, t.test(answer[group == 'A'],answer[group == 'B']))),
         p.value = map_dbl(test,pluck,'p.value'))

This results in:

# A tibble: 5 x 4
     ID data              test        p.value
  <int> <list>            <list>        <dbl>
1     1 <tibble [20 x 3]> <S3: htest>   0.841
2     2 <tibble [20 x 3]> <S3: htest>   0.284
3     3 <tibble [20 x 3]> <S3: htest>   0.863
4     4 <tibble [20 x 3]> <S3: htest>   0.137
5     5 <tibble [20 x 3]> <S3: htest>   0.469
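
For comparison, the same per-ID test can be sketched in base R once the data are in long format, using the formula interface of `t.test()`. This assumes a long data frame with `ID`, `group` and `answer` columns like the one simulated above; the names `long` and `p_by_id` are illustrative, and the simulated values are not meant to reproduce the p-values shown:

```r
set.seed(314)

# Hypothetical long-format data: 5 IDs, 10 answers per group and ID
long <- data.frame(ID = rep(1:5, each = 20),
                   group = rep(rep(c("A", "B"), each = 10), 5),
                   answer = sample(1:10, 100, replace = TRUE))

# Split by ID, then test answer against group within each piece
p_by_id <- sapply(split(long, long$ID), function(d) {
  t.test(answer ~ group, data = d)$p.value
})

# Named vector of p-values, one per ID
p_by_id
```

This avoids the nested list columns, at the cost of keeping only the p-value rather than the full test object.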

Source: https://stackoverflow.com/questions/57990378/t-test-for-multiple-rows-in-r

Tags

statistics

t-test