Question
I have a list with multiple entries, an example entry looks like:
> head(gene_sets[[1]])
     patient Diagnosis Eigen_gene ENSG00000080824 ENSG00000166165 ENSG00000211459 ENSG00000198763 ENSG00000198938 ENSG00000198886
1 689_120604        AD -0.5606425           50137           38263          309298          528233          523420          730537
2 412_120503        AD  0.9454632           44536           23333          404316          730342          765963         1168123
3 706_120605        AD  0.6061834           16647           22021          409498          614314          762878         1171747
4 486_120515        AD  0.8164779           21871            9836          518046          697051          613621         1217262
5 469_120514        AD  0.5354927           33460           11651          468223          653745          608259         1115973
6 369_120502        AD -0.8363372           32168           44760          271978          436132          513194          784537
For these entries, the first three columns are always consistent and the total number of columns varies.
What I would like to do is convert this entire list into a single dataframe. The information I need to retain is set_index, being the index of the entry in the list, and then all the column names after Eigen_gene up to and including the last column.
I can think of solutions using loops, however I would like a dplyr/reshape solution.
To clarify, if we had a fake input that looked like:
> list(data.frame(patient= c(1,2,3), Diagnosis= c("AD","Control", "AD"), Eigen_gene= c(1.1, 2.3, 4.3), geneA= c(1,1,1), geneC= c(2,1,3), geneB= c(2,39,458)))
[[1]]
  patient Diagnosis Eigen_gene geneA geneC geneB
1       1        AD        1.1     1     2     2
2       2   Control        2.3     1     1    39
3       3        AD        4.3     1     3   458
The desired output would look like this (I have only shown an example of the first list entry for input, the output shows how other entries in the list would also be formatted):
> data.frame(set_index= c(1,1,1,2,2,2,3,3), gene= c("geneA", "geneC", "geneB", "geneF", "geneE", "geneH", "geneT", "geneZ"))
  set_index  gene
1         1 geneA
2         1 geneC
3         1 geneB
4         2 geneF
5         2 geneE
6         2 geneH
7         3 geneT
8         3 geneZ
Thanks!
Answer 1:
Here is a solution using purrr from the tidyverse. I extended the example input to produce the example output. The key function here is imap, which for an unnamed list is shorthand for map2(x, seq_along(x), ...) (for a named list it passes names(x) instead of the positions). See the help for more. What we want to do is apply a function to each dataframe in the list and its index. So we use the function ~ tibble(set_index = .y, gene = colnames(.x[4:ncol(.x)])).
- ~, .x and .y are purrr shorthands for function(x, y), x and y. This lets us refer to the arguments of the function compactly. See ?map2.
- set_index = .y creates the first column and fills it with the index of the current dataframe (it is usefully repeated to be the right length).
- gene = colnames(.x[4:ncol(.x)]) creates the second column from a vector of the gene names. colnames gets the variable names of the data frame, but we subset to exclude the first three.
- If we had just used imap, we would get a list of data frames. imap_dfr takes that list and binds the data frames together as rows, producing our desired output (equivalent to calling bind_rows afterwards).
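The points in the list above can be checked on a small input before running the full example. This is only a minimal sketch, assuming purrr, tibble and dplyr are installed; toy_list and the object names are made up for illustration. It compares the formula shorthand with an explicit map2() call and then binds the pieces by row.

```r
library(purrr)
library(tibble)
library(dplyr)

# Hypothetical toy input: two unnamed data frames sharing three fixed columns.
toy_list <- list(
  data.frame(patient = 1:2, Diagnosis = c("AD", "Control"),
             Eigen_gene = c(0.1, 0.2), geneA = 1:2, geneB = 3:4),
  data.frame(patient = 1:2, Diagnosis = c("AD", "AD"),
             Eigen_gene = c(0.3, 0.4), geneC = 5:6)
)

# Formula shorthand: .x is the data frame, .y is its position in the list.
via_formula <- imap(
  toy_list,
  ~ tibble(set_index = .y, gene = colnames(.x[4:ncol(.x)]))
)

# The same call written out with map2() and an ordinary function.
via_map2 <- map2(
  toy_list, seq_along(toy_list),
  function(x, y) tibble(set_index = y, gene = colnames(x[4:ncol(x)]))
)

# Compare the two lists of tibbles.
same_result <- identical(via_formula, via_map2)

# Binding the list of tibbles by row gives one two-column tibble.
stacked <- bind_rows(via_formula)
stacked
```

As far as I know, the _dfr variants are marked as superseded in purrr 1.0.0 and later in favour of list_rbind(); imap() followed by bind_rows(), as sketched here, should behave the same on older and newer versions.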
library(tidyverse)

gene_list <- list(
  data.frame(
    patient = c(1, 2, 3),
    Diagnosis = c("AD", "Control", "AD"),
    Eigen_gene = c(1.1, 2.3, 4.3),
    geneA = c(1, 1, 1),
    geneC = c(2, 1, 3),
    geneB = c(2, 39, 458)
  ),
  data.frame(
    patient = c(1, 2, 3),
    Diagnosis = c("AD", "Control", "AD"),
    Eigen_gene = c(1.1, 2.3, 4.3),
    geneF = c(1, 1, 1),
    geneE = c(2, 1, 3),
    geneH = c(2, 39, 458)
  ),
  data.frame(
    patient = c(1, 2, 3),
    Diagnosis = c("AD", "Control", "AD"),
    Eigen_gene = c(1.1, 2.3, 4.3),
    geneT = c(1, 1, 1),
    geneZ = c(2, 1, 3)
  )
)

output <- gene_list %>%
  imap_dfr(~ tibble(set_index = .y, gene = colnames(.x[4:ncol(.x)])))
output
#> # A tibble: 8 x 2
#>   set_index gene
#>       <int> <chr>
#> 1         1 geneA
#> 2         1 geneC
#> 3         1 geneB
#> 4         2 geneF
#> 5         2 geneE
#> 6         2 geneH
#> 7         3 geneT
#> 8         3 geneZ
Created on 2018-03-02 by the reprex package (v0.2.0).
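Two edge cases are worth keeping in mind with the approach above. If the list is named, imap passes the names rather than the positions as .y, so set_index would come out as character. And if an entry has only the three fixed columns, 4:ncol(.x) counts down (4:3) and the column subsetting fails. The following is a hedged sketch of one way to guard against both; gene_index and its n_fixed argument are hypothetical names introduced here, not part of purrr or dplyr.

```r
library(purrr)
library(tibble)
library(dplyr)

# Hypothetical helper: list the column names after the first n_fixed columns,
# tagged with the position of each data frame in the list.
gene_index <- function(sets, n_fixed = 3) {
  sets %>%
    unname() %>%  # drop list names so .y is always the integer position
    imap(~ tibble(
      set_index = .y,
      # logical subsetting yields character(0) when there are no extra columns
      gene = names(.x)[seq_along(.x) > n_fixed]
    )) %>%
    bind_rows()
}

# Made-up input: a named list where the second entry has no gene columns.
sets <- list(
  first = data.frame(patient = 1, Diagnosis = "AD", Eigen_gene = 0.5,
                     geneA = 1, geneB = 2),
  empty = data.frame(patient = 1, Diagnosis = "AD", Eigen_gene = 0.5),
  third = data.frame(patient = 1, Diagnosis = "AD", Eigen_gene = 0.5,
                     geneT = 3)
)

result <- gene_index(sets)
result
```

The entry without gene columns contributes zero rows, so its position (2) does not appear in set_index while the third entry still keeps index 3.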
Source: https://stackoverflow.com/questions/49076510/r-dplyr-convert-a-list-of-dataframes-into-a-single-organized-dataframe