问题
I am hoping to determine an efficient way to convert a list of data frames into a single data frame. Below is my reproducible MWE:
set.seed(1)
ABAge = runif(100)
ABPoints = rnorm(100)
ACAge = runif(100)
ACPoints = rnorm(100)
BCAge = runif(100)
BCPoints = rnorm(100)
A_B <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = ABAge, Points = ABPoints)
A_C <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = ACAge, Points = ACPoints)
B_C <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = BCAge, Points = BCPoints)
A_B$ID <- as.character(A_B$ID)
A_C$ID <- as.character(A_C$ID)
B_C$ID <- as.character(B_C$ID)
listFormat <- list("A_B" = A_B, "A_C" = A_C, "B_C" = B_C)
dfFormat <- data.frame(ID = as.character(paste0("ID", 1:100)), A_B.Age = ABAge, A_B.Points = ABPoints, A_C.Age = ACAge, A_C.Points = ACPoints, B_C.Age = BCAge, B_C.Points = BCPoints)
dfFormat$ID = as.character(dfFormat$ID)
This results in a data frame format (dfFormat
) that looks like this:
'data.frame': 100 obs. of 7 variables:
$ ID : chr "ID1" "ID2" "ID3" "ID4" ...
$ A_B.Age : num 0.266 0.372 0.573 0.908 0.202 ...
$ A_B.Points: num 0.398 -0.612 0.341 -1.129 1.433 ...
$ A_C.Age : num 0.6737 0.0949 0.4926 0.4616 0.3752 ...
$ A_C.Points: num 0.409 1.689 1.587 -0.331 -2.285 ...
$ B_C.Age : num 0.814 0.929 0.147 0.75 0.976 ...
$ B_C.Points: num 1.474 0.677 0.38 -0.193 1.578 ...
and a list of data frames listFormat
that looks like this:
List of 3
$ A_B:'data.frame': 100 obs. of 3 variables:
..$ ID : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
..$ Age : num [1:100] 0.266 0.372 0.573 0.908 0.202 ...
..$ Points: num [1:100] 0.398 -0.612 0.341 -1.129 1.433 ...
$ A_C:'data.frame': 100 obs. of 3 variables:
..$ ID : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
..$ Age : num [1:100] 0.6737 0.0949 0.4926 0.4616 0.3752 ...
..$ Points: num [1:100] 0.409 1.689 1.587 -0.331 -2.285 ...
$ B_C:'data.frame': 100 obs. of 3 variables:
..$ ID : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
..$ Age : num [1:100] 0.814 0.929 0.147 0.75 0.976 ...
..$ Points: num [1:100] 1.474 0.677 0.38 -0.193 1.578 ...
I am hoping to come up with an automated way to convert the dfFormat
to listFormat
. As can be seen in the above objects there are two main conditions:
1) If there is a common column (name and contents) in each sublist of listFormat
(in these examples ID
), then they are not repeated in the outputted dfFormat
(in this example, it has one final ID
column),
2) The rest of the column names in sublists of listFormat
become columns in dfFormat
and have names such that they retain their sublist name (i.e "A_B") followed by a dot and then their original column name (i.e. Age), so that it becomes (i.e. "A_B.Age") in the dfFormat
.
I have tried various unlist()
and sapply
codes but have been unsuccessful thus far. What is an efficient way to accomplish this?
回答1:
Copy listFormat
to L
in case we need to preserve the input, listFormat
. Remove the ID
column from each component of L
except the first, cbind
what we have left together and then fix up the name of the first column. No packages are used.
L <- listFormat
L[-1] <- lapply(L[-1], transform, ID = NULL)
DF <- do.call(cbind, L)
names(DF)[1] <- "ID"
giving:
> str(DF)
'data.frame': 100 obs. of 7 variables:
$ ID : chr "ID1" "ID2" "ID3" "ID4" ...
$ A_B.Age : num 0.9932 0.1451 0.6166 0.0372 0.9039 ...
$ A_B.Points: num 0.4752 0.0288 1.0548 0.6113 0.0651 ...
$ A_C.Age : num 0.912 0.761 0.618 0.895 0.507 ...
$ A_C.Points: num -0.515 -0.945 0.398 0.502 -1.021 ...
$ B_C.Age : num 0.7935 0.2747 0.0487 0.6307 0.3499 ...
$ B_C.Points: num -0.963 -1.772 1.716 -0.819 0.577 ...
回答2:
You're looking for dplyr::bind_rows
:
library(dplyr)
bind_rows(listFormat, .id = "name")
Output:
name ID Age Points
1 A_B ID1 0.2655087 0.3981059
2 A_B ID2 0.3721239 -0.6120264
3 A_B ID3 0.5728534 0.3411197
4 A_B ID4 0.9082078 -1.1293631
5 A_B ID5 0.2016819 1.4330237
6 A_B ID6 0.8983897 1.9803999
回答3:
One purrr
and dplyr
option could be:
imap(listFormat, ~ setNames(.x, paste(rep(.y, length(.x)), names(.x), sep = ".")) %>%
rename_at(vars(ends_with("ID")), ~ "ID")) %>%
reduce(full_join, by = "ID")
ID A_B.Age A_B.Points A_C.Age A_C.Points B_C.Age B_C.Points
1 ID1 0.26550866 0.398105880 0.67371223 0.409401840 0.81425175 1.473881181
2 ID2 0.37212390 -0.612026393 0.09485786 1.688873286 0.92877723 0.677268492
3 ID3 0.57285336 0.341119691 0.49259612 1.586588433 0.14748105 0.379962687
4 ID4 0.90820779 -1.129363096 0.46155184 -0.330907801 0.74982166 -0.192798426
5 ID5 0.20168193 1.433023702 0.37521653 -2.285235535 0.97565735 1.577891795
6 ID6 0.89838968 1.980399899 0.99109922 2.497661590 0.97479246 0.596234109
7 ID7 0.94467527 -0.367221476 0.17635071 0.667066167 0.35062557 -1.173576941
8 ID8 0.66079779 -1.044134626 0.81343521 0.541327336 0.39394906 -0.155642535
9 ID9 0.62911404 0.569719627 0.06844664 -0.013399523 0.95095101 -1.918909820
10 ID10 0.06178627 -0.135054604 0.40044975 0.510108423 0.10664832 -0.195258846
回答4:
Given that each data.frame has identical ID columns, in base R it's pretty easy.
as.data.frame(listFormat)
# A_B.ID A_B.Age A_B.Points A_C.ID A_C.Age A_C.Points B_C.ID B_C.Age B_C.Points
# 1 ID1 0.2655087 0.3981059 ID1 0.67371223 0.4094018 ID1 0.8142518 1.4738812
# 2 ID2 0.3721239 -0.6120264 ID2 0.09485786 1.6888733 ID2 0.9287772 0.6772685
# 3 ID3 0.5728534 0.3411197 ID3 0.49259612 1.5865884 ID3 0.1474810 0.3799627
# 4 ID4 0.9082078 -1.1293631 ID4 0.46155184 -0.3309078 ID4 0.7498217 -0.1927984
# 5 ID5 0.2016819 1.4330237 ID5 0.37521653 -2.2852355 ID5 0.9756573 1.5778918
# 6 ID6 0.8983897 1.9803999 ID6 0.99109922 2.4976616 ID6 0.9747925 0.5962341
You get an ID column for each data.frame, but this can then be easily tidied up
In case you need a more general solution for situations where the id columns of each data.frame differ, then you can do the following using library(data.table)
DTFormat = rbindlist(listFormat, idcol = T)
dcast(DTFormat, ID~.id, value.var = c('Age', 'Points'))
来源:https://stackoverflow.com/questions/60248706/convert-a-list-of-data-frames-into-a-single-data-frame-with-list-name