I have a dataset in which the same variable is stored in different columns for each subject. I want to merge them into the same columns.
E.g., I have this data frame:
For the sake of completeness, here is also a data.table solution using melt() to reshape two measure variables simultaneously:
library(data.table)
cols <- c("DV1", "DV2")
melt(setDT(DF), measure.vars = patterns(cols), value.name = cols, na.rm = TRUE)[
, -"variable"]
   ID FACT DV1 DV2
1:  1    A   1   3
2:  2    B   4   3
3:  3    C   5   5
Now, the six columns have been merged to just two columns as requested by the OP.
However, the OP has given a data.frame with the expected result where the new columns are appended to the existing columns. This can be achieved by joining the above result with the original data frame:
setDT(DF)[melt(DF, measure.vars = patterns(cols), value.name = cols, na.rm = TRUE)[
, -"variable"], on = .(ID, FACT)]
   ID DV1_A DV1_B DV1_C DV2_A DV2_B DV2_C FACT DV1 DV2
1:  1     1    NA    NA     3    NA    NA    A   1   3
2:  2    NA     4    NA    NA     3    NA    B   4   3
3:  3    NA    NA     5    NA    NA     5    C   5   5
You can use coalesce from dplyr:
library(dplyr)
df %>%
  mutate(DV_1 = coalesce(DV1_A, DV1_B, DV1_C),
         DV_2 = coalesce(DV2_A, DV2_B, DV2_C))
If you have a lot of DV columns to combine, you might not want to type all the column names. In this case, you can first grep the column names for each DV, parse each name to symbols with rlang::syms, then splice (!!!) the symbols in coalesce (advice from @hadley):
library(rlang)
var_quo1 = syms(grep("DV1", names(df), value = TRUE))
var_quo2 = syms(grep("DV2", names(df), value = TRUE))
df %>%
  mutate(DV_1 = coalesce(!!! var_quo1),
         DV_2 = coalesce(!!! var_quo2))
If, instead, you have a ton of DVs, you might not even want to type all the coalesce lines. In this case, you can create a function that outputs one DV column given an input number, then lapply + bind_cols all of them together:
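DV_combine = function(num_DVs){
  # Name of the new column, e.g. DV_1 (matches the result shown below)
  DV_name = sym(paste0("DV_", num_DVs))
  # Symbols for every existing column belonging to this DV, e.g. DV1_A, DV1_B, DV1_C
  DV_syms = syms(grep(paste0("DV", num_DVs), names(df), value = TRUE))
  df %>%
    transmute(!!DV_name := coalesce(!!! DV_syms))
}
# Build one combined column per DV and append them to the original data
bind_cols(df, lapply(1:2, DV_combine))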
Result:
ID DV1_A DV1_B DV1_C DV2_A DV2_B DV2_C FACT DV_1 DV_2
1 1 1 NA NA 3 NA NA A 1 3
2 2 NA 4 NA NA 3 NA B 4 3
3 3 NA NA 5 NA NA 5 C 5 5
Note:
This method will work for both numeric and character class columns, but not factors. One should first convert the factor columns to character before using this method.
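The conversion mentioned in the note can be sketched as follows. This is only an illustration: the data frame toy and its columns x_A and x_B are hypothetical and not part of the question's data, and the conversion uses base R's lapply so it does not depend on a particular dplyr version.

```r
library(dplyr)

# Hypothetical toy data (not from the question): two factor columns, each partly NA
toy <- data.frame(
  x_A = factor(c("a", NA, NA)),
  x_B = factor(c(NA, "b", "c"))
)

# Convert every factor column to character, leaving other columns untouched
toy_chr <- toy
toy_chr[] <- lapply(toy_chr, function(col) {
  if (is.factor(col)) as.character(col) else col
})

# coalesce() now operates on plain character vectors
toy_chr <- toy_chr %>%
  mutate(x = coalesce(x_A, x_B))
```

After the conversion, x holds the first non-missing value of x_A and x_B in each row.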
Data:
df = structure(list(ID = c(1, 2, 3), DV1_A = c(1, NA, NA), DV1_B = c(NA,
4, NA), DV1_C = c(NA, NA, 5), DV2_A = c(3, NA, NA), DV2_B = c(NA,
3, NA), DV2_C = c(NA, NA, 5), FACT = structure(1:3, .Label = c("A",
"B", "C"), class = "factor")), .Names = c("ID", "DV1_A", "DV1_B",
"DV1_C", "DV2_A", "DV2_B", "DV2_C", "FACT"), row.names = c(NA,
-3L), class = "data.frame")
The base transform will do this:
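# Assumes exactly one non-NA value per row within each group of columns;
# with na.rm = TRUE, a row that is entirely NA sums to 0 rather than NA.
d <- transform(d,
  DV1 = rowSums(d[c("DV1_A", "DV1_B", "DV1_C")], na.rm = TRUE),
  DV2 = rowSums(d[c("DV2_A", "DV2_B", "DV2_C")], na.rm = TRUE)
)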
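The rowSums approach above depends on each row holding exactly one non-missing value per variable. As a hedged sketch of a base R alternative that instead takes the first non-missing value and keeps NA when a row is entirely missing, one could fold the columns with Reduce and ifelse. The four-row data frame here is made up for illustration, with a deliberately all-NA last row.

```r
# Made-up data with an all-NA fourth row to show the difference
d <- data.frame(
  DV1_A = c(1, NA, NA, NA),
  DV1_B = c(NA, 4, NA, NA),
  DV1_C = c(NA, NA, 5, NA)
)
cols <- c("DV1_A", "DV1_B", "DV1_C")

# Fold left to right: keep x where it is present, otherwise fall back to y
d$DV1 <- Reduce(function(x, y) ifelse(is.na(x), y, x), d[cols])

# For comparison: rowSums turns the all-NA row into 0
d$DV1_sum <- unname(rowSums(d[cols], na.rm = TRUE))
```

In this sketch the fourth row of DV1 stays NA, while DV1_sum reports 0 for it, which may or may not be what you want.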
Another solution similar to @useR's, but rather than creating each column individually, this creates a list of expressions that get evaluated all at once. It may still suffer from the same "don't splice data frames into calls with !!!" fault that was mentioned in the comments since it uses select(.), but I thought I would post it anyway.
library(rlang)
library(dplyr)
df <- data.frame(ID = c(1,2,3), DV1_A=c(1,NA,NA),
DV1_B= c(NA,4,NA), DV1_C = c(NA,NA,5),
DV2_A=c(3,NA,NA), DV2_B=c(NA,3,NA),
DV2_C=c(NA,NA,5), FACT = c("A","B","C"))
create_DV <- function(num) {
DV_name <- sym(paste0("DV_", num))
DV_char <- paste0("DV", num)
expr(!! DV_name := select(., contains(!! DV_char)) %>% rowSums(na.rm = TRUE))
}
DV_expr_list <- c(1,2) %>%
lapply(create_DV)
df %>%
mutate(
!!! DV_expr_list
)
#> ID DV1_A DV1_B DV1_C DV2_A DV2_B DV2_C FACT DV_1 DV_2
#> 1 1 1 NA NA 3 NA NA A 1 3
#> 2 2 NA 4 NA NA 3 NA B 4 3
#> 3 3 NA NA 5 NA NA 5 C 5 5
This will work, though it is not a very elegant solution given that you could use the coalesce function already mentioned:
library(dplyr)
test <- df %>%
  group_by(ID) %>%
  summarise(
    DV1 = ifelse(!is.na(DV1_A), paste(DV1_A),
          ifelse(!is.na(DV1_B), paste(DV1_B),
          ifelse(!is.na(DV1_C), paste(DV1_C), ""))),
    DV2 = ifelse(!is.na(DV2_A), paste(DV2_A),
          ifelse(!is.na(DV2_B), paste(DV2_B),
          ifelse(!is.na(DV2_C), paste(DV2_C), "")))
  )
You could also do this via gather and spread with tidyr and dplyr. Less concise than @useR's solution, but might be useful if you need to do any intermediate manipulation.
library(dplyr)
library(tidyr)
df %>%
gather(variable, value, -ID, -FACT, na.rm = TRUE) %>%
mutate(variable = gsub("\\_[A-Z]", "", variable)) %>%
spread(variable, value) %>%
left_join(df)
ID FACT DV1 DV2 DV1_A DV1_B DV1_C DV2_A DV2_B DV2_C
1 1 A 1 3 1 NA NA 3 NA NA
2 2 B 4 3 NA 4 NA NA 3 NA
3 3 C 5 5 NA NA 5 NA NA 5
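In newer tidyr releases, gather and spread are superseded by pivot_longer and pivot_wider. As a rough sketch of the same reshaping with pivot_longer (assuming a tidyr version that provides it and the ".value" sentinel; the column name suffix is just a throwaway label chosen here):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c(1, 2, 3),
  DV1_A = c(1, NA, NA), DV1_B = c(NA, 4, NA), DV1_C = c(NA, NA, 5),
  DV2_A = c(3, NA, NA), DV2_B = c(NA, 3, NA), DV2_C = c(NA, NA, 5),
  FACT = c("A", "B", "C")
)

merged <- df %>%
  pivot_longer(
    cols = starts_with("DV"),
    # The part before "_" becomes the output column name (DV1, DV2);
    # the part after "_" (A, B, C) goes into a helper column
    names_to = c(".value", "suffix"),
    names_sep = "_",
    # Drop the rows where both DV1 and DV2 are NA
    values_drop_na = TRUE
  ) %>%
  select(-suffix)
```

This yields one row per ID with the merged DV1 and DV2 columns; a left_join(df), as in the answer above, would append the original columns again.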