I have a data frame where some consecutive columns have the same name. I need to search for these, add their values in for each row, drop one column and replace the other wi
> dfrm <- data.frame(a = 1:10, b= 1:10, cc= 1:10, dd=1:10, ee=1:10)
> names(dfrm) <- c("a", "a", "b", "b", "b")
> sapply(unique(names(dfrm)[duplicated(names(dfrm))]),
function(x) Reduce("+", dfrm[ , grep(x, names(dfrm))]) )
a b
[1,] 2 3
[2,] 4 6
[3,] 6 9
[4,] 8 12
[5,] 10 15
[6,] 12 18
[7,] 14 21
[8,] 16 24
[9,] 18 27
[10,] 20 30
EDIT 2: Using rowSums allows simplification of the first sapply argumentto just unique(names(dfrm))
at the expense of needing to remember to include drop=FALSE in "[":
sapply(unique(names(dfrm)),
function(x) rowSums( dfrm[ , grep(x, names(dfrm)), drop=FALSE]) )
To deal with NA's:
sapply(unique(names(dfrm)),
function(x) apply(dfrm[grep(x, names(dfrm))], 1,
function(y) if ( all(is.na(y)) ) {NA} else { sum(y, na.rm=TRUE) }
) )
(Edit note: addressed Tommy counter-example by putting unique around the names(.)[.] construction. The erroneous code was:
sapply(names(dfrm)[unique(duplicated(names(dfrm)))],
function(x) Reduce("+", dfrm[ , grep(x, names(dfrm))]) )
Some sample data.
dfr <- data.frame(
foo = rnorm(20),
bar = 1:20,
bar = runif(20),
check.names = FALSE
)
Method: Loop over unique column names; if there is only one of that name, then selecting all columns with that nme will return a vector, but if there are duplicates it will also be a data frame. Use rowSums
to sum over rows. (Duh. EDIT: Not quite as 'duh' as previously thought!) EDIT: lapply
returns a list, which we need to reform into a data frame, and finally we fix the names.sapply
avoids the need for the last step.
unique_col_names <- unique(colnames(dfr))
new_dfr <- sapply(unique_col_names, function(name)
{
subs <- dfr[, colnames(dfr) == name]
if(is.data.frame(subs))
rowSums(subs)
else
subs
})
Here is my one liner
# transpose data frame, sum by group = rowname, transpose back.
t(rowsum(t(dfrm), group = rownames(t(dfrm))))
One way is to identify duplcates using (surprise) the duplicated
function, and then loop through them to calculate the sums. Here is an example:
dat.dup <- data.frame(x=1:10, x=1:10, x=1:10, y=1:10, y=1:10, z=1:10, check.names=FALSE)
dups <- unique(names(dat.dup)[duplicated(names(dat.dup))])
for (i in dups) {
dat.dup[[i]] <- rowSums(dat.dup[names(dat.dup) == i])
}
dat <- dat.dup[!duplicated(names(dat.dup))]