I'm struggling with finding an efficient solution for the following problem:
I have a large manipulated data frame with around 8 columns and 80000 rows that generally i
This is a good task for the split-apply-combine paradigm. First, you split your data frame by company/year pair:
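data = data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                  years.raw = c(1, 1, 1, 1, 2, 2),
                  source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                  amount.inkg = c(5, 2, 10, 15, 14, 18))
# Source groupings (assumed here; they are consistent with the output shown below)
vector1 = c("Tea", "Coffee")
vector2 = c("Ink", "Printer")
vector3 = c("Recycling", "Combusted")
spl = split(data, paste(data$company.raw, data$years.raw))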
Now, you compute the rolled-up data frame for each element in the split-up data:
spl2 = lapply(spl, function(x) {
  data.frame(Company = x$company.raw[1],
             Year = x$years.raw[1],
             amount.vector1 = sum(x$amount.inkg[x$source %in% vector1]),
             amount.vector2 = sum(x$amount.inkg[x$source %in% vector2]),
             amount.vector3 = sum(x$amount.inkg[x$source %in% vector3]))
})
And finally, combine everything together:
do.call(rbind, spl2)
#      Company Year amount.vector1 amount.vector2 amount.vector3
# C1 1      C1    1              0              5              2
# C2 1      C2    1             10              0             15
# C2 2      C2    2             18             14              0
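If the number of source groups grows, the near-identical sum() lines can be generated from a named list instead of being written out by hand. A minimal sketch, assuming the groupings implied by the output above; groups and roll_up are hypothetical names, not part of the original answer:

```r
data <- data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                   years.raw = c(1, 1, 1, 1, 2, 2),
                   source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                   amount.inkg = c(5, 2, 10, 15, 14, 18))

# Assumed groupings: one list entry per output column
groups <- list(amount.vector1 = c("Tea", "Coffee"),
               amount.vector2 = c("Ink", "Printer"),
               amount.vector3 = c("Recycling", "Combusted"))

# Roll one company/year chunk up into a single row
roll_up <- function(x) {
  sums <- lapply(groups, function(g) sum(x$amount.inkg[x$source %in% g]))
  data.frame(Company = x$company.raw[1], Year = x$years.raw[1], sums)
}

# Split, apply, combine
spl <- split(data, paste(data$company.raw, data$years.raw))
result <- do.call(rbind, lapply(spl, roll_up))
```

Adding a fourth group then only means adding one entry to groups.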
Your data is in 'long form' (multiple rows of company, source, year, ...)
You want to aggregate amount.inkg over each company and year, for multiple values of source. Specifically, you want to aggregate with conditionals on the 'source' field.
Again, please give us a reproducible example. (Thanks, josilber.) This is a four-liner with either Split-Apply-Combine (ddply) or logical indexing:
df = data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                years.raw = c(1, 1, 1, 1, 2, 2),
                source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                amount.inkg = c(5, 2, 10, 15, 14, 18))
# OPTION 1. Split-Apply-Combine: ddply(...summarize) with a conditional on the data
library(plyr) # dplyr if performance on large d.f. becomes an issue
ddply(df, .(company.raw, years.raw), summarize,
      amount.vector1 = sum(amount.inkg[source %in% c('Tea','Coffee')]),
      amount.vector2 = sum(amount.inkg[source %in% c('Ink','Printer')]),
      amount.vector3 = sum(amount.inkg[source %in% c('Recycling','Combusted')])
)
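# OPTION 2. sum with logical indexing on the df:
# (This is from before you modified the question to one-row-per-company-and-per-year)
# Note: this computes one grand total over all companies and years,
# and the same value is recycled into every row of df.
df$amount.vector1 <- sum(df$amount.inkg[df$source %in% c('Tea','Coffee')])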
# josilber clarifies you want one-row-per-company
...
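For the one-row-per-company-and-year layout, one possible way to keep the logical-indexing flavour in base R is to label each row with its target column and then cross-tabulate. This is only a sketch: lookup, key and group are hypothetical names, and the source groupings are the ones used in Option 1.

```r
df <- data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                 years.raw = c(1, 1, 1, 1, 2, 2),
                 source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                 amount.inkg = c(5, 2, 10, 15, 14, 18))

# Hypothetical lookup table: source value -> output column
lookup <- c(Tea = "amount.vector1", Coffee = "amount.vector1",
            Ink = "amount.vector2", Printer = "amount.vector2",
            Recycling = "amount.vector3", Combusted = "amount.vector3")

df$group <- unname(lookup[as.character(df$source)])   # target column per row
df$key <- paste(df$company.raw, df$years.raw)         # company/year pair

# Sum amount.inkg per company/year (rows) and group (columns);
# combinations that do not occur are reported as 0
wide <- xtabs(amount.inkg ~ key + group, data = df)
```

The result is a table with one row per company/year pair; as.data.frame.matrix(wide) converts it to a regular data frame if needed.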
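Option 3. You could also use aggregate (see ?aggregate) with its subset argument, although aggregate for a single sum is overkill.
aggregate(amount.inkg ~ company.raw + years.raw, data = df, subset = source %in% c('Tea','Coffee'), FUN = sum)
The grouping goes in the formula (or in the by argument of the non-formula method, which must be a list), and the subset argument is where rows are selected by criteria. Note that company/year pairs with no matching rows are dropped from the result rather than reported as 0.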
Note: %in% is a thin wrapper around match(), so it is already vectorized. If you prefer explicit boolean operations for a short list of values, the equivalent is:
(source=='Tea' | source=='Coffee')
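A quick sketch comparing the two forms on the sample sources (src is a hypothetical stand-in for the source column). They select the same rows when there are no missing values, but they treat NA differently, which matters once the result is used as an index:

```r
src <- c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea")

by_in <- src %in% c("Tea", "Coffee")
by_bool <- src == "Tea" | src == "Coffee"
same <- identical(by_in, by_bool)   # both are plain logical vectors

# With a missing source, %in% yields FALSE while == propagates NA
na_in <- NA_character_ %in% c("Tea", "Coffee")
na_bool <- NA_character_ == "Tea" | NA_character_ == "Coffee"
```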
As to preventing NA sums if the subset is empty: sum(c()) is 0, so don't worry about that. If you do end up with NAs, either use na.omit, or apply ifelse(is.na(x),0,x) to the final result.
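To illustrate the point about empty subsets and NAs, here is a small sketch on made-up values (amount, src and x are hypothetical, not the data from the question). An empty selection sums to 0; an NA only appears when the selected amounts themselves contain NA:

```r
amount <- c(5, 2, NA)
src <- c("Ink", "Recycling", "Tea")

# No row matches "Coffee": the sum of an empty vector is 0, not NA
empty_sum <- sum(amount[src %in% "Coffee"])

# The matching amount is NA, so the plain sum is NA
na_sum <- sum(amount[src %in% "Tea"])

# Dropping NAs inside sum() avoids that
safe_sum <- sum(amount[src %in% "Tea"], na.rm = TRUE)

# Replacing NAs with 0 on a final result vector
x <- c(3, NA, 7)
x_filled <- ifelse(is.na(x), 0, x)
```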