Efficient conditional summing by multiple conditions in R

孤街醉人 提交于 2019-12-02 12:02:17

Your data is in 'long form' (multiple rows of company, source, year, ...)

You want to aggregate amount.inkg over each company and year, for multiple values of source. Specifically you want to aggregate with conditionals on 'source' field.

Again, please give us reproducible example. (Thanks josilber). This is a four-liner with either Split-Apply-Combine(ddply) or logical indexing:

df = data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                years.raw = c(1, 1, 1, 1, 2, 2),
                source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                amount.inkg = c(5, 2, 10, 15, 14, 18))

# OPTION 1. Split-Apply-Combine: ddply(...summarize) with a conditional on the data
require(plyr) # dplyr if performance on large d.f. becomes an issue
ddply(df, .(company.raw,years.raw), summarize,
    amount.vector1=sum(amount.inkg[source %in% c('Tea','Coffee')]),
    amount.vector2=sum(amount.inkg[source %in% c('Ink','Printer')]),
    amount.vector3=sum(amount.inkg[source %in% c('Recycling','Combusted')])
)


# OPTION 2. sum with logical indexing on the df:
# (This is from before you modified the question to one-row-per-company-and-per-year)
df$amount.vector1 <- sum( df[(df$source %in% c('Tea','Coffee')),]$amount.inkg )
# josilber clarifies you want one-row-per-company
...

Option 3. You could also use aggregate(manpage here) with subset(...), although aggregate for a sum is overkill.

aggregate(df, source %in% c('Tea','Coffee'), FUN = sum)

The by argument to aggregate is where the action is (selecting, subsetting by criteria).

Note: %in% performs a scan operation, so if your vector and d.f. get large, or for scalability, you'd need to break it into boolean operations which can be vectorized: (source=='Tea' | source=='Coffee')

As to preventing NA sums if the subset was empty, sum(c()) = 0 so don't worry about that. But if you do, either use na.omit, or do ifelse(is.na(x),0,x) on the final result.

This is a good task for the split-apply-combine paradigm. First, you split your data frame by company/year pair:

data = data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                  years.raw = c(1, 1, 1, 1, 2, 2),
                  source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                  amount.inkg = c(5, 2, 10, 15, 14, 18))
spl = split(data, paste(data$company.raw, data$years.raw))

Now, you compute the rolled-up data frame for each element in the split-up data:

spl2 = lapply(spl, function(x) {
  data.frame(Company=x$company.raw[1],
             Year=x$years.raw[1],
             amount.vector1 = sum(x$amount.inkg[x$source %in% vector1]),
             amount.vector2 = sum(x$amount.inkg[x$source %in% vector2]),
             amount.vector3 = sum(x$amount.inkg[x$source %in% vector3]))
})

And finally, combine everything together:

do.call(rbind, spl2)
#      Company Year amount.vector1 amount.vector2 amount.vector3
# C1 1      C1    1              0              5              2
# C2 1      C2    1             10              0             15
# C2 2      C2    2             18             14              0
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!