问题
I have found that data.table and dplyr have differing results when trying to do the same thing. I would like to use dplyr syntax, but have it compute in the way that data.table does. The use case is that I want to add subtotals to a table. To do that, I need to do some aggregation to each variable, but then keep the same variable names (in the transformed version). Data.table allows me to perform some aggregation on a variable and keep the same name. Then do another aggregation with that same variable. It will continue to use the untransformed version. Dplyr, however, will use the transformed version.
In the summarize documentation it says:
# Note that with data frames, newly created summaries immediately
# overwrite existing variables
mtcars %>%
group_by(cyl) %>%
summarise(disp = mean(disp), sd = sd(disp))
This is basically the issue I am running into, but I'm wondering if there is a nice workaround. One thing I found was to just name the transformed variable something else then rename it at the end, but that does not look very nice to me. If there is a nice way to do subtotals, that'd be good to know as well. I looked around this site and did not see this exact situation discussed. Any help would be greatly appreciated!
Here I have made a simple example, once with data.table's results, and once with dplyr's. I want to take this simple table and append a subtotal row that is the weighted average of the column of interest (Total).
library(data.table)
library(dplyr)
dt <- data.table(Group = LETTERS[1:5],
Count = c(1000, 1500, 1200, 2000, 5000),
Total = c(50, 300, 600, 400, 1000))
dt[, Count_Dist := Count/sum(Count)]
dt[, .(Count_Dist = sum(Count_Dist), Weighted_Total = sum(Count_Dist*Total))]
dt <- rbind(dt[, .(Group, Count_Dist, Total)],
dt[, .(Group = "All", Count_Dist = sum(Count_Dist), Total = sum(Count_Dist*Total))])
setnames(dt, "Total", "Weighted_Avg_Total")
dt
df <- data.frame(Group = LETTERS[1:5],
Count = c(1000, 1500, 1200, 2000, 5000),
Total = c(50, 300, 600, 400, 1000))
df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
summarize(Count_Dist = sum(Count_Dist),
Weighted_Total = sum(Count_Dist*Total))
df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
select(Group, Count_Dist, Total) %>%
rbind(df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
summarize(Group = "All",
Count_Dist = sum(Count_Dist),
Total = sum(Count_Dist*Total))) %>%
rename(Weighted_Avg_Total = Total)
Thanks again for any help!
回答1:
A possible solution is to skip the mutate
steps and use transmute
for the first mutate
/select
-step and directly calculate the desired variables from the original variables without creating an intermediate variable for the second mutate
-step:
df %>%
transmute(Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total) %>%
bind_rows(df %>%
summarize(Group = "All",
Count_Dist = sum(Count/sum(Count)),
Weighted_Avg_Total = sum((Count/sum(Count))*Total)))
which gives:
Group Count_Dist Weighted_Avg_Total 1 A 0.09345794 50.0000 2 B 0.14018692 300.0000 3 C 0.11214953 600.0000 4 D 0.18691589 400.0000 5 E 0.46728972 1000.0000 6 All 1.00000000 656.0748
Another possible solution is to alter the order in which the new variables are calculated in dplyr
and then use select
to get the column-order back into what you originally wanted:
df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
select(Group, Count_Dist, Weighted_Avg_Total = Total) %>%
bind_rows(df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
summarize(Group = "All",
Weighted_Avg_Total = sum(Count_Dist*Total),
Count_Dist = sum(Count_Dist)) %>%
select(Group, Count_Dist, Weighted_Avg_Total))
If you want to include the Count
-column as well, you could do (based on my comment from below):
df %>%
transmute(Group = Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total, Count) %>%
bind_rows(df %>%
summarize(Group = "All",
Count_Dist = sum(Count/sum(Count)),
Weighted_Avg_Total = sum((Count/sum(Count))*Total),
Count = sum(Count)))
回答2:
One alternative could be to use mutate
twice to calculate even Weighted_Total
and use sum
of that column in summarize
.
df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
mutate(Weighted_Total = Count_Dist*Total) %>%
summarize(Count_Dist = sum(Count_Dist),
Weighted_Total = sum(Weighted_Total))
Result:
Count_Dist Weighted_Total
1 1 656.074766
And:
df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
select(Group, Count_Dist, Total) %>%
rbind(df %>%
mutate(Count_Dist = Count/sum(Count)) %>%
mutate(Weighted_Total = Count_Dist*Total) %>%
summarize(Group = "All",
Count_Dist = sum(Count_Dist),
Total = sum(Weighted_Total))) %>%
rename(Weighted_Avg_Total = Total)
Result:
Group Count_Dist Weighted_Avg_Total
1 A 0.0934579439 50.000000
2 B 0.1401869159 300.000000
3 C 0.1121495327 600.000000
4 D 0.1869158879 400.000000
5 E 0.4672897196 1000.000000
6 All 1.0000000000 656.074766
来源:https://stackoverflow.com/questions/48357867/using-dplyr-to-summarize-and-keep-the-same-variable-name