I want to start using dplyr in place of ddply but I can't get a handle on how it works (I've read the documentation).
For example, why when I try to mutate() something does the "group_by" function not work as it's supposed to?
Looking at mtcars:
library(car)
Say I make a data.frame which is a summary of mtcars, grouped by "cyl" and "gear":
df1 <- mtcars %.%
group_by(cyl, gear) %.%
summarise(
newvar = sum(wt)
)
Then say I want to further summarise this dataframe. With ddply, it'd be straightforward, but when I try to do with with dplyr, it's not actually "grouping by":
df2 <- df1 %.%
group_by(cyl) %.%
mutate(
newvar2 = newvar + 5
)
Still yields an ungrouped output:
cyl gear newvar newvar2
1 6 3 6.675 11.675
2 4 4 19.025 24.025
3 6 4 12.375 17.375
4 6 5 2.770 7.770
5 4 3 2.465 7.465
6 8 3 49.249 54.249
7 4 5 3.653 8.653
8 8 5 6.740 11.740
Am I doing something wrong with the syntax?
Edit:
If I were to do this with plyr and ddply:
df1 <- ddply(mtcars, .(cyl, gear), summarise, newvar = sum(wt))
and then to get the second df:
df2 <- ddply(df1, .(cyl), summarise, newvar2 = sum(newvar) + 5)
But that same approach, with sum(newvar) + 5 in the summarise() function doesn't work with dplyr...
Taking Dickoa's answer one step further -- as Hadley says "summarise peels off a single layer of grouping". It peels off grouping from the reverse order in which you applied it so you can just use
mtcars %>%
group_by(cyl, gear) %>%
summarise(newvar = sum(wt)) %>%
summarise(newvar2 = sum(newvar) + 5)
Note that this will give a different answer if you use group_by(gear, cyl)
in the second line.
And to get your first attempt working:
df1 <- mtcars %>%
group_by(cyl, gear) %>%
summarise(newvar = sum(wt))
df2 <- df1 %>%
group_by(cyl) %>%
summarise(newvar2 = sum(newvar)+5)
I had a similar problem. I found that simply detaching plyr
solved it:
detach(package:plyr)
library(dplyr)
If you translate your plyr
code into dplyr
using summarise
instead of mutate
you get the same results.
library(plyr)
df1 <- ddply(mtcars, .(cyl, gear), summarise, newvar = sum(wt))
df2 <- ddply(df1, .(cyl), summarise, newvar2 = sum(newvar) + 5)
df2
## cyl newvar2
## 1 4 30.143
## 2 6 26.820
## 3 8 60.989
detach(package:plyr)
library(dplyr)
mtcars %.%
group_by(cyl, gear) %.%
summarise(newvar = sum(wt)) %.%
group_by(cyl) %.%
summarise(newvar2 = sum(newvar) + 5)
## cyl newvar2
## 1 4 30.143
## 2 8 60.989
## 3 6 26.820
EDIT
Since summarise
drops the last group (gear
) you can skip the second group_by
(see @hadley comment below)
library(dplyr)
mtcars %.%
group_by(cyl, gear) %.%
summarise(newvar = sum(wt)) %.%
summarise(newvar2 = sum(newvar) + 5)
## cyl newvar2
## 1 4 30.143
## 2 8 60.989
## 3 6 26.820
Detaching plyr
is one way to solve the problem so you can use dplyr
functions as desired... but what if you need other functions from plyr
to complete other tasks in your code?
(In this example, I've got both dplyr
and plyr
libraries loaded)
Suppose we have a simple data.frame and we want to compute the groupwise sum of the variable value
, when grouped by different levels of gname
> dx<-data.frame(gname=c(1,1,1,2,2,2,3,3,3), value = c(2,2,2,4,4,4,5,6,7))
> dx
gname value
1 1 2
2 1 2
3 1 2
4 2 4
5 2 4
6 2 4
7 3 5
8 3 6
9 3 7
But when we try to use what we believe will produce a dplyr
grouped sum, here's what happens:
dx %>% group_by(gname) %>% mutate(mysum=sum(value))
Source: local data frame [9 x 3]
Groups: gname
gname value mysum
1 1 2 36
2 1 2 36
3 1 2 36
4 2 4 36
5 2 4 36
6 2 4 36
7 3 5 36
8 3 6 36
9 3 7 36
It doesn't give us the desired answer. Probably because of some interaction or overloading of the group_by
and or mutate
functions between dplyr
and plyr
. We could detach plyr
, but another way is to give a unique call to the dplyr
versions of group_by
and mutate
:
dx %>% dplyr::group_by(gname) %>% dplyr::mutate(mysum=sum(value))
Source: local data frame [9 x 3]
Groups: gname
gname value mysum
1 1 2 6
2 1 2 6
3 1 2 6
4 2 4 12
5 2 4 12
6 2 4 12
7 3 5 18
8 3 6 18
9 3 7 18
now we see that this works as expected.
dplyr is working as you should expect in your example. Mutate, as you specified it, will just add 5 to each value of newvar as it creates newvar2. This would look the same if you group or not. If, however, you specify something that differs by group you will get something different. For example:
df1 %.%
group_by(cyl) %.%
mutate(
newvar2 = newvar + mean(cyl)
)
来源:https://stackoverflow.com/questions/21653295/dplyr-issues-when-using-group-bymultiple-variables