I would like to use dplyr's split-apply-combine strategy to apply the summary() command to each group.
Take a simple data frame:
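The data frame itself did not survive in the text here. A minimal sketch of one that is consistent with the summaries printed further down, assuming three values per class (the values are inferred from that output, not taken from the original post):

```r
# Assumed example data: values chosen so that summary() per class
# matches the output shown in the answers (not from the original post).
df <- data.frame(
  class = factor(c("A", "A", "A", "B", "B", "B")),
  value = c(100, 110, 120, 800, 840, 880)
)

# Base-R check of the per-class summaries
tapply(df$value, df$class, summary)
```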
The problem is that dplyr's do() only works when the expression passed to it returns a data.frame.
The broom package's tidy() function can be used to convert the output of summary() to a data.frame.
library(dplyr)
library(broom)

df %>%
  group_by(class) %>%
  do(tidy(summary(.$value)))
This gives:
Source: local data frame [2 x 7]
Groups: class [2]

   class minimum    q1 median  mean    q3 maximum
  (fctr)   (dbl) (dbl)  (dbl) (dbl) (dbl)   (dbl)
1      A     100   105    110   110   115     120
2      B     800   820    840   840   860     880
You can use the SE version of data_frame, that is, data_frame_, and perform:
df %>%
  group_by(class) %>%
  do(data_frame_(summary(.$value)))
Alternatively, you can use as.list() wrapped by data.frame() with the argument check.names = FALSE:
df %>%
  group_by(class) %>%
  do(data.frame(as.list(summary(.$value)), check.names = FALSE))
Both versions produce:
# Source: local data frame [2 x 7]
# Groups: class [2]
#
#    class  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
#   (fctr) (dbl)   (dbl)  (dbl) (dbl)   (dbl) (dbl)
# 1      A   100     105    110   110     115   120
# 2      B   800     820    840   840     860   880
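To see why check.names = FALSE matters in the second version, here is a small base-R sketch on a single vector (x is only an illustrative stand-in for one group's values). Without the argument, data.frame() rewrites the summary labels into syntactic names:

```r
x <- c(100, 110, 120)  # illustrative stand-in for one group's values

s <- summary(x)        # named numeric of class "summaryDefault"
as_cols <- as.list(s)  # one list element per statistic, names preserved

# Default: column names are passed through make.names()
names(data.frame(as_cols))
# "Min." "X1st.Qu." "Median" "Mean" "X3rd.Qu." "Max."

# check.names = FALSE keeps the original labels
wide <- data.frame(as_cols, check.names = FALSE)
names(wide)
# "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
```

The result is a one-row data frame, which is the shape do() expects for each group.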
The behavior of do() depends on whether you give it a named or an unnamed argument. For unnamed arguments, it expects a data.frame for each group, and these are bound together row-wise. For named arguments, it makes one row per group and puts whatever the output is into a new list-column with that name.
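As a rough illustration of the two modes, the sketch below runs do() once with an unnamed and once with a named argument. The data frame and the column names mean and rng are assumptions made up for the example, and it relies on do() still being available in the installed dplyr (it is marked as superseded in recent versions):

```r
library(dplyr)

# Assumed example data (consistent with the outputs in this thread)
df <- data.frame(
  class = c("A", "A", "A", "B", "B", "B"),
  value = c(100, 110, 120, 800, 840, 880)
)

# Unnamed argument: each group must return a data frame;
# the per-group results are row-bound and the grouping column is kept.
unnamed <- df %>%
  group_by(class) %>%
  do(data.frame(mean = mean(.$value)))

# Named argument: any object is allowed; it is stored in a list-column.
named <- df %>%
  group_by(class) %>%
  do(rng = range(.$value))

unnamed$mean      # one numeric per group
named$rng[[1]]    # range of the first group
```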
So in this case it will complain about the unnamed use (summary does not produce a data.frame), but the named use will work:
df2 <- df %>%
  group_by(class) %>%
  do(summaries = summary(.$value))
Which gives:
Source: local data frame [2 x 2]
Groups: <by row>

   class                  summaries
  (fctr)                      (chr)
1      A <S3:summaryDefault, table>
2      B <S3:summaryDefault, table>
We can access a summary like this:
df2$summaries[[1]]
Giving:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    100     105     110     110     115     120
Getting all of these as new columns for df can only be done by first converting the output to a data.frame, as can be seen in the other answers.
So the root of the problem here is that summary() outputs a table instead of a data.frame.
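This is easy to check in base R. A small sketch on a single illustrative vector:

```r
s <- summary(c(100, 110, 120))  # illustrative values for one group

class(s)          # "summaryDefault" "table"
is.data.frame(s)  # FALSE

# unclass() exposes the underlying named numeric vector
unclass(s)
```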