问题
Quite often, I find myself manually combining select() and mutate() functions within dplyr. This is usually because I'm tidying up a dataframe, want to create new columns based on the old columns, and only want keep the new columns.
For example, if I had data about heights and widths but only wanted to use them to calculate and keep the area then I would use:
library(dplyr)
df <- data.frame(height = 1:3, width = 10:12)
df %>%
mutate(area = height * width) %>%
select(area)
When there are a lot of variables being created in the mutate step it can be difficult to make sure they're all in the select step. Is there a more elegant way to only keep the variables defined in the mutate step?
One workaround I've been using is the following:
df %>%
mutate(id = row_number()) %>%
group_by(id) %>%
summarise(area = height * width) %>%
ungroup() %>%
select(-id)
This works but is pretty verbose, and the use of summarise() means there's a performance hit:
library(microbenchmark)
microbenchmark(
df %>%
mutate(area = height * width) %>%
select(area),
df %>%
mutate(id = row_number()) %>%
group_by(id) %>%
summarise(area = height * width) %>%
ungroup() %>%
select(-id)
)
Output:
min lq mean median uq max neval cld
868.822 954.053 1258.328 1147.050 1363.251 4369.544 100 a
1897.396 1958.754 2319.545 2247.022 2549.124 4025.050 100 b
I'm thinking there's another workaround where you can compare the original dataframe names with the new dataframe names and take the right complement, but maybe there's a better way?
I feel like I'm missing something really obvious in the dplyr documentation, so apologies if this is trivial!
回答1:
Just create your own function that combines the two steps:
mutate_only = function (.data, ...) {
names = names(match.call(expand.dots = FALSE)$...)
.data %>% mutate(...) %>% select(one_of(names))
}
This needs some work to function properly with standard evaluation. Unfortunately the dplyr API is currently evolving on that point so I don’t know what the recommendation for this will be in a few weeks’ time. Therefore I’ll just refer to the relevant documentation.
回答2:
Just to give more visibility to @Nate's comment, transmute()
is the way to go!! From its description:
mutate() adds new variables and preserves existing; transmute() drops existing variables.
edit: to give a working example,
df %>%
transmute(area = height * width)
is the same as
df %>%
mutate(area = height * width) %>%
select(area)
来源:https://stackoverflow.com/questions/43375156/combine-select-and-mutate