I tried searching but didn't find an answer to this question.
I'm trying to use the select statement in dplyr but am having problems when I try to send it strings.
Select seems to work with the column indexes (dplyr 0.2), so just match your desired names to their index and use them to select the columns.
myCols <- c("mpg","disp")
colNums <- match(myCols,names(mtcars))
mtcars %>% select(colNums)
You can use get() to get the object named by the string in the current environment. So:
R> iris %>% select(Species, Petal.Length) %>% head(3)
Species Petal.Length
1 setosa 1.4
2 setosa 1.4
3 setosa 1.3
R> iris %>% select('Species', 'Petal.Length') %>% head(3)
Error in abs(ind[ind < 0]) :
non-numeric argument to mathematical function
R> iris %>% select(get('Species'), get('Petal.Length')) %>% head(3)
Species Petal.Length
1 setosa 1.4
2 setosa 1.4
3 setosa 1.3
R> s <- 'Species'
R> p <- 'Petal.Length'
R> iris %>% select(get(s), get(p)) %>% head(3)
Species Petal.Length
1 setosa 1.4
2 setosa 1.4
3 setosa 1.3
I figured it out through trial and error. If anybody is curious, I did something like this:
names.gens <- lapply(names.gens, as.name)
select(df.main.scaled, eval(names.gens[[i]]), eval(names.gens[[i+someindex]]))
[Edit - some of the below is now out of date with the release of dplyr 0.7 - see here]
The question is about the difference between standard evaluation and non standard evaluation.
tl;dr: You can use the 'standard evaluation' counterpart of dplyr::select, which is dplyr::select_.
This allows you to provide column names as variables which contain strings:
dplyr::select_(df.main.scaled, names.gens[i,1], names.gens[i,2])
Here is lots more detail that tries to explain how this works:
Non-standard evaluation is the evaluation of code in non-standard ways. Often, this means capturing expressions before they are evaluated, and evaluating them in a different environment (context/scope) to normal. When you provide dplyr::select with column names without quotation marks, dplyr is using non-standard evaluation to interpret them as columns.
Supposing we have the following data frame:
df <- tibble::data_frame(a = 1:5, b = 6:10, c = 11:15, d = 16:20)
A simple example of the select statement is as follows:
r <- dplyr::select(df, a, b)
This is an example of NSE because a and b are not variables that exist in the global environment. Instead of searching for a and b in the global environment, dplyr::select directs R to search for the variables a and b in the context of the data frame df. You can think of the environment a bit like a list, and a and b as keys. So the following is a bit like telling R to look up df$a and df$b.
Function arguments in R are promises which are not evaluated immediately. They can be captured as expressions and then run in a different environment.
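The capture-then-evaluate idea can be sketched in base R, without dplyr. This is only an illustration of the mechanism: pick_column is a hypothetical helper made up for this example, and it is not how dplyr::select is actually implemented.

```r
# Hypothetical helper: returns one column of a data frame,
# with the column named without quotes.
pick_column <- function(data, column) {
  # substitute() captures the unevaluated expression the caller typed
  expr <- substitute(column)
  # eval() looks the name up inside the data frame first,
  # instead of starting in the global environment
  eval(expr, envir = data, enclos = parent.frame())
}

df <- data.frame(a = 1:5, b = 6:10)
first_col <- pick_column(df, a)  # works even if no object `a` exists outside df
```

Because the name is captured rather than evaluated, the call succeeds even though a is only defined as a column of df.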
This is fine if we know in advance that we want to select the columns a and b. But what if the columns are not known in advance, and are instead held in a variable?
columns_to_select <- c("a", "b")
The following does not work:
dplyr::select(df, columns_to_select)
The resulting error tells us that there is no column called 'columns_to_select' in the data frame. The argument columns_to_select has been evaluated in the context of the data frame, so R has tried to do something like df$columns_to_select, and found that the column does not exist.
How do we fix this?
Tidyverse functions always provide an 'escape hatch' that allows you to get around this limitation. The dplyr vignette says 'Every function in dplyr that uses NSE also has a version that uses SE. The name of the SE version is always the NSE name with an _ on the end.'
What does this mean?
We might try the following, but we find it does not work:
# Does not work
r <- dplyr::select_(df, columns_to_select)
Rather than capturing the argument columns_to_select and interpreting it as a column name, the select_ function evaluates columns_to_select in the standard way, so it resolves to c("a", "b").
That's what we want, except that each argument to select_ is a single column, and we've just provided a character vector of length two to represent a single column.
The above code therefore returns a tibble with a single column, a, which is not what we wanted. (Only the first element of the character vector, "a", is used; everything else is ignored.)
One solution to this problem is as follows, but it assumes that columns_to_select contains exactly two elements:
col1 <- columns_to_select[1]
col2 <- columns_to_select[2]
r <- dplyr::select_(df, col1, col2)
How do we generalise this to the case where columns_to_select may have an arbitrary number of elements?
The solution is to use the optional .dots argument.
dplyr::select_(df, .dots=columns_to_select)
This bears some explanation.
In R, the ... construct allows the creation of functions with a variable (arbitrary) number of arguments. The ... is available within the function, and allows the function body to access all of the arguments. See also here.
A very simple example is as follows:
addition <- function(...) {
args <- list(...)
sum(unlist(args))
}
r <- addition(1,2,3)
However, this doesn't immediately help us here. It's actually already implemented in the select_ function and merely enables us to provide an arbitrary number of column names as arguments, e.g. select_(df, "a", "b", "c", "d").
What we need is a mechanism that is similar to ..., but allows us to pass something like ... into the function as a single argument. This is exactly what .dots does.
Note that .dots is not provided by select, because this is designed to be used interactively.
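To make the relationship between ... and a .dots-style argument concrete, here is a minimal base R sketch. pick_columns is a hypothetical function written for this illustration; it only mimics the calling convention described above and says nothing about how select_ works internally.

```r
# Hypothetical helper: accepts column names one by one through `...`,
# or all at once as a single character vector through `.dots`.
pick_columns <- function(data, ..., .dots = character()) {
  individual <- c(...)            # names passed as separate arguments
  wanted <- c(individual, .dots)  # plus names passed as one vector
  data[, wanted, drop = FALSE]
}

df <- data.frame(a = 1:5, b = 6:10, c = 11:15, d = 16:20)
columns_to_select <- c("a", "b")

by_dots_arg <- pick_columns(df, .dots = columns_to_select)  # one vector argument
by_ellipsis <- pick_columns(df, "a", "b")                   # separate arguments
```

Both calls return the same two columns. The difference is only in how the caller supplies the names: the first form works for a vector of any length, while the second requires the names to be typed out individually.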
I ran across this and I thought I should mention that this has been solved in newer versions of dplyr.
myTest = data_frame(
var1 = 1,
var2 = 2,
var3 = 3,
var4 = 4)
i = 1
myTest %>%
select_(.dots =
c(names.gens[i,1], names.gens[i,2]) %>% unname)
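For readers on a current dplyr release, where select_() is deprecated, the same selection can be sketched with the all_of() helper. This assumes dplyr 1.0.0 or later is installed; the data frame and the columns_to_select vector are made up for the example.

```r
library(dplyr)

df <- data.frame(a = 1:5, b = 6:10, c = 11:15, d = 16:20)
columns_to_select <- c("a", "b")  # column names held as strings

# all_of() marks the vector as a set of column names to select
r <- df %>% select(all_of(columns_to_select))
```

Unlike passing the bare vector, all_of() makes the intent explicit, and it raises an error if any of the named columns is missing.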
There is a workaround in dplyr 0.1.2 using regular expressions and matches (see hadley's comment below for information on direct support in future versions). A regular expression such as ^(x1|x2|x3)$ matches exact variable names, so we just have to construct such an expression from a vector of variable names. Here is the code:
# load libraries
library(dplyr)
library(stringr)
# create data.frame
df = data.frame(
x = rep(0,5),
y = 1,
var = 2,
another_var = 5,
var.4 = 6
)
# function to construct reg exp from vector with variable names
varlist = function(x) {
x = str_c('^(',paste(x, collapse='|'),')$')
x = str_replace_all(x,'\\.','\\\\.')
return(x)
}
# select variables based on vector of variable names
vars = c('y','another_var','var.4')
df %>%
select(matches(varlist(vars)))
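To see why the anchors and the escaped dot matter, the pattern can be checked on its own in base R. This sketch rebuilds the same kind of expression without stringr; build_pattern and the candidate names are assumptions introduced for the example.

```r
# Hypothetical base R equivalent of varlist() above.
build_pattern <- function(x) {
  escaped <- gsub("\\.", "\\\\.", x)  # make each dot match a literal dot
  paste0("^(", paste(escaped, collapse = "|"), ")$")
}

pattern <- build_pattern(c("y", "another_var", "var.4"))

# "var" and "varX4" are near misses that an unanchored or
# unescaped pattern would also match.
candidates <- c("x", "y", "var", "another_var", "var.4", "varX4")
matched <- candidates[grepl(pattern, candidates)]
```

Without ^ and $, the name var would match as a substring of another_var's pattern alternatives, and without escaping, the dot in var.4 would match any character, so varX4 would be selected too.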