Let's say I have the following formula:
myformula <- formula("depVar ~ Var1 + Var2")
How can I reliably get the dependent variable name from the formula?
Try using all.vars:
all.vars(myformula)[1]
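To see why the indexing works here, it helps to look at the whole vector that all.vars returns. A small check, reusing myformula from the question (the name vars is only for illustration):

```r
myformula <- formula("depVar ~ Var1 + Var2")

# all.vars() lists variable names in order of first appearance,
# so for a two-sided formula the left-hand side comes first
vars <- all.vars(myformula)
vars
# [1] "depVar" "Var1"   "Var2"

vars[1]
# [1] "depVar"
```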
I know this question is quite old, but I thought I'd add a base R answer which doesn't require indexing, doesn't depend on the order of the variables listed in a call to all.vars, and which gives the response variables as separate elements when there is more than one:
myformula <- formula("depVar1 + depVar2 ~ Var1 + Var2")
all_vars <- all.vars(myformula)
response <- all_vars[!(all_vars %in% labels(terms(myformula)))]
> response
[1] "depVar1" "depVar2"
Using all.vars is tricky, as it won't detect the response from a one-sided formula. For example:
all.vars(~x+1)
[1] "x"
Here "x" is a predictor, not a response, so taking the first element would be wrong.
Here is a more reliable way of getting the response:
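getResponseFromFormula = function(formula) {
  formula = as.formula(formula)  # also accept a character string
  # the 'response' attribute is 0 for a one-sided formula
  if (attr(terms(formula), which = 'response'))
    all.vars(formula)[1]
  else
    NULL
}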
getResponseFromFormula(~x+1)
NULL
getResponseFromFormula(y~x+1)
[1] "y"
Note that you can replace all.vars(formula)[1] in the function with formula[[2]] if the formula contains more than one variable for the response. This returns the whole left-hand side as an unevaluated expression rather than a single name.
Based on your edit to get the actual response, not just its name, we can use the nonstandard evaluation idiom employed by lm() and most other modelling functions with a formula interface in base R:
form <- formula("depVar ~ Var1 + Var2")
dat <- data.frame(depVar = rnorm(10), Var1 = rnorm(10), Var2 = rnorm(10))
getResponse <- function(formula, data) {
    # the argument must be named "formula" so that match() below finds it
    mf <- match.call(expand.dots = FALSE)
    m <- match(c("formula", "data"), names(mf), 0L)
    mf <- mf[c(1L, m)]
    mf$drop.unused.levels <- TRUE
    mf[[1L]] <- as.name("model.frame")
    mf <- eval(mf, parent.frame())
    y <- model.response(mf, "numeric")
    y
}
> getResponse(form, dat)
1 2 3 4 5
-0.02828573 -0.41157817 2.45489291 1.39035938 -0.31267835
6 7 8 9 10
-0.39945771 -0.09141438 0.81826105 0.37448482 -0.55732976
As you see, this gets the actual response variable data from the supplied data frame.
This works as follows. The function first captures its own call without expanding the ... argument, since that contains things not needed to evaluate the data for the formula. Next, the "formula" and "data" arguments are matched against the call. The line mf[c(1L, m)] keeps the function name from the call (1L) and the two matched arguments. The next line sets the drop.unused.levels argument of model.frame() to TRUE, and then the function name in the call is switched to model.frame (in lm() it is switched from lm; here, from getResponse). All the above code does is take the original call and turn it into a call to the model.frame() function.
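The call rewriting can be inspected by returning the modified call instead of evaluating it. This is only a sketch: showModelFrameCall is a hypothetical helper, and dat and w are placeholder names that never need to exist because nothing is evaluated.

```r
# Hypothetical helper: same steps as above, but stops before eval()
showModelFrameCall <- function(formula, data, ...) {
    mf <- match.call(expand.dots = FALSE)            # capture the call as written
    m <- match(c("formula", "data"), names(mf), 0L)  # positions of the two arguments
    mf <- mf[c(1L, m)]                               # keep function name + matches
    mf$drop.unused.levels <- TRUE
    mf[[1L]] <- as.name("model.frame")               # swap the function name
    mf                                               # return the unevaluated call
}

# The extra weights argument lands in ... and is dropped from the result
cl <- showModelFrameCall(depVar ~ Var1 + Var2, dat, weights = w)
print(cl)
```

The printed call should contain only formula, data and drop.unused.levels, with model.frame as the function being called.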
This modified call is then evaluated in the calling frame of the function (parent.frame()), which in this case is the global environment.
The last line uses the model.response() extractor function to take the response variable from the model frame.
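When the formula and data are already ordinary objects, rather than arguments that have to be captured from a call, the same two steps can be written directly. A minimal sketch with made-up data; set.seed() is added here only so the example is reproducible:

```r
set.seed(1)
form <- formula("depVar ~ Var1 + Var2")
dat <- data.frame(depVar = rnorm(10), Var1 = rnorm(10), Var2 = rnorm(10))

# Build the model frame, then pull the response column out of it
mf <- model.frame(form, data = dat, drop.unused.levels = TRUE)
y <- model.response(mf, "numeric")

# The response carries the row names; the values match the original column
all.equal(unname(y), dat$depVar)
# [1] TRUE
```

The match.call() version is mainly useful inside functions that need to forward their own arguments, as lm() does.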
I suppose you could also cook your own function to work with terms():
getResponse <- function(formula) {
    tt <- terms(formula)
    vars <- as.character(attr(tt, "variables"))[-1] ## [1] is the list call
    response <- attr(tt, "response") # index of response var
    vars[response]
}
R> myformula <- formula("depVar ~ Var1 + Var2")
R> getResponse(myformula)
[1] "depVar"
It is just as hacky as as.character(myformula)[[2]], but you have the assurance that you get the correct variable, as the ordering of the call parse tree isn't going to change any time soon.
This isn't so good with multiple dependent variables:
R> myformula <- formula("depVar1 + depVar2 ~ Var1 + Var2")
R> getResponse(myformula)
[1] "depVar1 + depVar2"
as they'll need further processing.
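One possible form of that further processing is to keep the response as an unevaluated expression and pass it to all.vars, instead of converting it to a string. A sketch under that assumption; getResponseVars is a hypothetical name, not part of the answer above:

```r
# Hypothetical helper: returns every variable name on the left-hand side
getResponseVars <- function(formula) {
    tt <- terms(formula)
    response <- attr(tt, "response")       # 0 if the formula is one-sided
    if (response == 0L) return(character(0))
    # attr(tt, "variables") is a call to list(); element 1 is `list` itself,
    # so the response expression sits at position response + 1
    all.vars(attr(tt, "variables")[[response + 1L]])
}

getResponseVars(depVar1 + depVar2 ~ Var1 + Var2)
# [1] "depVar1" "depVar2"

getResponseVars(~ x + 1)
# character(0)
```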
I found a useful package, 'formula.tools', which is suitable for your task.
Example:
library(formula.tools)
f <- as.formula(a1 + a2~a3 + a4)
lhs.vars(f) #get dependent variables
[1] "a1" "a2"
rhs.vars(f) #get independent variables
[1] "a3" "a4"