How do estimation commands find variable names in formulas in R?

Question

I'd like to estimate a large number of models using R's nls() function on a user-defined function. Since many variables are fixed across my specifications, I'd like a way of pre-setting them in my function, but I don't properly understand how R looks for variables in functions contained in a formula.

I've had a look at the section on metaprogramming in Hadley Wickham's Advanced R book, but it hasn't enlightened me. Here is a simplified example of what I'm trying to achieve, using the mtcars dataset.

I have tried setting a default value for variables that are fixed across specifications:

expo <- function(x, theta, weight = wt) {
  x*weight^theta
}

I have also tried just using the column name of the fixed variable as a variable inside the function:

expo <- function(x, theta) {
  x*wt^theta
}

Both these approaches work if I just want to calculate the function, say with

attach(mtcars)
expo(qsec, 1)
detach()

But if I try using my expo() function in an estimation routine, for example

nls(mpg ~ phi + expo(qsec, theta),
    data = mtcars,
    start = c('phi' = -2, 'theta' = 1))

it fails with the message Error in expo(qsec, theta) : object 'wt' not found. One possibility, brought up in the comments, is to simply pass the dataset (mtcars in this case) to expo() as an argument. But since I will only call expo() inside a call to nls(), where the dataset is already an argument, I would be happy to find a way to avoid this repetition.

My ultimate goal after defining or calling expo() appropriately is to be able to do something like this:

# 'disp' replaces 'dist', which is not a column of mtcars
depvars <- c('qsec', 'drat', 'disp')
lapply(depvars, function(x) {
    # build e.g. mpg ~ phi + expo(qsec, theta)
    formula <- as.formula(paste0('mpg ~ phi + expo(', x, ', theta)'))
    nls(formula,
        data = mtcars,
        start = c('phi' = -2, 'theta' = 1))
})

Answer 1:

The tricky part is that R uses lexical scoping: a function looks up free variables in its enclosing environment (where it was defined), not in the environment of whoever called it. During a call, each caller environment has its own chain of enclosing environments, so it gets confusing pretty quickly.

I'll be using the rlang package to debug this scenario.

First, if you defined expo in the global environment, then that will be its enclosing environment:

expo <- function(x, theta) {
  x*wt^theta
}

rlang::get_env(expo)
# <environment: R_GlobalEnv>

So when you call it, R will first search for variables in the function's own execution environment (the one created for that call, not the caller's environment!), and then in the enclosing environment (the global environment here).
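
To make the execution-versus-caller distinction concrete, here is a minimal sketch. The names f, g and y are made up for illustration and have nothing to do with nls:

```r
# Hypothetical names (f, g, y), for illustration only.
y <- "global"

f <- function() {
  y  # free variable: looked up where f was defined, not where it is called
}

g <- function() {
  y <- "caller"  # local to g; f cannot see it even though g calls f
  f()
}

g()
# [1] "global"
```

This is the same situation as expo() and wt: a variable sitting in the caller's environment does not help, because expo() never looks there.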

I don't know what nls does exactly, but I would have assumed that it creates an environment from the data you provide and evaluates the formula there. However, it seems the environment it creates only contains the variables it can explicitly see in the formula, something I found with:

expo <- function(x, theta) {
  cat("caller: ")
  print(ls(rlang::caller_env()))
  cat("enclosing: ")
  print(ls(rlang::env_parent(rlang::current_env())))
}

nls(mpg ~ phi + expo(qsec, theta),
    data = mtcars,
    start = c('phi' = -2, 'theta' = 1))
# caller: [1] "mpg"   "phi"   "qsec"  "theta"
# enclosing: [1] "expo"    
# Error ...

As we can see, the caller environment of expo contains the variables we can identify in the formula, and its enclosing environment only contains the definition of expo (the global environment). This unfortunately means that you can't even use something like eval.parent inside expo, because that environment won't have all variables from data.
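
The names listed for the caller environment match what all.vars() returns for the formula. I believe nls() collects names this way, but I have not traced its source here, so treat that as an assumption. The sketch below only inspects the formulas and does not fit anything:

```r
# all.vars() lists the variable names in a formula (function names excluded).
vars_implicit <- all.vars(mpg ~ phi + expo(qsec, theta))
vars_implicit
# [1] "mpg"   "phi"   "qsec"  "theta"

# Naming wt as an explicit argument makes it part of the formula's variables.
vars_explicit <- all.vars(mpg ~ phi + expo(qsec, theta, wt))
vars_explicit
# [1] "mpg"   "phi"   "qsec"  "theta" "wt"
```

So writing wt into the formula, for example with the expo(x, theta, weight = wt) variant from the question, should make it visible, at the cost of repeating it in every call.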

If you still want to work around it, you could modify expo's enclosing environment with your data before calling nls, something like:

expo <- function(x, theta) {
  x*wt^theta
}

environment(expo) <- list2env(as.list(mtcars))

nls(mpg ~ phi + expo(qsec, theta),
    data = mtcars,
    start = c('phi' = -2, 'theta' = 1))
# Error ... number of iterations exceeded maximum of 50
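
Note that the error has changed from "object 'wt' not found" to a convergence failure, so the name lookup itself now works. A quick way to check the lookup without involving nls at all (a sketch; the comparison against a manual calculation is my addition):

```r
expo <- function(x, theta) {
  x * wt^theta
}

# Copy the columns of mtcars into a new environment and make it expo's
# enclosing environment. list2env() gives the new environment the calling
# frame as its parent by default, so `*` and `^` are still found.
environment(expo) <- list2env(as.list(mtcars))

# wt is now resolved from the data, with no attach() needed.
manual <- mtcars$qsec * mtcars$wt^1
all.equal(expo(mtcars$qsec, 1), manual)
# [1] TRUE
```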

Answer 2:

I've accepted Alexis's answer, since it addresses my original question. I nevertheless thought I would share the solution I adopted, in case anyone finds it useful.

As Alexis says, the solution needs to involve modifying the enclosing environment of expo(). Rather than manually doing this each time (and perhaps changing it back to the original environment after each call to expo()), my approach combines the requirement that expo()'s environment contain the right variables with NelsonGon's suggestion that I feed the dataset as an argument at some point. I do this by creating a function factory, make_expo(), that sets the required variables and returns expo(), so that the variables are automatically in expo()'s enclosing environment:

make_expo <- function(df, vars = c('wt')) {
  wt <- df[[vars[1]]]
  function(x, theta) {
    x + wt^theta
  }
}

expo <- make_expo(mtcars)

nls(mpg ~ phi + expo(qsec, theta),
    data = mtcars,
    start = c('phi' = 1, theta = 1))
# Error ... number of iterations exceeded maximum of 50

I think this has two advantages. First, it's more robust: you don't need to remember to set the environment of expo(), because it's set automatically when expo() is created. At the same time, make_expo() stays flexible, since I can set defaults or feed in different datasets. Second, it keeps the arguments of expo() down to those that I actually expect to vary between calls, which makes the code easier to read.

I was surprised to learn that formulas create an environment in which to look up names that only contains the variables explicitly named in the formula, and not also other variables in the dataset passed to nls(), but there you go.
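
For completeness, here is a rough sketch of how the factory could be combined with the loop from the question. Several things are my own choices rather than anything established above: the product form x * w^theta (taken from the question's expo()), the names w, regressors and fits, the use of disp (mtcars has no dist column), and the tryCatch() wrapper, which is there because nls() may stop with a convergence error for these starting values.

```r
make_expo <- function(df, var = "wt") {
  w <- df[[var]]  # captured in the enclosing environment of the closure
  function(x, theta) {
    x * w^theta
  }
}

expo <- make_expo(mtcars)

regressors <- c("qsec", "drat", "disp")

fits <- lapply(regressors, function(v) {
  # build e.g. mpg ~ phi + expo(qsec, theta)
  f <- as.formula(paste0("mpg ~ phi + expo(", v, ", theta)"))
  # return the error object instead of aborting the whole loop
  tryCatch(
    nls(f, data = mtcars, start = c(phi = -2, theta = 1)),
    error = function(e) e
  )
})
names(fits) <- regressors
```

Each element of fits is then either a fitted nls object or the error that nls() raised, so inherits(fits[["qsec"]], "nls") tells you whether that specification converged.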

Source: https://stackoverflow.com/questions/56704424/how-do-estimation-commands-find-variable-names-in-formulas-in-r

Tags

nls