Any pitfalls to using programmatically constructed formulas?

拜拜、爱过 提交于 2019-12-04 03:13:32

问题


I'm wanting to run through a long vector of potential explanatory variables, regressing a response variable on each in turn. Rather than paste together the model formula, I'm thinking of using reformulate(), as demonstrated here.

The function fun() below seems to do the job, fitting the desired model. Notice, though, that it records in its call element the name of the constructed formula object rather than its value.

## (1) Function using programmatically constructed formula
fun <- function(XX) {
    ff <- reformulate(response="mpg", termlabels=XX)
    lm(ff, data=mtcars)
}
fun(XX=c("cyl", "disp"))
# 
# Call:
# lm(formula = ff, data = mtcars)                 <<<--- Note recorded call
# 
# Coefficients:
# (Intercept)          cyl         disp  
#    34.66099     -1.58728     -0.02058  

## (2) Result of directly specified formula (just for purposes of comparison)
lm(mpg ~ cyl + disp, data=mtcars)
# 
# Call:
# lm(formula = mpg ~ cyl + disp, data = mtcars)   <<<--- Note recorded call
# 
# Coefficients:
# (Intercept)          cyl         disp  
#    34.66099     -1.58728     -0.02058  

My question: Is there any danger in this? Can this become a problem if, for instance, I want to later apply update, or predict or some other function to the model fit object, (possibly from some other environment)?

A slightly more awkward alternative that does, nevertheless, get the recorded call right is to use eval(substitute()). Is this in any way a generally safer construct?

fun2 <- function(XX) {
    ff <- reformulate(response="mpg", termlabels=XX)
    eval(substitute(lm(FF, data=mtcars), list(FF=ff)))
}
fun2(XX=c("cyl", "disp"))$call
## lm(formula = mpg ~ cyl + disp, data = mtcars)

回答1:


I'm always hesitant to claim there are no situations in which something involving R environments and scoping might bite, but ... after some more exploration, my first usage above does look safe.

It turns out that the printed call is a bit of red herring.

The formula that actually gets used by other functions (and the one extracted by formula() and as.formula()) is the one stored in the terms element of the fit object, and it gets the actual formula right. (The terms element contains an object of class "terms", which is just a "formula" with a bunch of attached attributes.)

To see that all of the proposals in my question and the associated comments store the same "formula" object (up to the associated environment), run the following.

## First the three approaches in my post
formula(fun(XX=c("cyl", "disp")))
# mpg ~ cyl + disp
# <environment: 0x026d2b7c>

formula(lm(mpg ~ cyl + disp, data=mtcars))
# mpg ~ cyl + disp

formula(fun2(XX=c("cyl", "disp"))$call)
# mpg ~ cyl + disp
# <environment: 0x02c4ce2c>

## Then Gabor Grothendieck's idea
XX = c("cyl", "disp")
ff <- reformulate(response="mpg", termlabels=XX)
formula(do.call("lm", list(ff, quote(mtcars))))  
## mpg ~ cyl + disp

To confirm that formula() really is deriving its output from the terms element of the fit object, have a look at stats:::formula.lm and stats:::formula.terms.



来源:https://stackoverflow.com/questions/17395724/any-pitfalls-to-using-programmatically-constructed-formulas

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!