How to succinctly write a formula with many variables from a data frame?

一整个雨季 2020-11-22 17:01

Suppose I have a response variable and a data frame containing three covariates (as a toy example):

y = c(1,4,6)
d = data.frame(x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))


        
6 Answers
  • 2020-11-22 17:36

    A slightly different approach is to create your formula from a string. The formula help page (?formula) gives the following example:

    ## Create a formula for a model with a large number of variables:
    xnam <- paste("x", 1:25, sep="")
    fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+")))
    

    If you then print the generated formula, you get:

    R> fmla
    y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + 
        x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 + 
        x22 + x23 + x24 + x25
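
The same string-building idea can be applied to the toy data from the question by taking the covariate names from the data frame instead of generating them. This is only a minimal sketch; the names rhs and mod are illustrative, and the x3 values are the ones shown in the other answers.

```r
# Toy data from the question
y <- c(1, 4, 6)
d <- data.frame(x1 = c(4, -1, 3), x2 = c(3, 9, 8), x3 = c(4, -4, -2))

# Paste the column names together to form the right-hand side
rhs <- paste(names(d), collapse = " + ")

# Turn the string into a formula object
fmla <- as.formula(paste("y ~", rhs))

# Fit the model; y is found in the calling environment, covariates in d
mod <- lm(fmla, data = d)
```

Because the names come from names(d), the formula follows the data frame if columns are added or removed.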
    
  • 2020-11-22 17:42

    Yes of course, just add the response y as first column in the dataframe and call lm() on it:

    d2<-data.frame(y,d)
    > d2
      y x1 x2 x3
    1 1  4  3  4
    2 4 -1  9 -4
    3 6  3  8 -2
    > lm(d2)
    
    Call:
    lm(formula = d2)
    
    Coefficients:
    (Intercept)           x1           x2           x3  
        -5.6316       0.7895       1.1579           NA  
    

    Also, as far as I know, assignment with <- is recommended over = in R.
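
To see why passing a data frame works, it may help to look at the formula R derives from it: the first column becomes the response and the remaining columns become the predictors. The sketch below also shows why x3 comes out as NA with this toy data (three rows cannot identify four coefficients). The variable names implied, fit_df and fit_fm are illustrative.

```r
y <- c(1, 4, 6)
d <- data.frame(x1 = c(4, -1, 3), x2 = c(3, 9, 8), x3 = c(4, -4, -2))
d2 <- data.frame(y, d)

# Formula implied by the data frame: first column ~ all other columns
implied <- formula(d2)

# Fitting from the data frame and from the implied formula gives the same fit
fit_df <- lm(d2)
fit_fm <- lm(implied, data = d2)

# Only 3 observations for 4 coefficients: the fit has rank 3,
# so the last coefficient (x3) cannot be estimated and is reported as NA
fit_df$rank
coef(fit_df)
```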

  • 2020-11-22 17:42

    An extension of juba's method is to use reformulate, a function which is explicitly designed for such a task.

    ## Create a formula for a model with a large number of variables:
    xnam <- paste("x", 1:25, sep="")
    
    reformulate(xnam, "y")
    y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + 
        x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 + 
        x22 + x23 + x24 + x25
    

    For the example in the OP, the easiest solution here would be

    # add y variable to data.frame d
    d <- cbind(y, d)
    reformulate(names(d)[-1], names(d[1]))
    y ~ x1 + x2 + x3
    

    or

    mod <- lm(reformulate(names(d)[-1], names(d[1])), data=d)
    

    Note that adding the dependent variable to the data.frame in d <- cbind(y, d) is preferred not only because it allows for the use of reformulate, but also because it allows for future use of the lm object in functions like predict.
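
As a rough illustration of the predict workflow, here is a sketch on simulated data. The data frame train, its coefficients, and new_obs are all made up for the example; a larger sample is used so that every coefficient can be estimated.

```r
set.seed(1)

# Hypothetical training data with the response stored alongside the covariates
train <- data.frame(x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))
train$y <- 1 + 2 * train$x1 - train$x2 + 0.5 * train$x3 + rnorm(20, sd = 0.1)

# Build y ~ x1 + x2 + x3 from the column names
fmla <- reformulate(setdiff(names(train), "y"), response = "y")
mod <- lm(fmla, data = train)

# New observations only need the covariate columns
new_obs <- data.frame(x1 = c(0, 1), x2 = c(0, 1), x3 = c(0, 1))
pred <- predict(mod, newdata = new_obs)
pred
```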

  • 2020-11-22 17:46

    You can check the package leaps, and in particular its function regsubsets(), for model selection. As stated in the documentation:

    Model selection by exhaustive search, forward or backward stepwise, or sequential replacement
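
A possible way to combine this with the formula-building answers is sketched below, using the built-in mtcars data rather than the toy example. It assumes the leaps package is installed; the choice of forward search, nvmax = 3, and the names subsets, which_vars and best3 are illustrative only.

```r
# Run only if the leaps package is available
if (requireNamespace("leaps", quietly = TRUE)) {
  # Forward stepwise search over all covariates, up to 3 predictors
  subsets <- leaps::regsubsets(mpg ~ ., data = mtcars,
                               nvmax = 3, method = "forward")

  # Logical matrix: one row per model size, one column per term
  which_vars <- summary(subsets)$which
  print(which_vars)

  # Terms selected in the 3-predictor model (drop the intercept column)
  best3 <- colnames(which_vars)[which_vars[3, ]]
  fmla <- reformulate(setdiff(best3, "(Intercept)"), response = "mpg")
  print(fmla)
}
```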

  • 2020-11-22 17:55

    I built this solution because reformulate does not handle variable names that contain white space.

    add_backticks = function(x) {
        paste0("`", x, "`")
    }
    
    x_lm_formula = function(x) {
        paste(add_backticks(x), collapse = " + ")
    }
    
    build_lm_formula = function(x, y){
        if (length(y)>1){
            stop("y needs to be just one variable")
        }
        as.formula(        
            paste0("`",y,"`", " ~ ", x_lm_formula(x))
        )
    }
    
    # Example
    df <- data.frame(
        y = c(1,4,6), 
        x1 = c(4,-1,3), 
        x2 = c(3,9,8), 
        x3 = c(4,-4,-2)
        )
    
    # Model Specification
    columns = colnames(df)
    y_cols = columns[1]
    x_cols = columns[2:length(columns)]
    formula = build_lm_formula(x_cols, y_cols)
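    formula
    # output: a formula object (not a string); backticks around
    # syntactic names are dropped when it is printed
    # y ~ x1 + x2 + x3
    
    # Run Model
    lm(formula = formula, data = df)
    # output
    # Call:
    # lm(formula = formula, data = df)
    #
    # Coefficients:
    # (Intercept)           x1           x2           x3  
    #     -5.6316       0.7895       1.1579           NA  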
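
The backtick approach only matters when a column name is not a syntactic R name. Here is a small sketch with a made-up column called "body mass"; the data values and the names x_cols, quoted and mod are hypothetical.

```r
# Hypothetical data with a space in one column name;
# check.names = FALSE keeps the name as written
df <- data.frame(
  y = c(1, 4, 6, 3, 8),
  "body mass" = c(4, -1, 3, 2, 5),
  age = c(3, 9, 8, 1, 7),
  check.names = FALSE
)

# Wrap every covariate name in backticks before pasting
x_cols <- setdiff(names(df), "y")
quoted <- paste0("`", x_cols, "`")
fmla <- as.formula(paste("y ~", paste(quoted, collapse = " + ")))

# The formula now parses even though one name contains a space
mod <- lm(fmla, data = df)
coef(mod)
```

Backticks are harmless for ordinary names, so quoting every name avoids having to test which ones need it.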
    

  • 2020-11-22 17:58

    There is a special identifier that one can use in a formula to mean all the variables: the . identifier.

    y <- c(1,4,6)
    d <- data.frame(y = y, x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
    mod <- lm(y ~ ., data = d)
    

    You can also do things like this, to use all variables but one (in this case x3 is excluded):

    mod <- lm(y ~ . - x3, data = d)
    

    Technically, . means all variables not already mentioned in the formula. For example

    lm(y ~ x1 * x2 + ., data = d)
    

    where . would only reference x3 as x1 and x2 are already in the formula.
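
One way to check what . expands to, without fitting anything, is to ask terms() for the term labels. This is a small sketch on the same toy data; labels_minus and labels_inter are illustrative names.

```r
y <- c(1, 4, 6)
d <- data.frame(y = y, x1 = c(4, -1, 3), x2 = c(3, 9, 8), x3 = c(4, -4, -2))

# Expand the dot against the columns of d and list the resulting terms
labels_minus <- attr(terms(y ~ . - x3, data = d), "term.labels")
labels_minus

# Here the dot only contributes x3, since x1 and x2 are already mentioned
labels_inter <- attr(terms(y ~ x1 * x2 + ., data = d), "term.labels")
labels_inter
```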
