Selecting the statistically significant variables in an R glm model

别跟我提以往 2020-12-23 22:52

I have an outcome variable, say Y, and a list of 100 dimensions that could affect Y (say X1...X100).

After running my glm and viewing a summary of my mod

4 answers
  • 2020-12-23 23:37

    You can access the p-values of a glm fit through the function summary(). The last column of the coefficients matrix is called "Pr(>|t|)" (or "Pr(>|z|)" for families with a fixed dispersion, such as binomial and poisson) and holds the p-values of the terms used in the model.

    Here's an example:

    # x is a 10 x 3 matrix of predictors
    x <- matrix(rnorm(3 * 10), ncol = 3)
    y <- rnorm(10)
    res <- glm(y ~ x)
    # p-values are in column 4; drop the intercept row
    summary(res)$coefficients[-1, 4] < 0.05
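
Because the name of the p-value column depends on the family, it can be safer to look the column up by name rather than by position. The following is a minimal sketch under assumed data: the seed, the sample size, and the binomial family are illustrative choices, not part of the original answer.

```r
set.seed(1)                                # arbitrary seed, only for reproducibility
x <- matrix(rnorm(3 * 50), ncol = 3)
y <- rbinom(50, size = 1, prob = 0.5)
res <- glm(y ~ x, family = binomial)

coefs <- coef(summary(res))                # same matrix as summary(res)$coefficients
p_col <- grep("^Pr\\(", colnames(coefs))   # matches "Pr(>|t|)" and "Pr(>|z|)"
pvals <- coefs[-1, p_col]                  # drop the intercept row
pvals < 0.05                               # named logical vector, one entry per term
```

The same lookup works unchanged for a gaussian fit, where the column is named "Pr(>|t|)".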
    
  • 2020-12-23 23:42

    Although @kith paved the way, there is more that can be done. Actually, the whole process can be automated. First, let's create some data:

    x1 <- rnorm(10)
    x2 <- rnorm(10)
    x3 <- rnorm(10)
    y <- rnorm(10)
    x4 <- y + 5 # this will make a nice significant variable to test our code
    (mydata <- data.frame(x1, x2, x3, x4, y))
    

    Our model is then:

    model <- glm(formula=y~x1+x2+x3+x4,data=mydata)
    

    The logical vector marking which coefficients are significant can then be extracted with:

    toselect.x <- summary(model)$coeff[-1,4] < 0.05 # credit to kith
    

    But this is not all! In addition, we can do this:

    # select sig. variables
    relevant.x <- names(toselect.x)[toselect.x]
    # formula with only sig variables
    sig.formula <- as.formula(paste("y ~", paste(relevant.x, collapse = "+")))
    

    EDIT: this line originally read sig.formula <- as.formula(paste("y ~",relevant.x)). As subsequent posters have pointed out, that version keeps only the first variable; the inner paste(relevant.x, collapse= "+") is needed to include all of them.

    And run the regression with only significant variables as OP originally wanted:

    sig.model <- glm(formula=sig.formula,data=mydata)
    

    In this case the estimate for x4 will be equal to 1 (with an intercept of -5), as we have defined x4 as y + 5, which implies a perfect linear relationship.
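
The steps above can be wrapped into a single function. This is only a sketch: refit_significant is a hypothetical helper name, the simulated data are made up for illustration, and the code assumes a model with an intercept and numeric predictors (for factors, coefficient names would not match column names). It uses reformulate() from the stats package instead of pasting strings.

```r
# Hypothetical helper (not from the thread): refit a glm with only the
# predictors whose p-value is below alpha. Assumes an intercept in the first
# row and numeric predictors, so coefficient names match columns of `data`.
refit_significant <- function(model, data, response, alpha = 0.05) {
  coefs <- coef(summary(model))
  rows  <- rownames(coefs)[-1]            # drop the intercept
  pvals <- coefs[rows, ncol(coefs)]       # last column holds the p-values
  keep  <- rows[pvals < alpha]
  if (length(keep) == 0) return(NULL)     # nothing significant: no refit
  glm(reformulate(keep, response = response),
      data = data, family = family(model))
}

set.seed(42)                              # arbitrary seed for reproducibility
n   <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 2 * dat$x4 + rnorm(n)            # by construction, only x4 drives y

full <- glm(y ~ x1 + x2 + x3 + x4, data = dat)
sig  <- refit_significant(full, dat, response = "y")
formula(sig)                              # formula of the reduced model
```

Returning NULL when nothing passes the threshold avoids building an invalid formula such as "y ~" with an empty right-hand side. Note that a noise variable can still pass the 0.05 threshold by chance.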

  • 2020-12-23 23:47

    In

    sig.formula <- as.formula(paste("y ~",relevant.x))

    only the first variable of relevant.x ends up in the formula; the others are ignored. To see this, try for example changing the condition to > 0.5, so that several variables are likely to be selected.
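
The reason is that paste() is vectorised: given several variable names it returns several strings, and only the first is used to build the formula. A small illustration, where the two selected names are a hypothetical example:

```r
relevant.x <- c("x1", "x4")               # hypothetical selection of two variables

paste("y ~", relevant.x)                  # two strings: "y ~ x1" "y ~ x4"
rhs <- paste(relevant.x, collapse = "+")  # one string: "x1+x4"
paste("y ~", rhs)                         # one string: "y ~ x1+x4"
```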

  • 2020-12-23 23:52

    For people having issues with this line from Maxim.K's answer:

    sig.formula <- as.formula(paste("y ~",relevant.x))
    

    use this instead:

    sig.formula <- as.formula(paste("y ~",paste(relevant.x, collapse= "+")))
    

    The final code will look like this:

    toselect.x <- summary(glmText)$coeff[-1,4] < 0.05 # credit to kith
    # select sig. variables
    relevant.x <- names(toselect.x)[toselect.x == TRUE] 
    # formula with only sig variables
    sig.formula <- as.formula(paste("y ~",paste(relevant.x, collapse= "+")))  
    

    This fixes the bug where only the first variable is picked up.
