Get number of data in each factor level (as well as interaction) from a fitted lm or glm [R]

Question

I have a logistic regression model in R, where all of the predictor variables are categorical rather than continuous (in addition to the response variable, which is also obviously categorical/binary).

When calling summary(model_name), is there a way to include a column representing the number of observations within each factor level?

Answer 1:

I have a logistic regression model in R, where all of the predictor variables are categorical rather than continuous.

If all your covariates are factors (not counting the intercept), this is fairly easy: the model matrix contains only 0s and 1s, and the number of 1s in each column is the number of observations at that factor level (or interaction level) in your data. So just do colSums(model.matrix(your_glm_model_object)).

Since a model matrix has column names, colSums gives you a named vector whose names are consistent with names(coef(your_glm_model_object)).

The same solution applies to a linear model (by lm) and a generalized linear model (by glm) for any distribution family.

Here is a quick example:

set.seed(0)
f1 <- sample(gl(2, 50))  ## a factor with 2 levels, each with 50 observations
f2 <- sample(gl(4, 25))  ## a factor with 4 levels, each with 25 observations
y <- rnorm(100)
fit <- glm(y ~ f1 * f2)  ## or use `lm`, since the default `gaussian()` family is used here
colSums(model.matrix(fit))
#(Intercept)         f12         f22         f23         f24     f12:f22 
#        100          50          25          25          25          12 
#    f12:f23     f12:f24 
#         12          14

Here, we have 100 observations / complete-cases (indicated under (Intercept)).
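To put the counts next to the coefficients, as the question asks, one option is to bind them onto the coefficient table returned by summary. This is only a sketch under assumptions not in the original example: the binary response from rbinom is simulated for illustration, and the names coef_table, n_obs and coef_with_n are made up here.

```r
set.seed(0)
f1 <- sample(gl(2, 50))
f2 <- sample(gl(4, 25))
y <- rbinom(100, 1, 0.5)            # simulated binary response (illustrative only)
fit <- glm(y ~ f1 * f2, family = binomial())

coef_table <- coef(summary(fit))    # Estimate, Std. Error, z value, Pr(>|z|)
n_obs <- colSums(model.matrix(fit)) # number of 1s per model-matrix column

# Match by name rather than by position, because summary() drops
# aliased (NA) coefficients from its table while the model matrix keeps them.
coef_with_n <- cbind(coef_table, n = n_obs[rownames(coef_table)])
print(coef_with_n)
```

The n column is only a count when every column of the model matrix is a 0/1 dummy; for a numeric covariate it would be the sum of that covariate instead.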

Is there a way to display the count for the baseline level of each factor?

Baseline levels are absorbed by the contrasts, so they don't appear in the model matrix used for fitting. However, we can generate the full model matrix (without contrasts) from your formula rather than from your fitted model (this also gives you a way to drop numeric variables if you have them in your model):

SET_CONTRAST <- list(f1 = contr.treatment(nlevels(f1), contrasts = FALSE),
                     f2 = contr.treatment(nlevels(f2), contrasts = FALSE))
X <- model.matrix(~ f1 * f2, contrasts.arg = SET_CONTRAST)
colSums(X)
#(Intercept)         f11         f12         f21         f22         f23 
#        100          50          50          25          25          25 
#        f24     f11:f21     f12:f21     f11:f22     f12:f22     f11:f23 
#         25          13          12          13          12          13 
#    f12:f23     f11:f24     f12:f24 
#         12          11          14

Note that setting contrasts by hand quickly becomes tedious when you have many factor variables.
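When the variables live in a data frame, the contrast list can be built programmatically instead. A minimal sketch, assuming a hypothetical data frame dat in which every factor column also appears in the formula (model.matrix warns about contrasts supplied for variables absent from the model):

```r
set.seed(0)
dat <- data.frame(f1 = sample(gl(2, 50)),
                  f2 = sample(gl(4, 25)),
                  y  = rnorm(100))

# Pick out the factor columns and ask each one for its full
# (identity) coding via contrasts(x, contrasts = FALSE).
is_fac <- vapply(dat, is.factor, logical(1))
full_contrasts <- lapply(dat[is_fac], contrasts, contrasts = FALSE)

X <- model.matrix(~ f1 * f2, data = dat, contrasts.arg = full_contrasts)
colSums(X)
```

This builds the same kind of list as SET_CONTRAST, with one entry per factor, without typing each entry out.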

model.matrix is definitely not the only approach for this. The conventional way may be

table(f1)
table(f2)
table(f1, f2)

but this can get tedious too when your model becomes complicated.
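One caveat with calling table on the raw vectors is that it counts every row, including rows that glm dropped because of missing values. A hedged sketch of tabulating only the rows used in the fit, via model.frame; the NA values are injected purely for illustration, and the names mf and cell_counts are made up here:

```r
set.seed(0)
f1 <- sample(gl(2, 50))
f2 <- sample(gl(4, 25))
y <- rnorm(100)
y[1:5] <- NA                        # hypothetical missing responses
fit <- glm(y ~ f1 * f2)             # default na.action drops the 5 incomplete rows

mf <- model.frame(fit)              # the rows actually used in the fit
cell_counts <- xtabs(~ f1 + f2, data = mf)
addmargins(cell_counts)             # cell counts plus row and column totals
```

The margins added by addmargins give the per-level counts for f1 and f2, and the inner cells give the counts for each combination of the two factors.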

Source: https://stackoverflow.com/questions/51408919/get-number-of-data-in-each-factor-level-as-well-as-interaction-from-a-fitted-l

Tags

regression

linear-regression

glm