Question
I was trying to figure out how weighting in lm actually works, and I came across this 7.5-year-old question, which gives some insight into how weights work. The data from that question is partly copied and expanded on below.
I also posted this related question on Cross Validated.
library(plyr)

# 200 weighted observations: each row carries a frequency weight from 1 to 10
set.seed(100)
df <- data.frame(uid = 1:200,
                 bp = sample(x = c(100:200), size = 200, replace = TRUE),
                 age = sample(x = c(30:65), size = 200, replace = TRUE),
                 weight = sample(c(1:10), size = 200, replace = TRUE),
                 stringsAsFactors = FALSE)

# Same data, but with every weight doubled
set.seed(100)
df.double_weights <- data.frame(uid = 1:200,
                                bp = sample(x = c(100:200), size = 200, replace = TRUE),
                                age = sample(x = c(30:65), size = 200, replace = TRUE),
                                weight = 2 * df$weight,
                                stringsAsFactors = FALSE)

# Expand each row 'weight' times into unweighted observations
df.expand <- ddply(df,
                   c("uid"),
                   function(df) {
                     data.frame(bp = rep(df[, "bp"], df[, "weight"]),
                                age = rep(df[, "age"], df[, "weight"]),
                                stringsAsFactors = FALSE)
                   })

df.lm <- lm(bp ~ age, data = df, weights = weight)
df.double_weights.lm <- lm(bp ~ age, data = df.double_weights, weights = weight)
df.expand.lm <- lm(bp ~ age, data = df.expand)

summary(df.lm)
summary(df.double_weights.lm)
summary(df.expand.lm)
These three data.frames consist of exactly the same data. However:
In df, there are 200 weighted observations whose weights add up to 1178 (sum(df$weight) == 1178).
In df.double_weights, the weights are simply doubled (sum(df.double_weights$weight) == 2356).
In df.expand, there are 1178 unweighted observations instead of 200 weighted ones.
The coefficients for both summary(df.lm) and summary(df.double_weights.lm) are the same, and so is the significance (which means that, IF THE WEIGHTING WORKS PROPERLY, the absolute size of the weights is irrelevant). EDIT: It seems, however, that the absolute size does matter; see the bottom result.
By contrast, for summary(df.lm) and summary(df.expand.lm), the coefficients are the same but the significance differs.
summary(df.lm)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 165.6545 10.3850 15.951 <2e-16 ***
age -0.2852 0.2132 -1.338 0.183
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 98.84 on 198 degrees of freedom
Multiple R-squared: 0.008956, Adjusted R-squared: 0.003951
F-statistic: 1.789 on 1 and 198 DF, p-value: 0.1825
summary(df.expand.lm)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 165.65446 4.26123 38.88 < 2e-16 ***
age -0.28524 0.08749 -3.26 0.00115 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 28.68 on 1176 degrees of freedom
Multiple R-squared: 0.008956, Adjusted R-squared: 0.008114
F-statistic: 10.63 on 1 and 1176 DF, p-value: 0.001146
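Since the coefficients agree, the entire gap must come from the residual degrees of freedom that summary() uses for the t-tests and the F-test. A quick check (a sketch, assuming the three models fitted above) makes this visible:

all.equal(coef(df.lm), coef(df.expand.lm))
# TRUE: identical point estimates
df.lm$df.residual        # 198  = 200 rows  - 2 coefficients
df.expand.lm$df.residual # 1176 = 1178 rows - 2 coefficients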
According to @IRTFM, the degrees of freedom are not being added up properly, and they provided this code to fix it:
df.lm.aov <- anova(df.lm)

# Replace the residual Df (rows minus other terms minus intercept, here 198)
# with sum(weights) minus the same terms, here 1178 - 1 - 1 = 1176
df.lm.aov$Df[length(df.lm.aov$Df)] <-
  sum(df.lm$weights) -
  sum(df.lm.aov$Df[-length(df.lm.aov$Df)]) - 1

# Recompute the mean squares, F statistic and p-value from the corrected Df
df.lm.aov$`Mean Sq` <- df.lm.aov$`Sum Sq` / df.lm.aov$Df
df.lm.aov$`F value`[1] <- df.lm.aov$`Mean Sq`[1] / df.lm.aov$`Mean Sq`[2]
df.lm.aov$`Pr(>F)`[1] <- pf(df.lm.aov$`F value`[1], 1,
                            df.lm.aov$Df, lower.tail = FALSE)[2]
df.lm.aov
Analysis of Variance Table
Response: bp
Df Sum Sq Mean Sq F value Pr(>F)
age 1 8741 8740.5 10.628 0.001146 **
Residuals 1176 967146 822.4
Now, almost 8 years later, this problem apparently still persists. (Doesn't this mean that almost all research that used weighted variables in combination with lm in R has reported significance values that are too low?) More practically, the problem I have is that I hardly understand what IRTFM is doing, or how it relates to multiple regression analysis (or even to other functions that use lm under the hood).
QUESTION: Is there a more general way to solve this issue that can be applied to multiple regression?
EDIT: If we run IRTFM's solution on df.double_weights.lm, we get a different result, so apparently the absolute size of the weights DOES matter.
Analysis of Variance Table
Response: bp
Df Sum Sq Mean Sq F value Pr(>F)
age 1 17481 17481.0 21.274 4.194e-06 ***
Residuals 2354 1934293 821.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
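Here is a sketch of where the difference comes from (assuming IRTFM's patch as written above): the patched residual Df is derived directly from sum(weights), so doubling every weight doubles the implied number of observations and shrinks the p-value accordingly.

sum(df$weight) - 2                # 1176: matches the expanded data
sum(df.double_weights$weight) - 2 # 2354: pretends each row was observed twice as often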
Answer 1:
If I understand your question correctly, what you have in your weights column is often called "frequency weights". They are used to save space in your dataset by indicating how many observations you have for each combination of covariates.
To estimate a model with an "aggregated" dataset and obtain correct standard errors, all you need to do is correct the number of degrees of freedom in your lm model.
The correct number of degrees of freedom is the total number of observations minus the number of parameters in your model. This can be calculated by taking the sum of your weights variable, or by looking at the total number of observations in the "full" data, and subtracting the number of parameters estimated (i.e., coefficients).
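For instance, with the data from the question (a sketch, assuming the asker's df, df.expand, and df.expand.lm are in scope), both routes give the same number:

sum(df$weight) - 2   # 1176: sum of the frequency weights minus 2 coefficients
nrow(df.expand) - 2  # 1176: same thing, counted from the "full" data
# 1176 is exactly the residual df that summary(df.expand.lm) reports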
Here's a simpler example, which I think makes the point clearer:
library(dplyr)
library(modelsummary)

set.seed(1024)

# individual (true) dataset
x <- round(rnorm(1e5))
y <- round(x + x^2 + rnorm(1e5))
ind <- data.frame(x, y)

# aggregated dataset: one row per (x, y) combination, with its frequency
agg <- ind %>%
  group_by(x, y) %>%
  summarize(freq = n())

models <- list(
  "True" = lm(y ~ x, data = ind),
  "Aggregated" = lm(y ~ x, data = agg),
  "Aggregated & W" = lm(y ~ x, data = agg, weights = freq),
  "Aggregated & W & DF" = lm(y ~ x, data = agg, weights = freq)
)
Now we want to correct the number of degrees of freedom of the last model in our list. We do this by taking the sum of our freq column. We could also use nrow(ind), since those are identical:
# correct degrees of freedom
models[[4]]$df.residual <- sum(agg$freq) - length(coef(models[[4]]))
Finally, we summarize all four models using the modelsummary package. Notice that the first and last models are exactly the same, even though the first was estimated using the full individual dataset and the last was estimated using the aggregated data:
modelsummary(models, fmt=5)
|             | True        | Aggregated | Aggregated & W | Aggregated & W & DF |
|-------------|-------------|------------|----------------|---------------------|
| (Intercept) | 1.08446     | 5.51391    | 1.08446        | 1.08446             |
|             | (0.00580)   | (0.71710)  | (0.22402)      | (0.00580)           |
| x           | 1.00898     | 0.91001    | 1.00898        | 1.00898             |
|             | (0.00558)   | (0.30240)  | (0.21564)      | (0.00558)           |
| Num.Obs.    | 1e+05       | 69         | 69             | 69                  |
| R2          | 0.246       | 0.119      | 0.246          | 0.246               |
| R2 Adj.     | 0.246       | 0.106      | 0.235          | 0.999               |
| AIC         | 405058.1    | 446.0      | 474.1          | 474.1               |
| BIC         | 405086.7    | 452.7      | 480.8          | 480.8               |
| Log.Lik.    | -202526.074 | -219.977   | -234.046       | -234.046            |
| F           | 32676.664   | 9.056      | 21.894         | 32676.664           |
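As for the "more general way" asked about in the question: this one-line correction applies to any lm() fit with frequency weights, including multiple regressions, because it depends only on the sum of the weights and the number of estimated coefficients. A hypothetical wrapper (the name fix_freq_df and the helper itself are mine, not part of the original answer) could look like this:

fix_freq_df <- function(fit) {
  # Assumes fit came from lm(..., weights = <frequency weights>)
  stopifnot(!is.null(fit$weights))
  # residual df = implied number of observations - number of coefficients
  fit$df.residual <- sum(fit$weights) - length(coef(fit))
  fit
}

summary() computes the residual variance, standard errors, t-tests, and F-test from df.residual, so summary(fix_freq_df(df.lm)) should reproduce the coefficient table of summary(df.expand.lm).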
Source: https://stackoverflow.com/questions/65458098/correcting-dfs-when-using-sample-weights-with-lm