Question
I was trying to figure out how weighting in lm actually works, and I came across this 7.5-year-old question, which gives some insight into how weights work. The data from that question is partly copied and expanded on below.
I also posted this related question on Cross Validated.
library(plyr)

# 200 weighted observations: each row carries a frequency weight from 1 to 10
set.seed(100)
df <- data.frame(uid = 1:200,
                 bp = sample(x = c(100:200), size = 200, replace = TRUE),
                 age = sample(x = c(30:65), size = 200, replace = TRUE),
                 weight = sample(c(1:10), size = 200, replace = TRUE),
                 stringsAsFactors = FALSE)

# Same data, but with every weight doubled
set.seed(100)
df.double_weights <- data.frame(uid = 1:200,
                                bp = sample(x = c(100:200), size = 200, replace = TRUE),
                                age = sample(x = c(30:65), size = 200, replace = TRUE),
                                weight = 2 * df$weight,
                                stringsAsFactors = FALSE)

# Expand each row 'weight' times into unweighted observations
df.expand <- ddply(df,
                   c("uid"),
                   function(df) {
                     data.frame(bp = rep(df[, "bp"], df[, "weight"]),
                                age = rep(df[, "age"], df[, "weight"]),
                                stringsAsFactors = FALSE)
                   })

df.lm <- lm(bp ~ age, data = df, weights = weight)
df.double_weights.lm <- lm(bp ~ age, data = df.double_weights, weights = weight)
df.expand.lm <- lm(bp ~ age, data = df.expand)

summary(df.lm)
summary(df.double_weights.lm)
summary(df.expand.lm)
These three data.frames consist of exactly the same data. However:
In df, there are 200 weighted observations whose weights add up to 1178 (sum(df$weight) == 1178).
In df.double_weights, the weights are simply doubled (sum(df.double_weights$weight) == 2356).
In df.expand, there are 1178 unweighted observations instead of 200 weighted ones.
The coefficients for both summary(df.lm) and summary(df.double_weights.lm) are the same, and so is the significance (which means that, IF THE WEIGHTING WORKS PROPERLY, the absolute size of the weights is irrelevant). EDIT: It seems, however, that the absolute size does matter; see the bottom result.
By contrast, for summary(df.lm) and summary(df.expand.lm), the coefficients are the same but the significance differs.
summary(df.lm)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 165.6545 10.3850 15.951 <2e-16 ***
age -0.2852 0.2132 -1.338 0.183
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 98.84 on 198 degrees of freedom
Multiple R-squared: 0.008956, Adjusted R-squared: 0.003951
F-statistic: 1.789 on 1 and 198 DF, p-value: 0.1825
summary(df.expand.lm)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 165.65446 4.26123 38.88 < 2e-16 ***
age -0.28524 0.08749 -3.26 0.00115 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 28.68 on 1176 degrees of freedom
Multiple R-squared: 0.008956, Adjusted R-squared: 0.008114
F-statistic: 10.63 on 1 and 1176 DF, p-value: 0.001146
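Since the coefficients agree, the entire gap must come from the residual degrees of freedom that summary() uses for the t-tests and the F-test. A quick check (a sketch, assuming the three models fitted above) makes this visible:

all.equal(coef(df.lm), coef(df.expand.lm))
# TRUE: identical point estimates
df.lm$df.residual        # 198  = 200 rows  - 2 coefficients
df.expand.lm$df.residual # 1176 = 1178 rows - 2 coefficients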
According to @IRTFM, the degrees of freedom are not being added up properly, and they provided this code to fix it:
df.lm.aov <- anova(df.lm)

# Replace the residual Df (rows minus other terms minus intercept, here 198)
# with sum(weights) minus the same terms, here 1178 - 1 - 1 = 1176
df.lm.aov$Df[length(df.lm.aov$Df)] <-
  sum(df.lm$weights) -
  sum(df.lm.aov$Df[-length(df.lm.aov$Df)]) - 1

# Recompute the mean squares, F statistic and p-value from the corrected Df
df.lm.aov$`Mean Sq` <- df.lm.aov$`Sum Sq` / df.lm.aov$Df
df.lm.aov$`F value`[1] <- df.lm.aov$`Mean Sq`[1] / df.lm.aov$`Mean Sq`[2]
df.lm.aov$`Pr(>F)`[1] <- pf(df.lm.aov$`F value`[1], 1,
                            df.lm.aov$Df, lower.tail = FALSE)[2]
df.lm.aov
Analysis of Variance Table
Response: bp
Df Sum Sq Mean Sq F value Pr(>F)
age 1 8741 8740.5 10.628 0.001146 **
Residuals 1176 967146 822.4
Now, almost 8 years later, this problem apparently still persists. (Doesn't this mean that almost all research that used weighted variables in combination with lm in R has reported significance values that are too low?) More practically, the problem I have is that I hardly understand what IRTFM is doing, or how it relates to multiple regression analysis (or even to other functions that use lm under the hood).
QUESTION: Is there a more general way to solve this issue that can be applied to multiple regression?
EDIT: If we run IRTFM's solution on df.double_weights.lm, we get a different result, so apparently the absolute size of the weights DOES matter.
Analysis of Variance Table
Response: bp
Df Sum Sq Mean Sq F value Pr(>F)
age 1 17481 17481.0 21.274 4.194e-06 ***
Residuals 2354 1934293 821.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
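Here is a sketch of where the difference comes from (assuming IRTFM's patch as written above): the patched residual Df is derived directly from sum(weights), so doubling every weight doubles the implied number of observations and shrinks the p-value accordingly.

sum(df$weight) - 2                # 1176: matches the expanded data
sum(df.double_weights$weight) - 2 # 2354: pretends each row was observed twice as often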
Answer 1:
If I understand your question correctly, what you have in your weights column is often called "frequency weights". They are used to save space in your dataset by indicating how many observations you have for each combination of covariates.
To estimate a model with an "aggregated" dataset and obtain correct standard errors, all you need to do is correct the number of degrees of freedom in your lm model.
The correct number of degrees of freedom is the total number of observations minus the number of parameters in your model. This can be calculated by taking the sum of your weights variable, or by looking at the total number of observations in the "full" data, and subtracting the number of parameters estimated (i.e., coefficients).
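For instance, with the data from the question (a sketch, assuming the asker's df, df.expand, and df.expand.lm are in scope), both routes give the same number:

sum(df$weight) - 2   # 1176: sum of the frequency weights minus 2 coefficients
nrow(df.expand) - 2  # 1176: same thing, counted from the "full" data
# 1176 is exactly the residual df that summary(df.expand.lm) reports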
Here's a simpler example, which I think makes the point clearer:
library(dplyr)
library(modelsummary)

set.seed(1024)

# individual (true) dataset
x <- round(rnorm(1e5))
y <- round(x + x^2 + rnorm(1e5))
ind <- data.frame(x, y)

# aggregated dataset: one row per (x, y) combination, with its frequency
agg <- ind %>%
  group_by(x, y) %>%
  summarize(freq = n())

models <- list(
  "True" = lm(y ~ x, data = ind),
  "Aggregated" = lm(y ~ x, data = agg),
  "Aggregated & W" = lm(y ~ x, data = agg, weights = freq),
  "Aggregated & W & DF" = lm(y ~ x, data = agg, weights = freq)
)
Now we want to correct the number of degrees of freedom of the last model in our list. We do this by taking the sum of our freq column. We could also use nrow(ind), since those are identical:
# correct degrees of freedom
models[[4]]$df.residual <- sum(agg$freq) - length(coef(models[[4]]))
Finally, we summarize all four models using the modelsummary package. Notice that the first and last models are exactly the same, even though the first was estimated using the full individual dataset and the last was estimated using the aggregated data:
modelsummary(models, fmt=5)
|             | True        | Aggregated | Aggregated & W | Aggregated & W & DF |
|-------------|-------------|------------|----------------|---------------------|
| (Intercept) | 1.08446     | 5.51391    | 1.08446        | 1.08446             |
|             | (0.00580)   | (0.71710)  | (0.22402)      | (0.00580)           |
| x           | 1.00898     | 0.91001    | 1.00898        | 1.00898             |
|             | (0.00558)   | (0.30240)  | (0.21564)      | (0.00558)           |
| Num.Obs.    | 1e+05       | 69         | 69             | 69                  |
| R2          | 0.246       | 0.119      | 0.246          | 0.246               |
| R2 Adj.     | 0.246       | 0.106      | 0.235          | 0.999               |
| AIC         | 405058.1    | 446.0      | 474.1          | 474.1               |
| BIC         | 405086.7    | 452.7      | 480.8          | 480.8               |
| Log.Lik.    | -202526.074 | -219.977   | -234.046       | -234.046            |
| F           | 32676.664   | 9.056      | 21.894         | 32676.664           |
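As for the "more general way" asked about in the question: this one-line correction applies to any lm() fit with frequency weights, including multiple regressions, because it depends only on the sum of the weights and the number of estimated coefficients. A hypothetical wrapper (the name fix_freq_df and the helper itself are mine, not part of the original answer) could look like this:

fix_freq_df <- function(fit) {
  # Assumes fit came from lm(..., weights = <frequency weights>)
  stopifnot(!is.null(fit$weights))
  # residual df = implied number of observations - number of coefficients
  fit$df.residual <- sum(fit$weights) - length(coef(fit))
  fit
}

summary() computes the residual variance, standard errors, t-tests, and F-test from df.residual, so summary(fix_freq_df(df.lm)) should reproduce the coefficient table of summary(df.expand.lm).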
Source: https://stackoverflow.com/questions/65458098/correcting-dfs-when-using-sample-weights-with-lm