Question
I'm trying to filter rows in a survey design object to exclude a particular subset of data. In the example below, which consists of survey data from several schools, I'm trying to exclude data from schools in Alameda County, California.
Surprisingly, when the survey design object includes weights created by raking, attempts to filter or subset the data fail. I think this is a bug, but I'm not sure. Why does the presence of raked weights alter the result of attempting to filter or subset the data?
library(survey)
data(api)
# Declare basic clustered design ----
cluster_design <- svydesign(data = apiclus1,
                            id = ~dnum,
                            weights = ~pw,
                            fpc = ~fpc)
# Add raking weights for school type ----
pop.types <- data.frame(stype=c("E","H","M"), Freq=c(4421,755,1018))
pop.schwide <- data.frame(sch.wide=c("No","Yes"), Freq=c(1072,5122))
raked_design <- rake(cluster_design,
                     sample.margins = list(~stype, ~sch.wide),
                     population.margins = list(pop.types, pop.schwide))
# Filter the two different design objects ----
subset_from_raked_design <- subset(raked_design, cname != "Alameda")
subset_from_cluster_design <- subset(cluster_design, cname != "Alameda")
# Count number of rows in the subsets
# Note that they surprisingly differ
nrow(subset_from_raked_design)
#> [1] 183
nrow(subset_from_cluster_design)
#> [1] 172
This issue occurs no matter how you attempt to subset the data. For example, here's what happens when you try to use row-indexing to subset only the first 10 rows:
nrow(cluster_design[1:10,])
#> [1] 10
nrow(raked_design[1:10,])
#> [1] 183
Answer 1:
This behavior occurs because the survey package is trying to help you avoid making a statistical mistake.
For especially complex designs involving calibration/post-stratification/raking, estimates for sub-populations can't simply be computed by filtering away data from outside of the sub-population of interest; that approach produces misleading standard errors and confidence intervals.
So to keep you from running into this statistical issue, the survey package doesn't let you completely remove records outside of your subset of interest. Instead, it keeps every row, takes note of which rows you want to ignore, and sets the weights of those rows to effectively zero.
In the example from this question, the rows that were meant to be filtered away have a value of Inf in the subset_from_raked_design$prob object. Because a row's weight is the reciprocal of its sampling probability, this effectively means the corresponding rows in the data are assigned a weight of zero.
subset_from_raked_design$prob[1:12]
#> Inf Inf Inf Inf Inf Inf
#> Inf Inf Inf Inf Inf
#> 0.01986881 ....
raked_design$prob[1:12]
#> 0.01986881 0.03347789 0.03347789 0.03347789 0.03347789 0.03347789
#> 0.03347789 0.03347789 0.03347789 0.02717969 0.02717969
#> 0.01986881 ....
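As a rough check of the explanation above, here is a minimal sketch that rebuilds the raked design from the question and counts the rows that still carry a non-zero weight after subsetting. It assumes that weights() on the design object returns the reciprocal of the prob component, so excluded rows show up as zeros. The helper names n_rows, n_active and n_expected are made up for this illustration.

```r
library(survey)
data(api)

# Rebuild the raked design exactly as in the question
cluster_design <- svydesign(data = apiclus1,
                            id = ~dnum,
                            weights = ~pw,
                            fpc = ~fpc)
pop.types <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))
pop.schwide <- data.frame(sch.wide = c("No", "Yes"), Freq = c(1072, 5122))
raked_design <- rake(cluster_design,
                     sample.margins = list(~stype, ~sch.wide),
                     population.margins = list(pop.types, pop.schwide))

subset_from_raked_design <- subset(raked_design, cname != "Alameda")

# Every row is still present in the design object...
n_rows <- nrow(subset_from_raked_design)

# ...but only rows inside the subset keep a non-zero weight
w <- weights(subset_from_raked_design)
n_active <- sum(w > 0)

# Number of rows in the raw data that satisfy the subset condition
n_expected <- sum(apiclus1$cname != "Alameda")

c(rows = n_rows, active = n_active, expected = n_expected)

# Estimates computed on the subset design only use the non-zero-weight rows,
# while the standard error still reflects the full raked design
svymean(~api00, subset_from_raked_design)
```

If this runs as expected, rows should stay at the 183 reported in the question while active and expected should both match the 172 rows returned by subsetting the unraked design. In other words, nrow() is not a reliable way to count the members of a subset for a raked design; counting the non-zero weights is.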
Source: https://stackoverflow.com/questions/55384157/why-do-attempts-to-filter-subset-a-raked-survey-design-object-fail