Why do attempts to filter/subset a raked survey design object fail?

Submitted by 故事扮演 on 2020-01-03 02:47:05

Question


I'm trying to filter rows in a survey design object to exclude a particular subset of data. In the example below, which consists of survey data from several schools, I'm trying to exclude data from schools in Alameda County, California.

Surprisingly, when the survey design object includes weights created by raking, attempts to filter or subset the data fail. I think this is a bug, but I'm not sure. Why does the presence of raked weights alter the result of attempting to filter or subset the data?

library(survey)

data(api)

# Declare basic clustered design ----
cluster_design <- svydesign(data = apiclus1,
                            id = ~dnum,
                            weights = ~pw,
                            fpc = ~fpc)

# Add raking weights for school type ----
pop.types <- data.frame(stype=c("E","H","M"), Freq=c(4421,755,1018))
pop.schwide <- data.frame(sch.wide=c("No","Yes"), Freq=c(1072,5122))

raked_design <- rake(cluster_design,
                     sample.margins = list(~stype,~sch.wide),
                     population.margins = list(pop.types, pop.schwide))

# Filter the two different design objects ----
subset_from_raked_design <- subset(raked_design, cname != "Alameda")

subset_from_cluster_design <- subset(cluster_design, cname != "Alameda")

# Count the number of rows in each subset
# Note that, surprisingly, they differ
nrow(subset_from_raked_design)
#> [1] 183
nrow(subset_from_cluster_design)
#> [1] 172

This issue occurs no matter how you attempt to subset the data. For example, here's what happens when you try to use row-indexing to subset only the first 10 rows:

nrow(cluster_design[1:10, ])
#> [1] 10
nrow(raked_design[1:10, ])
#> [1] 183

Answer 1:


This behavior is intentional: the survey package is trying to keep you from making a statistical mistake.

For especially complex designs involving calibration/post-stratification/raking, estimates for sub-populations can't simply be computed by filtering away data from outside of the sub-population of interest; that approach produces misleading standard errors and confidence intervals.

So to keep you from running into this statistical issue, the survey package doesn't let you completely remove records outside of your subset of interest. Instead, it essentially takes note of which rows you want to ignore and then adjusts the probability weights to be effectively zero.
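As a minimal sketch of what this means in practice, the code below rebuilds the raked design from the question and estimates a mean for the schools outside Alameda County. The outcome variable api00 and the object names non_alameda and domain_mean are chosen only for illustration. The subsetted object still has all 183 rows, but the estimate refers only to the sub-population.

```r
library(survey)
data(api)

# Rebuild the designs from the question
cluster_design <- svydesign(data = apiclus1,
                            id = ~dnum,
                            weights = ~pw,
                            fpc = ~fpc)

pop.types   <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))
pop.schwide <- data.frame(sch.wide = c("No", "Yes"), Freq = c(1072, 5122))

raked_design <- rake(cluster_design,
                     sample.margins = list(~stype, ~sch.wide),
                     population.margins = list(pop.types, pop.schwide))

# subset() keeps every row but gives the Alameda rows zero weight
non_alameda <- subset(raked_design, cname != "Alameda")

# Domain estimate: mean of api00 (illustrative choice) outside Alameda County
domain_mean <- svymean(~api00, non_alameda)
domain_mean
SE(domain_mean)
```

Because the excluded rows stay in the object, the variance calculation can still account for the full raked sample rather than treating the subset as if it were the whole survey.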

In the example from this question, the rows that were meant to be filtered out have a value of Inf in subset_from_raked_design$prob. Since a row's weight is the reciprocal of its probability, those rows are effectively assigned a weight of zero. That is why nrow() still reports 183 rows for the raked design.

subset_from_raked_design$prob[1:12]
#> Inf Inf Inf Inf Inf Inf
#> Inf Inf Inf Inf Inf 
#> 0.01986881 ....

raked_design$prob[1:12]
#> 0.01986881 0.03347789 0.03347789 0.03347789 0.03347789 0.03347789
#> 0.03347789 0.03347789 0.03347789 0.02717969 0.02717969
#> 0.01986881 ....
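If you need to know how many rows actually contribute to estimates after subsetting, one option is to count the rows whose probability is still finite (equivalently, whose weight is nonzero). This is a sketch that relies on the prob and variables components of the design object shown above; kept and kept_rows are hypothetical names introduced here.

```r
library(survey)
data(api)

# Rebuild the objects from the question
cluster_design <- svydesign(data = apiclus1,
                            id = ~dnum,
                            weights = ~pw,
                            fpc = ~fpc)

pop.types   <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))
pop.schwide <- data.frame(sch.wide = c("No", "Yes"), Freq = c(1072, 5122))

raked_design <- rake(cluster_design,
                     sample.margins = list(~stype, ~sch.wide),
                     population.margins = list(pop.types, pop.schwide))

subset_from_raked_design <- subset(raked_design, cname != "Alameda")

# Rows that were kept have a finite sampling probability
kept <- is.finite(subset_from_raked_design$prob)
sum(kept)

# Equivalent check using the weights (1 / prob), which are zero for dropped rows
sum(weights(subset_from_raked_design) > 0)

# The underlying data for the kept rows, for inspection only
kept_rows <- subset_from_raked_design$variables[kept, ]
```

The count from sum(kept) should match the number of rows in subset_from_cluster_design. Use kept_rows only for inspecting the data; estimates should still be computed from the subsetted design object itself.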


Source: https://stackoverflow.com/questions/55384157/why-do-attempts-to-filter-subset-a-raked-survey-design-object-fail
