Question
I'm reposting the question asked here hoping maybe to get a little more visibility.
This question concerns Lumley's survey package for R, specifically how it handles interpolation in median estimation. I have spent several hours looking into the matter without finding an answer.
I'm using a svyrep design of the following form:
design <- svydesign(id = ~id_directorio, strata = ~estrato, weights = ~f_pers, check.strata = TRUE, data = datos)
options(survey.lonely.psu="remove")
set.seed(234262762)
SB2K_2 = as.svrepdesign(design, type = "subbootstrap", replicates=2000)
When I compute the median with svyquantile inside svyby, I get wrong median estimates for groups where the sample size is small:
svyby(~ing_t_p, by = ~CL_REGION + ~CL_GRUPO_OCU_08, subset(SB2K_2, ocup_ref==1 & CL_REGION == "CHL02" & sexo == 2),
svyquantile, quantiles=c(0.5), method = "constant")
               CL_REGION CL_GRUPO_OCU_08         V1        se
CHL02.ISCO08_1     CHL02        ISCO08_1 1005886.00 409590.92
CHL02.ISCO08_2     CHL02        ISCO08_2  749355.06  44882.23
CHL02.ISCO08_3     CHL02        ISCO08_3  490000.00  14406.91
CHL02.ISCO08_4     CHL02        ISCO08_4  450000.00  92620.61
CHL02.ISCO08_5     CHL02        ISCO08_5  289750.62  16685.00
CHL02.ISCO08_6     CHL02        ISCO08_6  449613.04       NaN  # This is the row with a "wrong" median (V1)
CHL02.ISCO08_7     CHL02        ISCO08_7   95535.84 123539.27
CHL02.ISCO08_8     CHL02        ISCO08_8  599484.05 356666.34
CHL02.ISCO08_9     CHL02        ISCO08_9  299742.02  17933.51
The group whose median is reported as 449613 has only two observations. Instead of returning the midpoint between them, svyquantile returns the smaller value. Both observations have the same weight, so the correct median would be 500569:
datos %>% filter(CL_REGION == "CHL02" & sexo == 2 & CL_GRUPO_OCU_08 == "ISCO08_6") %>% select(ing_t_p, f_pers)
# A tibble: 2 x 2
ing_t_p f_pers
<dbl> <dbl>
1 449613. 98.7
2 551525. 98.7
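The arithmetic behind the expected value can be checked in base R, without the survey package. This is only a sketch: the inputs are the rounded values printed in the tibble above, and the variable names (x, cdf_w, lower, upper, midpoint) are introduced here for illustration. Because the two weights are equal, the weighted empirical CDF reaches exactly 0.5 at the first observation, so every value between the two observations qualifies as a median, and the midpoint is the conventional choice.

```r
# Values and weights as printed (rounded) in the tibble above.
ing_t_p <- c(449613, 551525)
f_pers  <- c(98.7, 98.7)

# Weighted empirical CDF evaluated at each sorted observation.
ord   <- order(ing_t_p)
x     <- ing_t_p[ord]
cdf_w <- cumsum(f_pers[ord]) / sum(f_pers)

# Lower median: smallest value whose weighted CDF reaches 0.5.
lower <- x[which(cdf_w >= 0.5)[1]]
# Upper median: smallest value whose weighted CDF exceeds 0.5.
upper <- x[which(cdf_w > 0.5)[1]]

# Midpoint of the interval of valid medians.
midpoint <- (lower + upper) / 2

print(c(lower = lower, upper = upper, midpoint = midpoint))
```

The lower value is the one svyquantile reports for this group, while the midpoint is the figure I expected.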
I asked Professor Lumley himself, and he kindly pointed me to the f argument of svyquantile, which controls interpolation between data points. In this case, f = 0.5 should give me the midpoint, but it does not work and produces an error:
svyby(~ing_t_p, by = ~CL_REGION + ~CL_GRUPO_OCU_08, subset(SB2K_2, ocup_ref==1 & CL_REGION == "CHL02" & sexo == 2),
svyquantile, quantiles=c(0.5), method = "constant", f = 0.5)
Error in eval(predvars, data, env) : object 'ing_t_p' not found
Why do I get this error?
How can I get the correct median estimates with the survey package when the groups are small?
EDIT:
Boiling the problem down, the same wrong estimate also arises with the plain svydesign object (without the svyrep.design):
svyby(~ing_t_p, ~CL_REGION + ~CL_GRUPO_OCU_08, subset(design, ocup_ref==1 & CL_REGION == "CHL02" & sexo == 2),
svyquantile, quantiles=c(0.5), ci = TRUE)
               CL_REGION CL_GRUPO_OCU_08    ing_t_p        se
CHL02.ISCO08_1     CHL02        ISCO08_1 1005262.68 248216.08
CHL02.ISCO08_2     CHL02        ISCO08_2  749355.06  62219.18
CHL02.ISCO08_3     CHL02        ISCO08_3  489643.22  33507.74
CHL02.ISCO08_4     CHL02        ISCO08_4  449997.64  74549.55
CHL02.ISCO08_5     CHL02        ISCO08_5  284307.34  15408.06
CHL02.ISCO08_6     CHL02        ISCO08_6  449613.04       NaN
CHL02.ISCO08_7     CHL02        ISCO08_7   93033.74 109500.28
CHL02.ISCO08_8     CHL02        ISCO08_8  547251.67 429428.77
CHL02.ISCO08_9     CHL02        ISCO08_9  296445.55  18053.37
Source: https://stackoverflow.com/questions/62306784/rs-survey-package-interpolation-handling-for-median-estimates