Question
This is a follow-up on R's survey package interpolation handling for median estimates, which has not attracted much feedback. I have managed to boil the issue down to the following:
I'm using R's survey package to estimate the median for a set of data. The data to replicate this issue is available as dput text here.
The design I'm using is of class svyrep.design, defined as follows:
design <- svydesign(id = ~id_directorio, strata = ~estrato, weights = ~f_pers, check.strata = TRUE, data = datos)
set.seed(234262762)
repdesign <- as.svrepdesign(design, type = "subbootstrap", replicates=20)
options(survey.lonely.psu="adjust")
A svyquantile inside a svyby does the job as expected:
svyby(formula = ~ing_t_p, by = ~CL_GRUPO_OCU_08, repdesign, svyquantile, quantiles=c(0.5), method="constant",
f = 0.5, ties = "rounded", vartype=c("ci", "se"), ci=TRUE, na.rm=FALSE)
CL_GRUPO_OCU_08 V1 se cv cv%
ISCO08_1 ISCO08_1 1002513.04 269630.31 0.26895442 26.895442
ISCO08_2 ISCO08_2 744505.53 68827.09 0.09244672 9.244672
ISCO08_3 ISCO08_3 489789.32 42839.16 0.08746447 8.746447
ISCO08_4 ISCO08_4 449806.52 69526.34 0.15456944 15.456944
ISCO08_5 ISCO08_5 286705.37 13392.01 0.04671002 4.671002
ISCO08_6 ISCO08_6 449613.04 NaN NaN NaN
ISCO08_7 ISCO08_7 93032.83 109534.62 1.17737600 117.737600
ISCO08_8 ISCO08_8 564514.15 437752.31 0.77544967 77.544967
ISCO08_9 ISCO08_9 293712.84 24497.97 0.08340790 8.340790
However, look at the estimate for category ISCO08_6. It does not give the expected median. Instead, it shows the smaller of the group's two values:
datos %>% filter(CL_GRUPO_OCU_08 == "ISCO08_6")
# A tibble: 2 x 5
id_directorio estrato f_pers ing_t_p CL_GRUPO_OCU_08
<dbl> <dbl> <dbl> <dbl> <chr>
1 24568 2021 98.7 449613. ISCO08_6
2 24568 2021 98.7 551525. ISCO08_6
The f argument should deal with this (it controls the interpolation), and indeed it does for all the other cases, but it has no effect on the ISCO08_6 row. I have found that this issue affects estimates where there are only 2 or 4 data points.
So how do I get the median using this method when the number of data points is small?
Answer 1:
Ok, it looks as though you need to ask for a quantile very slightly larger than 0.5 to get what you want. I will look into whether this is a bug or whether it was necessary to get agreement with some other system like SUDAAN. I will either fix or document this for the next version (or perhaps add yet another option). Quantiles are the worst.
Here are examples just using svyquantile()
> svyquantile(~ing_t_p, quantile=0.5000001, design=dd, f=0.5, ties="rounded", method="constant")
0.5
ing_t_p 500569.2
> svyquantile(~ing_t_p, quantile=0.5000001, design=dd, f=0, ties="rounded", method="constant")
0.5
ing_t_p 449613
> svyquantile(~ing_t_p, quantile=0.5000001, design=dd, f=1, ties="rounded", method="constant")
0.5
ing_t_p 551525.3
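The pattern in these three results can be reproduced with base R's approxfun(), which, as far as I can tell, is where the method and f arguments end up for replicate-weight designs. The following is only a sketch and not the package's internal code: it assumes the two ISCO08_6 observations carry equal weight, so that the cumulative weight shares are 0.5 and 1. The names x, y and step_q are illustrative.

```r
# Two equally weighted incomes from the ISCO08_6 group.
y <- c(449613.04, 551525.3)
# Assumed cumulative weight shares at each observation.
x <- c(0.5, 1)

# Step-function interpolation; f blends the left and right values.
step_q <- function(p, f) approxfun(x, y, method = "constant", f = f)(p)

at_knot   <- step_q(0.5, f = 0.5)        # exactly on a data point: f is not applied
mid       <- step_q(0.5000001, f = 0.5)  # inside the interval: (1 - f) * y[1] + f * y[2]
left_end  <- step_q(0.5000001, f = 0)    # left value only
right_end <- step_q(0.5000001, f = 1)    # right value only

print(c(at_knot = at_knot, mid = mid, left_end = left_end, right_end = right_end))
```

Under these assumptions, a quantile of exactly 0.5 lands on the first data point and returns it unchanged, while 0.5000001 falls between the two points, where f takes effect. That would also be consistent with the problem showing up for groups of 2 or 4 equally weighted observations.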
And here using svyby(). Note that you have to use formula= in the first argument, otherwise the f=0.5 argument is interpreted by R as formula=0.5:
> svyby(formula=~ing_t_p, by = ~CL_GRUPO_OCU_08, design, svyquantile, quantiles=c(0.5000001),f=0.5, method="constant", vartype=c("ci", "se"), ci=TRUE, na.rm.all=FALSE)
CL_GRUPO_OCU_08 ing_t_p se ci_l ci_u
ISCO08_1 ISCO08_1 1002513.04 254418.31 550769.11 1629454.6
ISCO08_2 ISCO08_2 749355.06 62294.16 649720.53 899613.0
ISCO08_3 ISCO08_3 489789.32 32140.54 409819.42 538808.8
ISCO08_4 ISCO08_4 449806.52 74549.55 349699.00 650000.0
ISCO08_5 ISCO08_5 286705.37 15349.64 240706.43 301766.1
ISCO08_6 ISCO08_6 500569.18 NaN NaN NaN
ISCO08_7 ISCO08_7 93032.83 108653.60 55000.00 503500.0
ISCO08_8 ISCO08_8 564514.15 429428.77 80470.95 2061000.0
ISCO08_9 ISCO08_9 293712.84 18830.76 245000.00 320539.5
There were 12 warnings (use warnings() to see them)
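The note about formula= comes down to R's partial argument matching, which can be seen without the survey package at all. The sketch below uses toy_by, a hypothetical stand-in that only copies the leading argument names of svyby(); it does nothing else that svyby() does.

```r
# Hypothetical stand-in with the same leading argument names as svyby().
toy_by <- function(formula, by, design, FUN, ...) {
  list(formula = formula, extra = list(...))
}

# Unnamed first argument: f = 0.5 partially matches `formula`,
# and ~ing_t_p is pushed to the next free position (`by`).
unnamed <- toy_by(~ing_t_p, f = 0.5)

# Named formula=: f no longer matches any formal and is forwarded through `...`.
named <- toy_by(formula = ~ing_t_p, f = 0.5)

print(unnamed$formula)      # 0.5
print(names(named$extra))   # "f"
```

Because matching is case-sensitive, f does not match FUN, so once formula is supplied by name the f argument travels through `...` to the function being applied.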
Source: https://stackoverflow.com/questions/62452042/f-argument-of-survey-package-does-not-give-expected-output