Methods in R for large complex survey data sets?

Backend · open · 2 answers · 1148 views
面向向阳花 · asked 2021-02-09 05:44

I am not a survey methodologist or demographer, but am an avid fan of Thomas Lumley's R survey package. I've been working with a relatively large complex survey data set, the …

2 Answers
  • 2021-02-09 06:10

    Been a while, but closing the loop on this. As Dr. Lumley notes in the recent comment above, Charco Hui resurrected the experimental sqlsurvey package as "svydb", which I've found to be a great tool for working with very large survey data sets in R. See a related post here: How to get svydb R package for large survey data sets to return standard errors

  • 2021-02-09 06:12

    For huge data sets, linearized designs (svydesign) are much slower than replication designs (svrepdesign). Review the replicate-weight functions called inside survey::as.svrepdesign and use one of them to construct a replication design directly; linearization cannot handle this task, and you are likely better off skipping as.svrepdesign entirely in favor of the functions it calls internally.
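
    A minimal sketch of that approach, assuming a data frame dat with hypothetical column names strata_id, psu_id, fpc_col, and pweight (none of these come from the original question), might look like this:

        library(survey)

        # generate bootstrap replicate weights straight from the design variables,
        # skipping svydesign() / as.svrepdesign() entirely
        bw <- bootweights(
            strata     = dat$strata_id ,
            psu        = dat$psu_id ,
            replicates = 80 ,            # fewer replicates run faster; see note below
            fpc        = dat$fpc_col
        )

        # assemble the replicate-weighted design object in one step
        des <- svrepdesign(
            weights          = ~ pweight ,
            repweights       = bw$repweights ,
            type             = "bootstrap" ,
            combined.weights = FALSE ,
            scale            = bw$scale ,
            rscales          = bw$rscales ,
            data             = dat
        )

        svymean( ~ some_variable , des )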

    For one example that feeds cluster=, strata=, and fpc= directly into a replicate-weighted design, see

    https://github.com/ajdamico/asdfree/blob/master/Censo%20Demografico/download%20and%20import.R#L405-L429

    Note that you can also view minute-by-minute speed tests (with timestamps for each event) here: http://monetdb.cwi.nl/testweb/web/eanthony/

    Also note that the replicates= argument is almost entirely responsible for how fast the design runs. So perhaps build two designs: one for coefficients (with just a couple of replicates) and another for standard errors (with as many replicates as you can tolerate). Run your coefficient estimates interactively and refine which numbers you need during the day, then leave the bigger processes that require SE calculations running overnight, as in the sketch below.
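
    A hedged sketch of that two-design workflow, reusing the hypothetical dat, strata_id, psu_id, and pweight names from the earlier example (none of them appear in the original answer):

        library(survey)

        # helper to build a bootstrap replicate design with a chosen replicate count
        make_design <- function( n_reps ){
            bw <- bootweights( dat$strata_id , dat$psu_id , replicates = n_reps )
            svrepdesign(
                weights = ~ pweight , repweights = bw$repweights ,
                type = "bootstrap" , combined.weights = FALSE ,
                scale = bw$scale , rscales = bw$rscales , data = dat
            )
        }

        des_fast <- make_design( 5 )     # a handful of replicates: fast, coefficients only
        des_full <- make_design( 500 )   # many replicates: slow, trustworthy standard errors

        # point estimates are identical across the two designs; only the SEs differ
        coef( svymean( ~ some_variable , des_fast ) )   # quick, run interactively
        svymean( ~ some_variable , des_full )           # heavy, leave running overnight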
