I am not a survey methodologist or demographer, but am an avid fan of Thomas Lumley's R survey package. I've been working with a relatively large complex survey data set, the …
Been a while, but closing the loop on this. As Dr. Lumley mentions in the recent comment above, Charco Hui resurrected the experimental sqlsurvey package as "svydb", which I've found to be a great tool for working with very large survey data sets in R. See a related post here: How to get svydb R package for large survey data sets to return standard errors
for huge data sets, linearized designs (`svydesign`) are much slower than replication designs (`svrepdesign`). review the weighting functions within `survey::as.svrepdesign` and use one of them to directly make a replication design. you cannot use linearization for this task, and you are likely better off not even using `as.svrepdesign` but instead using the functions within it.
for one example feeding `cluster=`, `strata=`, and `fpc=` directly into a replicate-weighted design, see https://github.com/ajdamico/asdfree/blob/master/Censo%20Demografico/download%20and%20import.R#L405-L429 (a minimal sketch of the same idea follows below)
note that you can also view minute-by-minute speed tests (with timestamps for each event) here: http://monetdb.cwi.nl/testweb/web/eanthony/
also note that the `replicates=` argument is nearly 100% responsible for how quickly the design runs. so perhaps make two designs: one for coefficients (with just a couple of replicates) and another for SEs (with as many replicates as you can tolerate). run your coefficient estimates interactively and refine which numbers you need during the day, then leave the bigger processes that require SE calculations running overnight