I am not a survey methodologist or demographer, but am an avid fan of Thomas Lumley's R survey package. I've been working with a relatively large complex survey data set, the …
Been a while, but closing the loop on this. As Dr. Lumley mentions in the recent comment above, Charco Hui resurrected the experimental sqlsurvey package as "svydb", which I've found to be a great tool for working with very large survey data sets in R. See a related post here: How to get svydb R package for large survey data sets to return standard errors
for huge data sets, linearized designs (`svydesign`) are much slower than replication designs (`svrepdesign`). review the weighting functions within `survey::as.svrepdesign` and use one of them to directly make a replication design. you cannot use linearization for this task, and you are likely better off not even using `as.svrepdesign` but instead using the functions within it.
for one example feeding `cluster=`, `strata=`, and `fpc=` directly into a replicate-weighted design, see https://github.com/ajdamico/asdfree/blob/master/Censo%20Demografico/download%20and%20import.R#L405-L429 (a minimal sketch of the same idea follows below)
note that you can also view minute-by-minute speed tests (with timestamps for each event) here: http://monetdb.cwi.nl/testweb/web/eanthony/
also note that the `replicates=` argument is nearly 100% responsible for how quickly the design runs. so perhaps make two designs: one for coefficients (with just a couple of replicates) and another for SEs (with as many replicates as you can tolerate). run your coefficient estimates interactively and refine which numbers you need during the day, then leave the bigger processes that require SE calculations running overnight