How to clean and re-code check-all-that-apply responses in R survey data?

Question

I've got survey data with some multiple-response questions like this:

HS18 Why is it difficult to get medical care in South Africa? (Select all that apply)

1   Too expensive
2   No transportation to the hospital/clinic
3   Hospital/clinic is too far away
4   Hospital/clinic staff do not speak my language
5   Hospital/clinic staff do not like foreigners
6   Wait time too long
7   Cannot take time off of work
8   None of these. I have no problem accessing medical care

Multiple responses were entered with commas and are recorded as separate factor levels, for example:

unique(HS18)
 [1] 888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3
[13] 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4
[25] 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5
30 Levels: 0 1 1,2,3 1,4 1,4,5 1,4,6,7 1,6 2 2,3 2,6 3 3,4 3,5 3,6 4 4,5 4,5,6 4,5,6,7 4,6 4,8 ... 999

This is as much a data-cleaning protocol question as an R question. I'm doing the cleaning but not the analysis, so everything needs to be transparent and user-friendly when I pass it back, and the PI doesn't use R. Basically, I'd like to split the multiple responses into separate levels and rename them while keeping them together as a single observation. I'm not sure how to do this, or even whether it's the right approach.

How do you generally deal with this issue? Is there an elegant way to process this for analysis in Stata (simple descriptives, regressions, odds ratios)?

Thanks everyone!!!

Answer 1:

My best thought for analyzing multi-select questions like this is to convert the possible answers into indicator variables: take all of your possible answers (1 to 8 in this example) and create data columns named HS18.1, HS18.2, etc. (You can optionally include something more in the column name, but that's completely between you and the PI.)

Your sample data here looks like it includes values that are not valid: 0, 888, and 999 are not listed in the options. It's possible/likely that these are DK/NR (don't know / no response) codes, but I can't be certain. As such:

Your data cleaning should take care of these anomalies before the step of converting the comma-separated lists (which may contain zero or more valid answers) into indicator variables.
My code below simply ignores these values, so those responses are dropped from the indicator columns and you will lose data. This is obviously not "A Good Thing™" in the long run. More robust checks are warranted (and not difficult). (I've added an HS18.other column to flag rows where something was lost.)

The code:

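## Example data: the unique response strings from the question
ss <- '888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5'
## Split on spaces into responses, then on commas into individual answer codes
dat <- lapply(strsplit(ss, ' '), strsplit, ',')[[1]]
## Valid answer codes
lvls <- as.character(1:8)
## lvls <- sort(unique(unlist(dat))) # alternative method
## One logical column per answer code: TRUE if the response includes it
ret <- structure(lapply(lvls, function(lvl) sapply(dat, function(xx) lvl %in% xx)),
                 .Names = paste0('HS18.', lvls),
                 row.names = c(NA, -length(dat)), class = 'data.frame')
## Flag responses containing any code outside lvls (here 0, 888, 999)
ret$HS18.other <- sapply(dat, function(xx) !all(xx %in% lvls))
ret <- 1 * ret ## convert from TRUE/FALSE to 1/0
head(ret)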
##   HS18.1 HS18.2 HS18.3 HS18.4 HS18.5 HS18.6 HS18.7 HS18.8 HS18.other
## 1      0      0      0      0      0      0      0      0          1
## 2      1      0      0      0      0      0      0      0          0
## 3      0      0      0      0      0      1      0      0          0
## 4      0      0      0      1      0      0      0      0          0
## 5      0      0      0      0      1      0      0      0          0
## 6      0      0      0      0      0      0      0      1          0

The resulting data.frame can be combined with whatever other data you have using cbind() (or converted to a matrix with as.matrix()).
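As a rough sketch of that hand-off, the same indicator idea can be applied to a factor column inside a data frame and the result written to a CSV file that Stata can import. The `survey` data frame, its `id` column, the underscore column names, and the output file name are all made up for illustration. Underscores are used because Stata variable names cannot contain dots.

```r
# Hypothetical example data: respondent IDs plus the raw HS18 factor.
survey <- data.frame(id = 1:5,
                     HS18 = factor(c("1", "3,5", "888", "4,5,6", "8")))

lvls <- as.character(1:8)

# strsplit() needs character input, so convert the factor first.
answers <- strsplit(as.character(survey$HS18), ",", fixed = TRUE)

# Build a matrix with one 0/1 column per valid answer code.
ind <- sapply(lvls, function(lvl) {
  as.integer(sapply(answers, function(xx) lvl %in% xx))
})
colnames(ind) <- paste0("HS18_", lvls)

# Keep the original columns next to the new indicators.
cleaned <- cbind(survey, ind)

# Write to a temporary location; a real project would use its own path.
out_file <- file.path(tempdir(), "survey_cleaned.csv")
write.csv(cleaned, out_file, row.names = FALSE)
```

Keeping the raw HS18 column alongside the indicators lets the analyst trace every recoded value back to what was originally entered.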

(I use 1 and 0 instead of TRUE and FALSE because you said the PI will not be using R; this can easily be changed to a character string or something that makes more sense to them.)
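The more robust checks mentioned above could start with something like the following sketch, which lists every code that is not one of the eight listed options and flags responses that combine option 8 ("None of these") with another answer. Whether 0, 888, and 999 really are DK/NR codes, and how contradictory responses should be resolved, are decisions for the PI; the code only surfaces them. The variable names `unexpected` and `contradictory` are illustrative.

```r
ss <- '888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5'

# Split into responses, then into individual answer codes.
dat <- strsplit(strsplit(ss, ' ', fixed = TRUE)[[1]], ',', fixed = TRUE)
lvls <- as.character(1:8)

# Codes that appear in the data but are not listed answer options.
all_codes <- unlist(dat)
unexpected <- setdiff(all_codes, lvls)
print(sort(unexpected))

# How often each unexpected code occurs.
print(table(all_codes[!all_codes %in% lvls]))

# Responses that pair "8" (none of these) with at least one other option.
contradictory <- sapply(dat, function(xx) "8" %in% xx && length(xx) > 1)
print(which(contradictory))
```

Reporting these cases back, rather than silently recoding them, keeps the cleaning step transparent for whoever runs the analysis.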

Source: https://stackoverflow.com/questions/27489277/how-to-clean-and-re-code-check-all-that-apply-responses-in-r-survey-data

Tags

survey

data-cleaning