Split a column of concatenated comma-delimited data and recode output as factors

前端未结

关注

 2  791

I am trying to clean up some data that has been incorrectly entered. The question for the variable allows for multiple responses out of five choices, numbered as 1 to 5. The

相关标签:

2条回答

南笙

2020-11-27 08:24

You just need to write a function and use apply. First some dummy data:

##Make sure you're not using factors
dd = data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5", 
                         "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"), 
                     stringsAsFactors=FALSE)

Next, create a function that takes in a row and transforms as necessary

make_row = function(i, ncol=5) {
  ##Could make the default NA if needed
  m = numeric(ncol)
  v = as.numeric(strsplit(i, ",")[[1]])
  m[v] = 1
  return(m)
}

Then use apply and transpose the result

t(apply(dd, 1, make_row))

0 讨论(0)

离开以前

2020-11-27 08:36

A long time later, I finally got around to creating a package ("splitstackshape") that deals with this kind of data in an efficient manner. So, for the convenience of others (and some self-promotion, of course) here's a compact solution.

The relevant function for this problem is cSplit_e.

First, the default settings, which retains the original column and uses NA as the fill:

library(splitstackshape)
cSplit_e(data, "V1")
#           V1 V1_1 V1_2 V1_3 V1_4 V1_5
# 1    1, 2, 3    1    1    1   NA   NA
# 2    1, 2, 4    1    1   NA    1   NA
# 3 2, 3, 4, 5   NA    1    1    1    1
# 4    1, 3, 4    1   NA    1    1   NA
# 5    1, 3, 5    1   NA    1   NA    1
# 6 2, 3, 4, 5   NA    1    1    1    1

Second, with dropping the original column and using 0 as the fill.

cSplit_e(data, "V1", drop = TRUE, fill = 0)
#   V1_1 V1_2 V1_3 V1_4 V1_5
# 1    1    1    1    0    0
# 2    1    1    0    1    0
# 3    0    1    1    1    1
# 4    1    0    1    1    0
# 5    1    0    1    0    1
# 6    0    1    1    1    1

0 讨论(0)