R Frequency table for multiselect survey question across several columns

问题

I want to do a fairly common analysis of survey questions in R, but am stuck in the middle.

Imagine a survey where you are asked to answer which brands do you associate with certain features (e.g. "brands" could be PlayStation, XBox..., and features could be "speed", "graphics"... where each brand can be checked on several features aka mulit-select). E.g. sth. like this here: https://www.harvestyourdata.com/fileadmin/images/question-type-screenshots/Grid-multi-select.jpg

You often refer to these questions as multi-select grid or matrix questions.

Anyway, from a data perspective, this kind of data is usually stored in wide format where each row*column combination is one variable, which is 0/1 coded (0 if the survey participant doesn't check the box, 1 otherwise).

Assuming we have 5 brands and 10 items, we would have 50 variables in total, ideally following a nice, structured naming scheme, e.g. item1_column1, item2_column1, item3_column1, [...], item1_column2 and so on.

Now, I want to analyze (frequency table) all of these variables in one iteration. I've already found the cross.multi.table function in the questionr package. However, it only allows to analyze all items based on on single factor. What I need instead is to allow for several columns at the same time.

Any ideas? MIght be I'm missing a function from another package or this can easily be done with tidyverse or even with the cross.multi.table function?

Using this data as test input:

dat = data.frame(item1_column1 = c(0,1,1,1),
                 item2_column1 = c(1,1,1,0),
                 item3_column1 = c(0,0,1,1),
                 item1_column2 = c(1,1,1,0),
                 item2_column2 = c(0,1,1,1),
                 item3_column2 = c(1,0,1,1),
                 item1_column3 = c(0,1,1,0),
                 item2_column3 = c(1,1,1,1),
                 item3_column3 = c(0,0,1,0))

I'd expect this output:

         column1    column2    column3
item1    3          3          2
item2    3          3          4
item3    2          3          1

or ideally as proportions/percentages:

         column1    column2    column3
item1    75%        75%        50%
item2    75%        75%        100%
item3    50%        75%        25%

回答1:

One way could be to get data into long format using gather, separate columns based on _, group_by item and column and calculate the ratio of value column and spread the data to wide format.

library(dplyr)
library(tidyr)

dat %>%
  gather(key, value) %>%
  separate(key, into = c("item", "column"), sep = "_") %>%
  group_by(item, column) %>%
  summarise(prop = mean(value) * 100) %>%
  spread(column, prop)

#  item  column1 column2 column3
#  <chr>   <dbl>   <dbl>   <dbl>
#1 item1      75      75      50
#2 item2      75      75     100
#3 item3      50      75      25

A bit shorter (Thanks to @M-M)

dat %>%
  summarise_all(~mean(.) * 100) %>%
  gather(key, value) %>%
  separate(key, into = c("item", "column"), sep = "_") %>%
  spread(column, value)

回答2:

What I do here, by using data.table package, is summarizing each column, converting data to long format, breaking a column to two (item and column), and finally converting to wide format. Look below;

library(data.table)

dcast(setDT(melt(setDT(dat)[,100*colMeans(.SD),]),keep.rownames = T)[,
             c("item", "column") := tstrsplit(rn, "_", fixed=TRUE)],
                item ~ column, value.var = "value")

#>     item column1 column2 column3
#> 1: item1      75      75      50
#> 2: item2      75      75     100
#> 3: item3      50      75      25

回答3:

We can do this in base R, by creating a two column data.frame with the column names replicated, cbind with the unlisted values, and use xtabs to get the sum while pivoting to 'wide' format

out <- xtabs(val ~ ., cbind(read.table(text = names(dat)[col(dat)],
    sep="_", header = FALSE), val = unlist(dat, use.names = FALSE)))

out
#    V2
#V1      column1 column2 column3
#  item1       3       3       2
#  item2       3       3       4
#  item3       2       3       1

Or as @GKi mentioned (a compact version would be) to split the column names by _, create a data.frame with that along with colSums (or colMeans - for percentage) and use xtabs for pivoting

xtabs(n ~ ., data.frame(do.call("rbind", 
              strsplit(colnames(dat), "_")), n=colSums(dat)))

Or to get the percentage

xtabs(val ~ ., aggregate(val ~ ., cbind(read.table(text = names(dat)[col(dat)],
   sep="_", header = FALSE), val = unlist(dat, use.names = FALSE)), mean)) * 100
#     V2
#V1      column1 column2 column3
#  item1      75      75      50
#  item2      75      75     100
#  item3      50      75      25

Or inspired from @GKi, using enframe

library(dplyr)
library(tidyr)
library(tibble)
enframe(colSums(dat)) %>% 
   separate(name, into = c('name1', 'name2')) %>% 
   spread(name2, value)
# A tibble: 3 x 4
#  name1 column1 column2 column3
#  <chr>   <dbl>   <dbl>   <dbl>
#1 item1       3       3       2
#2 item2       3       3       4
#3 item3       2       3       1

To get the percentage, just change the first line of code to

enframe(100 *colMeans(dat))

来源：https://stackoverflow.com/questions/57166651/r-frequency-table-for-multiselect-survey-question-across-several-columns

标签

reshape

multi-select

survey