问题
I want to do a fairly common analysis of survey questions in R, but am stuck in the middle.
Imagine a survey where you are asked to answer which brands do you associate with certain features (e.g. "brands" could be PlayStation, XBox..., and features could be "speed", "graphics"... where each brand can be checked on several features aka mulit-select). E.g. sth. like this here: https://www.harvestyourdata.com/fileadmin/images/question-type-screenshots/Grid-multi-select.jpg
You often refer to these questions as multi-select grid or matrix questions.
Anyway, from a data perspective, this kind of data is usually stored in wide format where each row*column combination is one variable, which is 0/1 coded (0 if the survey participant doesn't check the box, 1 otherwise).
Assuming we have 5 brands and 10 items, we would have 50 variables in total, ideally following a nice, structured naming scheme, e.g. item1_column1, item2_column1, item3_column1, [...], item1_column2 and so on.
Now, I want to analyze (frequency table) all of these variables in one iteration. I've already found the cross.multi.table function in the questionr package. However, it only allows to analyze all items based on on single factor. What I need instead is to allow for several columns at the same time.
Any ideas? MIght be I'm missing a function from another package or this can easily be done with tidyverse or even with the cross.multi.table function?
Using this data as test input:
dat = data.frame(item1_column1 = c(0,1,1,1),
item2_column1 = c(1,1,1,0),
item3_column1 = c(0,0,1,1),
item1_column2 = c(1,1,1,0),
item2_column2 = c(0,1,1,1),
item3_column2 = c(1,0,1,1),
item1_column3 = c(0,1,1,0),
item2_column3 = c(1,1,1,1),
item3_column3 = c(0,0,1,0))
I'd expect this output:
column1 column2 column3
item1 3 3 2
item2 3 3 4
item3 2 3 1
or ideally as proportions/percentages:
column1 column2 column3
item1 75% 75% 50%
item2 75% 75% 100%
item3 50% 75% 25%
回答1:
One way could be to get data into long format using gather
, separate
columns based on _
, group_by
item
and column
and calculate the ratio of value
column and spread
the data to wide format.
library(dplyr)
library(tidyr)
dat %>%
gather(key, value) %>%
separate(key, into = c("item", "column"), sep = "_") %>%
group_by(item, column) %>%
summarise(prop = mean(value) * 100) %>%
spread(column, prop)
# item column1 column2 column3
# <chr> <dbl> <dbl> <dbl>
#1 item1 75 75 50
#2 item2 75 75 100
#3 item3 50 75 25
A bit shorter (Thanks to @M-M)
dat %>%
summarise_all(~mean(.) * 100) %>%
gather(key, value) %>%
separate(key, into = c("item", "column"), sep = "_") %>%
spread(column, value)
回答2:
What I do here, by using data.table
package, is summarizing each column, converting data to long format, breaking a column to two (item
and column
), and finally converting to wide format. Look below;
library(data.table)
dcast(setDT(melt(setDT(dat)[,100*colMeans(.SD),]),keep.rownames = T)[,
c("item", "column") := tstrsplit(rn, "_", fixed=TRUE)],
item ~ column, value.var = "value")
#> item column1 column2 column3
#> 1: item1 75 75 50
#> 2: item2 75 75 100
#> 3: item3 50 75 25
回答3:
We can do this in base R
, by creating a two column data.frame with the column names replicated, cbind
with the unlist
ed values, and use xtabs
to get the sum
while pivoting to 'wide' format
out <- xtabs(val ~ ., cbind(read.table(text = names(dat)[col(dat)],
sep="_", header = FALSE), val = unlist(dat, use.names = FALSE)))
out
# V2
#V1 column1 column2 column3
# item1 3 3 2
# item2 3 3 4
# item3 2 3 1
Or as @GKi mentioned (a compact version would be) to split the column names by _
, create a data.frame
with that along with colSums
(or colMeans
- for percentage) and use xtabs
for pivoting
xtabs(n ~ ., data.frame(do.call("rbind",
strsplit(colnames(dat), "_")), n=colSums(dat)))
Or to get the percentage
xtabs(val ~ ., aggregate(val ~ ., cbind(read.table(text = names(dat)[col(dat)],
sep="_", header = FALSE), val = unlist(dat, use.names = FALSE)), mean)) * 100
# V2
#V1 column1 column2 column3
# item1 75 75 50
# item2 75 75 100
# item3 50 75 25
Or inspired from @GKi, using enframe
library(dplyr)
library(tidyr)
library(tibble)
enframe(colSums(dat)) %>%
separate(name, into = c('name1', 'name2')) %>%
spread(name2, value)
# A tibble: 3 x 4
# name1 column1 column2 column3
# <chr> <dbl> <dbl> <dbl>
#1 item1 3 3 2
#2 item2 3 3 4
#3 item3 2 3 1
To get the percentage, just change the first line of code to
enframe(100 *colMeans(dat))
来源:https://stackoverflow.com/questions/57166651/r-frequency-table-for-multiselect-survey-question-across-several-columns