问题
I have a data frame with a categorical variable holding lists of strings, with variable length (it is important because otherwise this question would be a duplicate of this or this), e.g.:
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")
df
x y 1 1 A 2 2 A, B 3 3 C 4 4 B, D, C 5 5 E
And the desired form is a dummy variable for each unique string seen anywhere in df$y
, i.e.:
data.frame(x = 1:5, A = c(1,1,0,0,0), B = c(0,1,0,1,0), C = c(0,0,1,1,0), D = c(0,0,0,1,0), E = c(0,0,0,0,1))
x A B C D E 1 1 1 0 0 0 0 2 2 1 1 0 0 0 3 3 0 0 1 0 0 4 4 0 1 1 1 0 5 5 0 0 0 0 1
This naive approach works:
> uniqueStrings <- unique(unlist(df$y))
> n <- ncol(df)
> for (i in 1:length(uniqueStrings)) {
+ df[, n + i] <- sapply(df$y, function(x) ifelse(uniqueStrings[i] %in% x, 1, 0))
+ colnames(df)[n + i] <- uniqueStrings[i]
+ }
However it is very ugly, lazy and slow with big data frames.
Any suggestions? Something fancy from the tidyverse
?
UPDATE: I got 3 different approaches below. I tested them using system.time
on my (Windows 7, 32GB RAM) laptop on a real dataset, comprising of 1M rows, each row containing a list of length 1 to 4 strings (out of ~350 unique string values), overall 200MB on disk. So the expected result is a data frame with dimensions 1M x 350. The tidyverse
(@Sotos) and base
(@joel.wilson) approaches took so long I had to restart R. The qdapTools
(@akrun) approach however worked fantastic:
> system.time(res1 <- mtabulate(varsLists))
user system elapsed
47.05 10.27 116.82
So this is the approach I'll mark accepted.
回答1:
We can use mtabulate
library(qdapTools)
cbind(df[1], mtabulate(df$y))
# x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1
回答2:
Another idea,
library(dplyr)
library(tidyr)
df %>%
unnest(y) %>%
mutate(new = 1) %>%
spread(y, new, fill = 0)
# x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1
Further to the cases you mentioned in comments, we can use dcast
from reshape2
as it is more flexible than spread
,
df2 <- df %>%
unnest(y) %>%
group_by(x) %>%
filter(!duplicated(y)) %>%
ungroup()
reshape2::dcast(df2, x ~ y, value.var = 'y', length)
# x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1
#or with df$x <- c(1, 1, 2, 2, 3)
# x A B C D E
#1 1 1 1 0 0 0
#2 2 0 1 1 1 0
#3 3 0 0 0 0 1
#or with df$x <- rep(1,5)
# x A B C D E
#1 1 1 1 1 1 1
回答3:
this involves no external packages,
# thanks to Sotos for suggesting to use `unique(unlist(df$y))` instead of `LETTERS[1!:5]`
sapply(unique(unlist(df$y)), function(j) as.numeric(grepl(j, df$y)))
# A B C D E
#[1,] 1 0 0 0 0
#[2,] 1 1 0 0 0
#[3,] 0 0 1 0 0
#[4,] 0 1 1 1 0
#[5,] 0 0 0 0 1
来源:https://stackoverflow.com/questions/41672579/r-create-dummy-variables-based-on-a-categorical-variable-of-lists