Dummify character column and find unique values [duplicate]

问题

I have a dataframe with the following structure

test <- data.frame(col = c('a; ff; cc; rr;', 'rr; a; cc; e;'))

Now I want to create a dataframe from this which contains a named column for each of the unique values in the test dataframe. A unique value is a value ended by the ';' character and starting with a space, not including the space. Then for each of the rows in the column I wish to fill the dummy columns with either a 1 or a 0. As given below

data.frame(a = c(1,1), ff = c(1,0), cc = c(1,1), rr = c(1,0), e = c(0,1))

  a ff cc rr e
1 1  1  1  1 0
2 1  0  1  1 1

I tried creating a df using for loops and the unique values in the column but it's getting to messy. I have a vector available containing the unique values of the column. The problem is how to create the ones and zeros. I tried some mutate_all() function with grep() but this did not work.

回答1:

I'd use splitstackshape and mtabulate from qdapTools packages to get this as a one liner, i.e.

library(splitstackshape)
library(qdapTools)

mtabulate(as.data.frame(t(cSplit(test, 'col', sep = ';', 'wide'))))
#   a cc ff rr e
#V1 1  1  1  1 0
#V2 1  1  0  1 1

It can also be full splitstackshape as @A5C1D2H2I1M1N2O1R2T1 mentions in comments,

cSplit_e(test, "col", ";", mode = "binary", type = "character", fill = 0)

回答2:

Here's a possible data.table implementation. First we split the rows into columns, melt into a single column and the spread it wide while counting the events for each row

library(data.table)
test2 <- setDT(test)[, tstrsplit(col, "; |;")]
dcast(melt(test2, measure = names(test2)), rowid(variable) ~ value, length)
#    variable a cc e ff rr
# 1:        1 1  1 0  1  1
# 2:        2 1  1 1  0  1

回答3:

We can do this with tidyverse

library(tidyverse)
rownames_to_column(test, 'grp') %>%
        separate_rows(col) %>% 
        filter(col!="")  %>% 
        count( grp, col) %>%
        spread(col, n, fill = 0) %>%
        ungroup() %>% 
        select(-grp)
# A tibble: 2 × 5
#      a    cc     e    ff    rr
#* <dbl> <dbl> <dbl> <dbl> <dbl>
#1     1     1     0     1     1
#2     1     1     1     0     1

回答4:

Here is a base R solution. First remove the space. Get all the unique combination. Split the actual data frame and then check presence of it in the cols which will have all the combo. Then you get a logical matrix which can be easily converted into numeric.

test=as.data.frame(apply(test,2,function(x)gsub('\\s+', '',x)))
cols=unique(unlist(strsplit(as.character(test$col), split = ';'))) 
yy=strsplit(as.character(test$col), split = ';') 
z=as.data.frame(do.call.rbind(lapply(yy, function(x) cols %in% x)))
names(z)=cols
z=as.data.frame(lapply(z, as.integer))

回答5:

Another approach with tidytext and tidyverse

library(tidyverse)
library(tidytext) #for unnest_tokens()
df <- test %>%
    unnest_tokens(word, col) %>%
    rownames_to_column(var="row") %>%
    mutate(row = floor(parse_number(row)),
           val = 1) %>%
    spread(word, val, fill = 0) %>%
    select(-row)
df
#    a cc e ff rr
#1   1  1 0  1  1
#2   1  1 1  0  1

回答6:

Here's a base R approach:

x   <- strsplit(as.character(test$col), ";\\s?") # split the strings
lvl <- unique(unlist(x))                         # get unique elements
x   <- lapply(x, factor, levels = lvl)           # convert to factor
t(sapply(x, table))                              # count elements and transpose
#     a ff cc rr e
#[1,] 1  1  1  1 0
#[2,] 1  0  1  1 1

回答7:

Another simple solution without any extra packages:

x =  c('a; ff; cc; rr;', 'rr; a; cc; e;')
G = lapply(strsplit(x,';'), trimws)
dict = sort(unique(unlist(G)))
do.call(rbind, lapply(G, function(g) 1*sapply(dict, function(d) d %in% g)))

来源：https://stackoverflow.com/questions/42387859/dummify-character-column-and-find-unique-values

标签

dummy-variable