How to one-hot-encode factor variables with data.table?

我寻月下人不归 2020-12-05 16:28

For those unfamiliar, one-hot encoding simply refers to converting a column of categories (i.e. a factor) into multiple columns of binary indicator variables where each new

5 Answers
  • 2020-12-05 16:35

    If no one posts a clean way to write this out by hand each time, you can always make a function/macro:

    OHE <- function(dt, grp, encodeCols) {
            grpSymb = as.symbol(grp)
            for (col in encodeCols) {
                    colSymb = as.symbol(col)
                    eval(bquote(
                                dt[, .SD
                                   ][, V1 := 1
                                   ][, dcast(.SD, .(grpSymb) ~ .(colSymb), fun=sum, value.var='V1')
                                   ][, setnames(.SD, setdiff(colnames(.SD), grp), sprintf("%s_%s", col, setdiff(colnames(.SD), grp)))
                                   ][, dt <<- dt[.SD, on=grp]
                                   ]
                                ))
            }
            dt
    }
    
    dtOHE = OHE(dt, 'ID', c('Color', 'Shape'))
    dtOHE
    
       ID Color    Shape Color_blue Color_green Color_red Shape_cirlce Shape_square Shape_triangle
    1:  1 green   square          0           1         0            0            1              0
    2:  2   red triangle          0           0         1            0            0              1
    3:  3   red   square          0           0         1            0            1              0
    4:  4  blue triangle          1           0         0            0            0              1
    5:  5 green   cirlce          0           1         0            1            0              0
    
  • 2020-12-05 16:38

    Here you go:

    dcast(melt(dt, id.vars='ID'), ID ~ variable + value, fun = length)
    #   ID Color_blue Color_green Color_red Shape_cirlce Shape_square Shape_triangle
    #1:  1          0           1         0            0            1              0
    #2:  2          0           0         1            0            0              1
    #3:  3          0           0         1            0            1              0
    #4:  4          1           0         0            0            0              1
    #5:  5          0           1         0            1            0              0
    

    To also get columns for factor levels that do not occur in the data, you can do the following:

    res = dcast(melt(dt, id = 'ID', value.factor = TRUE), ID ~ value, drop = FALSE, fun = length)
    setnames(res, c("ID", unlist(lapply(2:ncol(dt),
                                 function(i) paste(names(dt)[i], levels(dt[[i]]), sep = "_")))))
    res
    #   ID Color_blue Color_green Color_red Color_purple Shape_cirlce Shape_square Shape_triangle
    #1:  1          0           1         0            0            0            1              0
    #2:  2          0           0         1            0            0            0              1
    #3:  3          0           0         1            0            0            1              0
    #4:  4          1           0         0            0            0            0              1
    #5:  5          0           1         0            0            1            0              0
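
    To run the snippets above you need the example table, which is not shown here. A minimal sketch that reconstructs it from the printed outputs follows. The level order and the unused `purple` level are inferred from the `Color_purple` column, and `cirlce` is spelled as it appears in the output, so treat the data as an assumption rather than the asker's exact code.

    ```r
    library(data.table)

    # Reconstructed from the printed outputs (an assumption, not the original code).
    # "purple" is an unused level; "cirlce" is spelled as in the output.
    dt <- data.table(
      ID    = 1:5,
      Color = factor(c("green", "red", "red", "blue", "green"),
                     levels = c("blue", "green", "red", "purple")),
      Shape = factor(c("square", "triangle", "square", "triangle", "cirlce"),
                     levels = c("cirlce", "square", "triangle"))
    )

    # Long format: one row per (ID, variable, value).
    long <- melt(dt, id.vars = "ID")

    # Wide format: one indicator column per observed variable/value pair.
    wide <- dcast(long, ID ~ variable + value, fun.aggregate = length)
    wide
    ```

    Because `melt` turns the factor values into character by default, the unused level gets no column here; the `value.factor` and `drop` arguments in the second snippet are what bring it back.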
    
  • 2020-12-05 16:39

    Here's a more generalized version of eddi's solution:

    one_hot <- function(dt, cols="auto", dropCols=TRUE, dropUnusedLevels=FALSE){
      # One-Hot-Encode unordered factors in a data.table
      # If cols = "auto", each unordered factor column in dt will be encoded. (Or specify a vector of column names to encode)
      # If dropCols=TRUE, the original factor columns are dropped
      # If dropUnusedLevels = TRUE, unused factor levels are dropped
    
      # Automatically get the unordered factor columns
      if(cols[1] == "auto") cols <- colnames(dt)[which(sapply(dt, function(x) is.factor(x) && !is.ordered(x)))]
    
      # Build tempDT containing an ID column and the 'cols' columns
      tempDT <- dt[, cols, with=FALSE]
      tempDT[, ID := .I]
      setcolorder(tempDT, unique(c("ID", colnames(tempDT))))
      for(col in cols) set(tempDT, j=col, value=factor(paste(col, tempDT[[col]], sep="_"), levels=paste(col, levels(tempDT[[col]]), sep="_")))
    
      # One-hot-encode
      if(dropUnusedLevels == TRUE){
        newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value, drop = T, fun = length)
      } else{
        newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value, drop = F, fun = length)
      }
    
      # Combine binarized columns with the original dataset
      result <- cbind(dt, newCols[, !"ID"])
    
      # If dropCols = TRUE, remove the original factor columns
      if(dropCols == TRUE){
        result <- result[, !cols, with=FALSE]
      }
    
      return(result)
    }
    

    Note that for large datasets it's probably better to use Matrix::sparse.model.matrix, which stores the indicator columns as a sparse matrix.
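
    As a rough illustration of the sparse route, here is a minimal sketch. The small data frame is made up for the example, and the formula mirrors the one used with `model.matrix` elsewhere in this thread.

    ```r
    library(Matrix)

    # Hypothetical toy data, for illustration only.
    df <- data.frame(
      Color = factor(c("green", "red", "red", "blue", "green")),
      Shape = factor(c("square", "triangle", "square", "triangle", "circle"))
    )

    # "- 1" removes the intercept. As with model.matrix, factors after the
    # first one are still coded with contrasts, so one level is left out.
    X <- sparse.model.matrix(~ Color + Shape - 1, data = df)

    class(X)
    dim(X)
    ```

    The result keeps only the non-zero entries in memory, which is where the saving comes from when there are many levels.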

    Update (2017)

    This function is now available in the mltools package as one_hot().

  • 2020-12-05 16:43

    You can solve this problem in a few lines:

    library(tidyverse)
    dt2 <- spread(dt,Color,Shape)
    dt3 <- spread(dt,Shape,Color)
    
    df <- cbind(dt2,dt3)
    
    df2 <- apply(df, 2, function(x){sapply(x, function(y){
      ifelse(is.na(y),0,1)
    })})
    
    df2 <- as.data.frame(df2)
    
    df <- cbind(dt,df2[,-1])
    


  • 2020-12-05 16:53

    Using model.matrix:

    > cbind(dt[, .(ID)], model.matrix(~ Color + Shape, dt))
       ID (Intercept) Colorgreen Colorred Colorpurple Shapesquare Shapetriangle
    1:  1           1          1        0           0           1             0
    2:  2           1          0        1           0           0             1
    3:  3           1          0        1           0           1             0
    4:  4           1          0        0           0           0             1
    5:  5           1          1        0           0           0             0
    

    This makes the most sense if you're doing modelling.

    If you want to suppress the intercept (and restore the aliased column for the 1st variable):

    > cbind(dt[, .(ID)], model.matrix(~ Color + Shape - 1, dt))
       ID Colorblue Colorgreen Colorred Colorpurple Shapesquare Shapetriangle
    1:  1         0          1        0           0           1             0
    2:  2         0          0        1           0           0             1
    3:  3         0          0        1           0           1             0
    4:  4         1          0        0           0           0             1
    5:  5         0          1        0           0           0             0
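
    Even without the intercept, the second factor still loses its first level (there is no `Shapecirlce` column above). If a full indicator column for every level of every factor is wanted, one option is to pass unreduced contrasts through `contrasts.arg`. This is a sketch on a plain data frame reconstructed from the printed output, so the data itself is an assumption:

    ```r
    # Reconstructed example data (an assumption); "purple" is an unused level
    # and "cirlce" is spelled as in the printed output.
    df <- data.frame(
      ID    = 1:5,
      Color = factor(c("green", "red", "red", "blue", "green"),
                     levels = c("blue", "green", "red", "purple")),
      Shape = factor(c("square", "triangle", "square", "triangle", "cirlce"),
                     levels = c("cirlce", "square", "triangle"))
    )

    # contrasts(x, contrasts = FALSE) returns an identity-style coding matrix,
    # so no level is dropped as a baseline.
    full_contrasts <- list(
      Color = contrasts(df$Color, contrasts = FALSE),
      Shape = contrasts(df$Shape, contrasts = FALSE)
    )

    mm <- model.matrix(~ Color + Shape - 1, data = df,
                       contrasts.arg = full_contrasts)
    cbind(df["ID"], mm)
    ```

    This is convenient for inspection or for methods that expect plain one-hot input, but the columns are linearly dependent, so it is usually not what you want for an unregularized linear model.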
    