Create new dummy variable columns from categorical variable

礼貌的吻别 2020-11-28 04:03

I have several data sets with 75,000 observations and a type variable that can take on a value from 0 to 4. I want to add five new dummy variables to each data set f

8 answers
  • 2020-11-28 04:38

    Drew, this is much faster and shouldn't cause any crashes.

    > binom <- data.frame(data=runif(1e5),type=sample(0:4,1e5,TRUE))
    > for(t in unique(binom$type)) {
    +   binom[paste("type",t,sep="")] <- ifelse(binom$type==t,1,0)
    + }
    > head(binom)
            data type type2 type4 type1 type3 type0
    1 0.11787309    2     1     0     0     0     0
    2 0.11884046    4     0     1     0     0     0
    3 0.92234950    4     0     1     0     0     0
    4 0.44759259    1     0     0     1     0     0
    5 0.01669651    2     1     0     0     0     0
    6 0.33966184    3     0     0     0     1     0
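
    Because the loop iterates over unique(binom$type), the new columns appear in whatever order the values first occur (type2, type4, ... above). If a fixed column order across several data sets matters, one option is to loop over the known values instead. A minimal sketch, where the helper name add_dummies, its arguments, and the two small data frames are all hypothetical stand-ins:

    ```r
    # Hypothetical helper: adds one 0/1 column per value in `values`,
    # always in the same order regardless of how the data is sorted.
    add_dummies <- function(df, col = "type", values = 0:4) {
      for (v in values) {
        df[[paste0(col, v)]] <- as.integer(df[[col]] == v)
      }
      df
    }

    set.seed(1)
    # Assumed stand-ins for the several data sets mentioned in the question.
    datasets <- list(
      a = data.frame(data = runif(10), type = sample(0:4, 10, TRUE)),
      b = data.frame(data = runif(10), type = sample(0:4, 10, TRUE))
    )

    # Apply the same transformation to every data set in the list.
    datasets <- lapply(datasets, add_dummies)
    names(datasets$a)
    ```

    Each data frame should then carry type0 through type4 in that order, after the original columns.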
    
  • 2020-11-28 04:39

    ifelse is vectorized, so if I understand your code correctly, you don't need that sapply. And I wouldn't use merge - I would use SQLite or PostgreSQL.

    Some sample data would help too :-)
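
    To illustrate the vectorization point, here is a minimal sketch. The data frame and its type column are assumed, since the question's own code is not shown here:

    ```r
    # Assumed sample data; the original question's data is not available.
    set.seed(1)
    binom <- data.frame(data = runif(10), type = sample(0:4, 10, TRUE))

    # The comparison works on the whole column at once, so no sapply() is
    # needed; as.integer() turns TRUE/FALSE into 1/0.
    binom$type2 <- as.integer(binom$type == 2)

    # ifelse() yields the same values (as doubles rather than integers).
    via_ifelse <- ifelse(binom$type == 2, 1, 0)
    all(binom$type2 == via_ifelse)
    ```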

  • 2020-11-28 04:42

    If you're open to using the data.table package, the mltools package has a one_hot() function that works on data.tables.

    library(data.table)
    library(mltools)
    
    binom <- data.table(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
    one_hot(binom)
    
                     y          x catVar_0 catVar_1 catVar_2 catVar_3 catVar_4
         1: 0.90511891 0.83045050        0        0        1        0        0
         2: 0.91375984 0.73273830        0        0        0        1        0
         3: 0.01926608 0.10301409        0        0        1        0        0
         4: 0.48691138 0.24428157        0        1        0        0        0
         5: 0.60660396 0.09132816        0        0        1        0        0
        ---                                                                   
     99996: 0.12908356 0.26157731        0        1        0        0        0
     99997: 0.96397273 0.98959000        0        1        0        0        0
     99998: 0.16818414 0.37460941        1        0        0        0        0
     99999: 0.72610508 0.72055867        1        0        0        0        0
    100000: 0.89710998 0.24155507        0        0        0        0        1
    

    Usage

    one_hot(dt, cols = "auto", sparsifyNAs = FALSE, 
            naCols = FALSE, dropCols = TRUE,
            dropUnusedLevels = FALSE)
    

    The cols argument controls which column(s) are one-hot encoded. The default, cols = "auto", encodes all unordered factor columns, so the command below is equivalent for this data. Naming the columns explicitly only matters when the data.table contains factors that should not be encoded.

    one_hot(binom, cols="catVar")
    
  • 2020-11-28 04:48

    You can use the dummies package. Sample data:

    binom <- data.frame(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
    head(binom)
    
              y          x catVar
    1 0.4143348 0.09721401      1
    2 0.3140782 0.54340539      3
    3 0.1262037 0.51820499      2
    4 0.7159850 0.13167720      3
    5 0.8203528 0.94116026      3
    6 0.2169781 0.82020216      1
    

    Solution:

    library(dummies)
    binom<-dummy.data.frame(binom)
    head(binom)
    
              y          x catVar0 catVar1 catVar2 catVar3 catVar4
    1 0.4143348 0.09721401       0       1       0       0       0
    2 0.3140782 0.54340539       0       0       0       1       0
    3 0.1262037 0.51820499       0       0       1       0       0
    4 0.7159850 0.13167720       0       0       0       1       0
    5 0.8203528 0.94116026       0       0       0       1       0
    6 0.2169781 0.82020216       0       1       0       0       0
    
  • 2020-11-28 04:49

    The nnet package for single-layer neural networks (which don't understand factors) has a conversion command: class.ind.
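
    A minimal sketch of how class.ind could be used here, assuming a small sample data frame with a factor column named catVar (the data and the catVar prefix are illustrative, not from the question):

    ```r
    library(nnet)  # a recommended package, normally installed alongside R

    set.seed(1)
    # Assumed sample data; levels are fixed so all five columns always appear.
    binom <- data.frame(
      y = runif(10),
      catVar = factor(sample(0:4, 10, TRUE), levels = 0:4)
    )

    # class.ind() returns a 0/1 matrix with one column per factor level;
    # its column names are the bare level labels ("0", "1", ...).
    ind <- class.ind(binom$catVar)
    colnames(ind) <- paste0("catVar", colnames(ind))

    # Attach the indicator columns to the original data frame.
    binom <- cbind(binom, ind)
    head(binom)
    ```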

  • 2020-11-28 04:56

    R has a "sub-language" that translates formulas into a design matrix, and in the spirit of the language you can take advantage of it. It's fast and concise. Example: you have a numeric predictor x, a categorical predictor catVar, and a response y.

    > binom <- data.frame(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
    > head(binom)
              y          x catVar
    1 0.5051653 0.34888390      2
    2 0.4868774 0.85005067      2
    3 0.3324482 0.58467798      2
    4 0.2966733 0.05510749      3
    5 0.5695851 0.96237936      1
    6 0.8358417 0.06367418      2
    

    You just do

    > A <- model.matrix(y ~ x + catVar,binom) 
    > head(A)
      (Intercept)          x catVar1 catVar2 catVar3 catVar4
    1           1 0.34888390       0       1       0       0
    2           1 0.85005067       0       1       0       0
    3           1 0.58467798       0       1       0       0
    4           1 0.05510749       0       0       1       0
    5           1 0.96237936       1       0       0       0
    6           1 0.06367418       0       1       0       0
    

    Done.
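
    Note that the output above has no catVar0 column: with an intercept in the formula, R's default treatment contrasts fold the first level into the baseline. If all five indicator columns are wanted, as in the question, one option is to drop the intercept. A minimal sketch on assumed small sample data:

    ```r
    set.seed(1)
    # Assumed sample data; levels are fixed so all five columns always appear.
    binom <- data.frame(
      x = runif(10),
      catVar = factor(sample(0:4, 10, TRUE), levels = 0:4)
    )

    # "- 1" removes the intercept, so every level of catVar gets its own column.
    D <- model.matrix(~ catVar - 1, binom)
    colnames(D)

    # Attach the full set of dummy columns to the data frame.
    binom <- cbind(binom, D)
    ```

    This full set of dummies is convenient as data columns, but it is collinear with an intercept, so the original formula with four columns is the form to use when fitting a model that includes one.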
