Efficient Collapse Dummy Variables

前端 未结 2 1814
闹比i
闹比i 2021-01-18 07:00

What is an efficient way (any solution including non-base packages welcomed) to collapse dummy variables back into a factor.

   race.White race.Hispanic race         


        
相关标签:
2条回答
  • 2021-01-18 07:37

    Another idea:

    ff = function(x)
    {
        ans = integer(nrow(x))
        for(i in seq_along(x)) ans[as.logical(x[[i]])] = i
        names(x)[ans]
    }                                    
    sub("[^.]+\\.", "", ff(dat))
    #[1] "White"    "Asian"    "White"    "Black"    "Asian"    "Hispanic" "White"    "White"    "White"    "Black"
    

    And to compare with akrun's alternatives:

    akrun1 = function(x) names(x)[max.col(x, "first")]
    akrun2 = function(x) names(x)[(as.matrix(x) %*% seq_along(x))[, 1]]
    akrun3 = function(x) names(x)[do.call(pmax, x * seq_along(x)[col(x)])]
    akrunlike = function(x) names(x)[do.call(pmax, Map("*", x, seq_along(x)))]
    
    DF = setNames(as.data.frame("[<-"(matrix(0L, 1e4, 1e3), 
                                      cbind(seq_len(1e4), sample(1e3, 1e4, TRUE)), 
                                      1L)), 
                  paste("fac", 1:1e3, sep = ""))
    
    identical(ff(DF), akrun1(DF))
    #[1] TRUE
    identical(ff(DF), akrun2(DF))
    #[1] TRUE
    identical(ff(DF), akrun3(DF))
    #[1] TRUE
    identical(ff(DF), akrunlike(DF))
    #[1] TRUE
    microbenchmark::microbenchmark(ff(DF), akrun1(DF), akrun2(DF), 
                                   akrun3(DF), akrunlike(DF), 
                                   as.matrix(DF), col(DF), times = 30)
    #Unit: milliseconds
    #          expr        min         lq     median         uq        max neval
    #        ff(DF)   61.99124   64.56194   78.62267  102.18424  152.64891    30
    #    akrun1(DF)  296.89042  314.28641  327.95059  353.46185  394.46013    30
    #    akrun2(DF)  103.76105  114.01497  120.12191  129.86513  166.13266    30
    #    akrun3(DF) 1141.46478 1163.96842 1178.92961 1203.83848 1231.70346    30
    # akrunlike(DF)  125.47542  130.20826  141.66123  157.92743  203.42331    30
    # as.matrix(DF)   19.46940   20.54543   28.22377   35.69575   87.06001    30
    #       col(DF)  103.61454  112.75450  116.00120  126.09138  176.97435    30
    

    I included as.matrix() and col() just to show that "list"-y structures can be convenient on efficient looping as is. E.g., in contrast to a by-row looping, a way to use by-column looping doesn't need time to transform the structure of data.

    0 讨论(0)
  • 2021-01-18 07:49

    We can use max.col to get the column index, subset the column names based on that and use sub to remove the prefix.

    sub('[^.]+\\.', '', names(dat)[max.col(dat)])
    #[1] "White"    "Asian"    "White"    "Black"    "Asian"    "Hispanic"
    #[7] "White"    "White"    "White"    "Black"  
    

    Here, I assumed that there is a single 1 per each row. If there are multiple 1s, we can use the option ties.method='first' or ties.method='last'.


    Or another option is doing the %*% with the sequence of columns, subset the column names, and remove the prefix with sub.

     sub('[^.]+\\.', '', names(dat)[(as.matrix(dat) %*%seq_along(dat))[,1]])
    

    Or we can use pmax

    sub('[^.]+\\.', '', names(dat)[do.call(pmax,dat*seq_along(dat)[col(dat)])])
    
    0 讨论(0)
提交回复
热议问题