Efficient Collapse Dummy Variables

前端未结

关注

 2  1817

What is an efficient way (any solution including non-base packages welcomed) to collapse dummy variables back into a factor.

   race.White race.Hispanic race


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  你的背包        
                
              
                            
                2021-01-18 07:37
              
            
            
                                                                       
Another idea:

ff = function(x)
{
    ans = integer(nrow(x))
    for(i in seq_along(x)) ans[as.logical(x[[i]])] = i
    names(x)[ans]
}                                    
sub("[^.]+\\.", "", ff(dat))
#[1] "White"    "Asian"    "White"    "Black"    "Asian"    "Hispanic" "White"    "White"    "White"    "Black"


And to compare with akrun's alternatives:

akrun1 = function(x) names(x)[max.col(x, "first")]
akrun2 = function(x) names(x)[(as.matrix(x) %*% seq_along(x))[, 1]]
akrun3 = function(x) names(x)[do.call(pmax, x * seq_along(x)[col(x)])]
akrunlike = function(x) names(x)[do.call(pmax, Map("*", x, seq_along(x)))]

DF = setNames(as.data.frame("[<-"(matrix(0L, 1e4, 1e3), 
                                  cbind(seq_len(1e4), sample(1e3, 1e4, TRUE)), 
                                  1L)), 
              paste("fac", 1:1e3, sep = ""))

identical(ff(DF), akrun1(DF))
#[1] TRUE
identical(ff(DF), akrun2(DF))
#[1] TRUE
identical(ff(DF), akrun3(DF))
#[1] TRUE
identical(ff(DF), akrunlike(DF))
#[1] TRUE
microbenchmark::microbenchmark(ff(DF), akrun1(DF), akrun2(DF), 
                               akrun3(DF), akrunlike(DF), 
                               as.matrix(DF), col(DF), times = 30)
#Unit: milliseconds
#          expr        min         lq     median         uq        max neval
#        ff(DF)   61.99124   64.56194   78.62267  102.18424  152.64891    30
#    akrun1(DF)  296.89042  314.28641  327.95059  353.46185  394.46013    30
#    akrun2(DF)  103.76105  114.01497  120.12191  129.86513  166.13266    30
#    akrun3(DF) 1141.46478 1163.96842 1178.92961 1203.83848 1231.70346    30
# akrunlike(DF)  125.47542  130.20826  141.66123  157.92743  203.42331    30
# as.matrix(DF)   19.46940   20.54543   28.22377   35.69575   87.06001    30
#       col(DF)  103.61454  112.75450  116.00120  126.09138  176.97435    30


I included as.matrix() and col() just to show that "list"-y structures can be convenient on efficient looping as is. E.g., in contrast to a by-row looping, a way to use by-column looping doesn't need time to transform the structure of data.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  慢半拍i        
                
              
                            
                2021-01-18 07:49
              
            
            
                                                                       
We can use max.col to get the column index, subset the column names based on that and use sub to remove the prefix.  

sub('[^.]+\\.', '', names(dat)[max.col(dat)])
#[1] "White"    "Asian"    "White"    "Black"    "Asian"    "Hispanic"
#[7] "White"    "White"    "White"    "Black"  


Here, I assumed that there is a single 1 per each row.  If there are multiple 1s, we can use the option ties.method='first' or ties.method='last'.



Or another option is doing the %*% with the sequence of columns, subset the column names, and remove the prefix with sub.

 sub('[^.]+\\.', '', names(dat)[(as.matrix(dat) %*%seq_along(dat))[,1]])




Or we can use pmax

sub('[^.]+\\.', '', names(dat)[do.call(pmax,dat*seq_along(dat)[col(dat)])])

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复