Split thousands of columns at a time by '/' onto multiple rows, sort the values in the new rows and add 'NA' values

逝去的感伤 2021-01-22 21:24

I would like to split a data frame with thousands of columns. The data frame looks like this:

# sample data of four columns
sample <- read.table(stdin(), header = TRUE)
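
For a runnable illustration, here is a small hypothetical stand-in of the same shape (the real data has thousands of columns): a POS column followed by sample columns of '/'-separated integers. The column names S1-S4 and all of the values are invented.

# hypothetical stand-in data, invented for illustration
sample <- data.frame(
    POS = c(101L, 102L, 103L),
    S1  = c("0/1", "0/2",   "0"),
    S2  = c("2",   "0/1/2", "1"),
    S3  = c("0",   "1",     "2"),
    S4  = c("1/2", "0",     "0/1"),
    stringsAsFactors = FALSE
)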


        
4 Answers
  •  梦毁少年i
    2021-01-22 21:52

    I guess the interesting data is really a matrix

    m = as.matrix(sample[,-1])
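
    With the small hypothetical stand-in from the question (invented for illustration; the real data would have thousands of columns), m is a 3 x 4 character matrix:

    dim(m)
    ## [1] 3 4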
    

    The underlying data is then a character vector with relatively few unique values. We'll map each unique value to its integer representation, using the map wherever possible to minimize the number of iterations in any loops that are necessary.

    s = as.character(m)
    map = lapply(strsplit(setNames(unique(s), unique(s)), "/"), as.integer)
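
    Each unique string maps to its integer components; with the hypothetical stand-in data, for example:

    map[["0/1/2"]]
    ## [1] 0 1 2
    map[["2"]]
    ## [1] 2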
    

    Here's the number of times each row needs to be replicated

    row.len = apply(matrix(sapply(map, max)[s], ncol=ncol(m)), 1, max) + 1
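
    In the hypothetical example every input row contains a 2 somewhere, so each of the three rows expands to three output rows:

    row.len
    ## [1] 3 3 3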
    

    and, for each element of s, the offset (a column-major linear index) at which its block starts in the result matrix

    offset = head(c(1, cumsum(rep(row.len, ncol(m))) + 1), -1)
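
    For the hypothetical data (three output rows per input row, with cells traversed column by column), the blocks start every three positions:

    offset
    ## [1]  1  4  7 10 13 16 19 22 25 28 31 34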
    

    Calculate the values of each mapped element, and the linear index in the result matrix at which each value will be placed

    v = unlist(unname(map)[match(s, names(map))])
    idx = rep(offset, sapply(map, length)[s]) + v
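
    With the hypothetical data, the first cell "0/1" contributes the values 0 and 1 at linear indices 1 and 2 of the result, the second cell "0/2" contributes 0 and 2 at indices 4 and 6, and so on:

    head(cbind(idx = idx, v = v), 4)
    ##      idx v
    ## [1,]   1 0
    ## [2,]   2 1
    ## [3,]   4 0
    ## [4,]   6 2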
    

    Finally, allocate the result matrix of NA's, and update the non-NA values

    ans = matrix(NA_integer_, sum(row.len), ncol(m))
    ans[idx] = v
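
    For the hypothetical data, the first block (the three output rows generated from the first input row) looks like this:

    ans[1:3, ]
    ##      [,1] [,2] [,3] [,4]
    ## [1,]    0   NA    0   NA
    ## [2,]    1   NA   NA    1
    ## [3,]   NA    2   NA    2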
    

    As a function:

    flatten <- function(sample) {
        ## character matrix of the "/"-delimited fields (drop the POS column)
        m = as.matrix(sample[,-1])
        s = as.character(m)
        ## map each unique string to its integer components
        map = lapply(strsplit(setNames(unique(s), unique(s)), "/"), as.integer)
        ## each input row expands to 1 + (largest value in that row) output rows
        row.len = apply(matrix(sapply(map, max)[s], ncol=ncol(m)), 1, max) + 1
        ## column-major linear index where each cell's block starts in the result
        offset = head(c(1, cumsum(rep(row.len, ncol(m))) + 1), -1)
        ## split values, and the linear indices at which they are placed
        v = unlist(unname(map)[match(s, names(map))])
        idx = rep(offset, sapply(map, length)[s]) + v
        ## allocate the all-NA result and fill in the known values
        ans = matrix(NA_integer_, sum(row.len), ncol(m),
            dimnames=list(NULL, colnames(sample)[-1]))
        ans[idx] = v
        cbind(POS=rep(sample[,1], row.len), as.data.frame(ans))
    }
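
    Applied to the hypothetical stand-in from the question, every split value lands in the output row whose offset within its block equals the value, with NA filling the gaps:

    flatten(sample)
    ##   POS S1 S2 S3 S4
    ## 1 101  0 NA  0 NA
    ## 2 101  1 NA NA  1
    ## 3 101 NA  2 NA  2
    ## 4 102  0  0 NA  0
    ## 5 102 NA  1  1 NA
    ## 6 102  2  2 NA NA
    ## 7 103  0 NA NA  0
    ## 8 103 NA  1 NA  1
    ## 9 103 NA NA  2 NA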
    

    The slowest part of this will be the apply() call used to calculate row.len. Some timing (I guess the dimensions are not quite those of the real problem...)

    xx = do.call(rbind, replicate(10000, sample, simplify=FALSE))
    dim(xx)
    ## [1] 30000     5
    system.time(flatten(xx))
    ##   user  system elapsed 
    ##  0.192   0.000   0.194 
    

    versus about 5s for the data.table solution in another answer.
