Pivoting rows into columns


我寻月下人不归 2021-02-05 15:57

Suppose (to simplify) I have a table containing some control vs. treatment data:

Which, Color, Response, Count
Control, Red, 2, 10
Control, Blue, 3, 20
Treatment

4 answers

被撕碎了的回忆 (OP) 2021-02-05 16:18

Reshape does indeed work for pivoting a skinny data frame (e.g., from a simple SQL query) into a wide matrix, and it is very flexible, but it's slow. For large amounts of data, it is very slow. Fortunately, if you only want to pivot to a fixed shape, it's fairly easy to write a little C function that does the pivot fast.

In my case, pivoting a skinny data frame with 3 columns and 672,338 rows took 34 seconds with reshape, 25 seconds with my R code, and 2.3 seconds with C.  Ironically, the C implementation was probably easier to write than my (tuned for speed) R implementation.

Here's the core C code for pivoting floating point numbers.  Note that it assumes that you have already allocated a correctly sized result matrix in R before calling the C code, which causes the R-devel folks to shudder in horror:

#include <R.h>
#include <Rinternals.h>

/*
 * Pivot (row index, column index, value) triples into a wide matrix.
 * This mutates the result matrix in place.
 */
SEXP
dtk_pivot_skinny_to_wide(SEXP n_row, SEXP vi_1, SEXP vi_2, SEXP v_3, SEXP result)
{
   int ii, max_i;
   unsigned int pos;
   int nr = *INTEGER(n_row);      /* number of rows in the result matrix */
   int * aa = INTEGER(vi_1);      /* 1-based row index of each value */
   int * bb = INTEGER(vi_2);      /* 1-based column index of each value */
   double * cc = REAL(v_3);       /* the values to place */
   double * rr = REAL(result);    /* preallocated result matrix, as a flat vector */
   max_i = length(vi_2);
   /*
    * R stores matrices by column.  Do ugly pointer-like arithmetic to
    * map the matrix to a flat vector.  We are translating this R code:
    *    for (ii in 1:length(vi.2))
    *       result[((n.row * (vi.2[ii] -1)) + vi.1[ii])] <- v.3[ii]
    */
   for (ii = 0; ii < max_i; ++ii) {
      /* The extra -1 converts R's 1-based position to a 0-based C offset. */
      pos = ((nr * (bb[ii] - 1)) + aa[ii] - 1);
      rr[pos] = cc[ii];
      /* printf("ii: %d \t value: %g \t result index:  %d \t new value: %g\n", ii, cc[ii], pos, rr[pos]); */
   }
   return(result);
}
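
To make the index arithmetic easier to check outside of R, here is a minimal standalone sketch of the same loop in plain C. It is not the code from the timings above: the function name pivot_skinny_to_wide, the n_col parameter, and the bounds check are assumptions added for illustration, and the result buffer is still expected to be allocated by the caller.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical standalone version of the pivot loop (no R headers).
 * row_idx and col_idx are 1-based, as they would be coming from R.
 * result must hold n_row * n_col doubles, laid out column by column.
 * Returns the number of values written; out-of-range pairs are skipped.
 */
size_t pivot_skinny_to_wide(size_t n_row, size_t n_col,
                            const int *row_idx, const int *col_idx,
                            const double *value, size_t n,
                            double *result)
{
    size_t written = 0;

    assert(result != NULL);

    for (size_t i = 0; i < n; ++i) {
        /* Skip indices that fall outside the matrix instead of
         * writing past the end of the buffer. */
        if (row_idx[i] < 1 || col_idx[i] < 1 ||
            (size_t)row_idx[i] > n_row || (size_t)col_idx[i] > n_col) {
            continue;
        }
        /* Column-major: element (r, c) lives at n_row * (c - 1) + (r - 1). */
        size_t pos = n_row * (size_t)(col_idx[i] - 1)
                   + (size_t)(row_idx[i] - 1);
        result[pos] = value[i];
        ++written;
    }
    return written;
}
```

For example, pivoting the triples (1, 1, 10), (2, 1, 20) and (1, 2, 30) into a zero-filled 2 x 2 matrix leaves the flat buffer as 10, 20, 30, 0, which read by column is the matrix with rows (10, 30) and (20, 0). Using size_t for the position also avoids the int overflow that nr * (bb[ii] - 1) could hit on very large matrices.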

    
             
                                                        
            
            
              
                