How to use dplyr as an alternative to aggregate


I have a dataframe times that looks like this:

user     time
A        7/7/2010
B        7/12/2010
C        7/12/2010
A        7/12/2010 
C        7/


                      
2 Answers

后悔当初, 2021-01-24 11:56

Updated Answer: Based on your comment, how about this:

library(dplyr)

# Data (with a few additions)
times = read.table(text="user     time
A        7/7/2010
B        7/12/2010
B 7/13/2010
C        7/12/2010
A        7/12/2010 
A 7/13/2010
C        7/15/2010", header=TRUE, stringsAsFactors=FALSE)

times$time = as.Date(times$time, "%m/%d/%Y")

times



  user       time
1    A 2010-07-07
2    B 2010-07-12
3    B 2010-07-13
4    C 2010-07-12
5    A 2010-07-12
6    A 2010-07-13
7    C 2010-07-15



times %>% group_by(user) %>%
  summarise(First=min(time),
            Last=max(time),
            N = n(),
            minDiff=min(diff(time)),
            meanDiff=mean(diff(time)),
            NumDiffUniq = length(unique(diff(time))))



   user      First       Last     N        minDiff       meanDiff NumDiffUniq
1     A 2010-07-07 2010-07-13     3         1 days         3 days           2
2     B 2010-07-12 2010-07-13     2         1 days         1 days           1
3     C 2010-07-12 2010-07-15     2         3 days         3 days           1
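
One caveat: diff() works on rows in the order they appear, so the gaps above are only meaningful because each user's dates happen to be in ascending order in the sample data. A minimal sketch of sorting inside the pipeline first, assuming dplyr is installed; the shuffled input below is hypothetical and made up for illustration:

```r
library(dplyr)

# Hypothetical input: same kind of data as above, but deliberately out of order
times_shuffled <- data.frame(
  user = c("A", "A", "A", "B", "B"),
  time = as.Date(c("2010-07-13", "2010-07-07", "2010-07-12",
                   "2010-07-13", "2010-07-12")),
  stringsAsFactors = FALSE
)

gaps <- times_shuffled %>%
  arrange(user, time) %>%   # sort so diff() sees ascending dates per user
  group_by(user) %>%
  summarise(minDiff  = min(diff(time)),
            meanDiff = mean(diff(time)))
```

Without the arrange() step, user A's differences here would include a negative gap, which would then be picked up by min().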



Original Answer:

I'm not clear on what you're trying to accomplish. If you just want your data frame to be sorted, then with dplyr you would do:

library(dplyr)

times.sorted = times %>% arrange(user, time)


If you want time to become a string of dates for each user, then you could do:

times.summary = times %>% group_by(user) %>%
  summarise(time = paste(time, collapse=","))


But note that for each user this will result in a single string containing the dates.

times.summary



   user                time
1     A  7/7/2010,7/12/2010
2     B           7/12/2010
3     C 7/12/2010,7/15/2010



If you actually want each cell to be a vector of dates, you could make each cell a list (though there might be a better way). For example:

times.new = times %>% group_by(user) %>%
  summarise(time = list(as.vector(time)))

times.new$time



[[1]]
[1] "7/7/2010"  "7/12/2010"

[[2]]
[1] "7/12/2010"

[[3]]
[1] "7/12/2010" "7/15/2010"



But if your goal is to analyze your data by group, then you don't actually need to do any of the above. You can use base, dplyr, or data.table functions to perform any analysis by group without first sorting your data.
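
As a rough sketch of the base R route mentioned above, aggregate() and tapply() can compute per-user values directly on the unsorted data. The variable names n_per_user and min_gap are my own, and the gaps are expressed as plain numbers of days rather than difftime values:

```r
# Sample data from the updated answer, rebuilt so this block is self-contained
times <- data.frame(
  user = c("A", "B", "B", "C", "A", "A", "C"),
  time = as.Date(c("7/7/2010", "7/12/2010", "7/13/2010", "7/12/2010",
                   "7/12/2010", "7/13/2010", "7/15/2010"), "%m/%d/%Y"),
  stringsAsFactors = FALSE
)

# Number of rows per user; the rows do not need to be sorted by user first
n_per_user <- aggregate(time ~ user, data = times, FUN = length)

# Smallest gap in days per user; sort within each group before diff()
min_gap <- tapply(as.numeric(times$time), times$user,
                  function(d) min(diff(sort(d))))
```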
                                                                        
                                                        
            
            
              
                
清歌不尽, 2021-01-24 12:01
              
            
            
                                                                       
Based on the dplyr solution by eipi10 and the suggestion of nrussell, I've written the following solution using data.table.

First, convert the time column of times to Date:

times$time = as.Date(times$time, "%m/%d/%Y")


Then you'll need to convert times to a data.table using:

library(data.table)
times <- as.data.table(times)


Overwriting times was useful for my purposes, but you may prefer to assign the result to a new variable. Once the data frame is a data.table, compute the per-user summary with:

# by must be an argument of [ ], not an element inside .()
new.times <- times[, 
                    .(first = min(time),
                      last = max(time),
                      n = .N,
                      meandiff = mean(diff(time)),
                      mindiff = min(diff(time)),
                      numdiffuniq = length(unique(diff(time)))),
                    by = 'user']


Running on a Linux virtual machine with 128 GB of RAM and a sample of 1000 entries, the elapsed runtime was 0.43s.
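
Timings like this depend heavily on the machine and the data, so it may be worth measuring on your own setup. A rough sketch, assuming data.table is installed; the table sim, its 10 user labels, the date range, and the seed are all made up for illustration and are not the sample used above:

```r
library(data.table)

set.seed(1)  # arbitrary seed so the synthetic sample is reproducible
# Hypothetical sample: 1000 rows, 10 users, dates drawn from one year
sim <- data.table(
  user = sample(LETTERS[1:10], 1000, replace = TRUE),
  time = as.Date("2010-01-01") + sample(0:364, 1000, replace = TRUE)
)
setkey(sim, user, time)  # sorts by user, then time, so diff() sees ascending dates

# system.time() returns user, system and elapsed seconds; keep the elapsed part
elapsed <- system.time(
  res <- sim[, .(first = min(time),
                 last = max(time),
                 n = .N,
                 meandiff = mean(diff(time))),
             by = user]
)[["elapsed"]]
```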

See this tutorial for more on data.table.
                                                                        
                                                        
            
            
              
                