I have a dataset with 500k appointments lasting between 5 and 60 minutes.
tdata <- structure(list(Start = structure(c(1325493000, 1325493600, 1325494200,
Is this any better?
Create a time vector that spans the whole period at one-minute resolution, plus blank count vectors for appointment starts (additions) and ends (removals).
# one-minute grid from the first start to the last end
vecTime <- seq(from = tdata$Start[1], to = tdata$End[nrow(tdata)], by = 60)
addz <- array(0, length(vecTime))
remz <- array(0, length(vecTime))
# count how many appointments start and how many end at each distinct time
startAgg <- aggregate(tdata$Start, by = list(tdata$Start), length)
endAgg <- aggregate(tdata$End, by = list(tdata$End), length)
# place the counts on the minute grid (assumes start/end times fall on whole minutes):
# +1 for every start, -1 for every end
addz[which(vecTime %in% startAgg$Group.1)] <- startAgg$x
remz[which(vecTime %in% endAgg$Group.1)] <- -endAgg$x
# running total = number of appointments in progress at each minute
res <- data.frame(time = vecTime, occupancy = cumsum(addz + remz))
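As a minimal usage sketch (assuming res from above; the capacity value here is made up for illustration), the minutes where more appointments run concurrently than you have capacity for can be read straight off the result:
capacity <- 1  # hypothetical number of rooms/slots available
busy <- res[res$occupancy > capacity, ]
head(busy)  # minutes with more concurrent appointments than capacity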
Here's a strategy: order by start time, then unlist the data as start, end, start, end, ... and check whether that vector needs to be reordered. If it doesn't, there are no conflicts; if it does, you can see how many appointments (and which ones, if you like) conflict with each other.
# Using Roland's example:
DF <- read.table(text=" Start, End, Location, Room
1,2012-01-02 08:30:00,2012-01-02 08:40:00,LocationA,RoomA
2,2012-01-02 08:40:00,2012-01-02 08:50:00,LocationA,RoomA
3,2012-01-02 08:50:00,2012-01-02 09:55:00,LocationA,RoomA
4,2012-01-02 09:00:00,2012-01-02 09:10:00,LocationA,RoomA
5,2012-01-02 09:00:00,2012-01-02 09:10:00,LocationA,RoomB
6,2012-01-02 09:10:00,2012-01-02 09:20:00,LocationA,RoomB",header=TRUE,sep=",",stringsAsFactors=FALSE)
library(data.table)
dt = data.table(DF)
# the conflicting appointments
dt[order(Start),
.SD[unique((which(order(c(rbind(Start, End))) != 1:(2*.N)) - 1) %/% 2 + 1)],
by = list(Location, Room)]
# Location Room Start End
#1: LocationA RoomA 2012-01-02 08:50:00 2012-01-02 09:55:00
#2: LocationA RoomA 2012-01-02 09:00:00 2012-01-02 09:10:00
# and a speedier version of the above, which avoids constructing the full .SD:
dt[dt[order(Start),
.I[unique((which(order(c(rbind(Start, End))) != 1:(2*.N)) - 1) %/% 2 + 1)],
by = list(Location, Room)]$V1]
Perhaps the formula above for going from the mismatched order back to row indices can be simplified; I didn't spend much time thinking about it and just used the first thing that got the job done.
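For illustration only, here is the same index arithmetic written as a standalone base R helper (findConflicts is a made-up name, and it is meant to be applied within each Location/Room group, as the data.table call above does):
findConflicts <- function(Start, End) {
  o <- order(Start)                       # sort appointments by start time
  x <- c(rbind(Start[o], End[o]))         # interleave: s1, e1, s2, e2, ...
  bad <- which(order(x) != seq_along(x))  # positions that break the sorted order
  o[unique((bad - 1) %/% 2 + 1)]          # map interleaved positions back to rows
}
# e.g. restricted to RoomA, this should flag appointments 3 and 4:
with(DF[DF$Room == "RoomA", ], findConflicts(Start, End))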
I am not exactly sure I understand your goal. Still, this might be of use:
#I changed the example to actually have concurrent appointments
DF <- read.table(text=" Start, End, Location, Room
1, 2012-01-02 08:30:00, 2012-01-02 08:40:00, LocationA, RoomA
2, 2012-01-02 08:40:00, 2012-01-02 08:50:00, LocationA, RoomA
3, 2012-01-02 08:50:00, 2012-01-02 09:55:00, LocationA, RoomA
4, 2012-01-02 09:00:00, 2012-01-02 09:10:00, LocationA, RoomA
5, 2012-01-02 09:00:00, 2012-01-02 09:10:00, LocationA, RoomB
6, 2012-01-02 09:10:00, 2012-01-02 09:20:00, LocationA, RoomB",header=TRUE,sep=",",stringsAsFactors=FALSE)
DF$Start <- as.POSIXct(DF$Start, format="%Y-%m-%d %H:%M:%S", tz="GMT")
DF$End <- as.POSIXct(DF$End, format="%Y-%m-%d %H:%M:%S", tz="GMT")
library(data.table)
DT <- data.table(DF)
DT[,c("Start_num","End_num"):=lapply(.SD,as.numeric),.SDcols=1:2]
fun <- function(s, e) {
  require(intervals)
  # open intervals on the real line, one per appointment
  mat <- cbind(s, e)
  inter <- Intervals(mat, closed = c(FALSE, FALSE), type = "R")
  # for each appointment, the indices of all appointments it overlaps (including itself)
  io <- interval_overlap(inter, inter)
  # drop the first table entry (appointments overlapping only themselves); a group of
  # k mutually overlapping appointments contributes k entries of length k, so divide by k
  tablengths <- table(sapply(io, length))[-1]
  sum(c(0, as.vector(tablengths / as.integer(names(tablengths)))))
}
#number of overlapping events per room and location
DT[,fun(Start_num,End_num),by=list(Location,Room)]
# Location Room V1
#1: LocationA RoomA 1
#2: LocationA RoomB 0
I didn't test this, especially not for speed.
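If you want the overlapping rows themselves rather than a count, the same interval_overlap() result can be reused; a sketch in the same untested spirit (whichOverlap is a made-up name, not part of the code above):
whichOverlap <- function(s, e) {
  require(intervals)
  inter <- Intervals(cbind(s, e), closed = c(FALSE, FALSE), type = "R")
  io <- interval_overlap(inter, inter)  # overlaps per appointment, including itself
  which(sapply(io, length) > 1L)        # keep appointments that overlap something else
}
# the conflicting appointments per room and location
DT[, .SD[whichOverlap(Start_num, End_num)], by = list(Location, Room)]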