fast subsetting in R

情书的邮戳 2021-02-03 14:00

I've got a dataframe dat of size 30000 x 50. I also have a separate list that contains pointers to groupings of rows from this dataframe, e.g.,

rows <- list(c("34", "36", "39"), c("45", "46"))

What is a fast way of subsetting dat by these groups of row names, i.e. faster than

lapply(rows, function(r) dat[r, ])

5 Answers
  • 2021-02-03 14:24

    Update

    My original post started with this erroneous statement:

    The problem with indexing via rownames and colnames is that you are running a vector/linear scan for each element, e.g. you are hunting through each row to see which is named "36", then starting from the beginning to do it again for "34".

    Simon pointed out in the comments here that R apparently uses a hash table for indexing. Sorry for the mistake.

    Original Answer

    Note that the suggestions in this answer assume that you have non-overlapping subsets of data.

    If you want to keep your list-lookup strategy, I'd suggest storing the actual row indices instead of string names, as sketched below.
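
    A minimal sketch of that idea (assuming rows holds character row names, as in the question): resolve each group to integer positions once, then subset by position from then on.

    # one-time conversion from row names to row numbers
    rows_idx <- lapply(rows, function(r) match(r, rownames(dat)))
    # integer subsetting avoids repeated name lookups
    res <- lapply(rows_idx, function(i) dat[i, ])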

    An alternative is to store your "group" information as another column of your data.frame, then split your data.frame on that group. E.g., let's say your recoded data.frame looks like this:

    dat <- data.frame(a=sample(100, 10),
                      b=rnorm(10),
                      group=sample(c('a', 'b', 'c'), 10, replace=TRUE))
    

    You could then do:

    split(dat, dat$group)
    $a
       a           b group
    2 66 -0.08721261     a
    9 62 -1.34114792     a
    
    $b
        a          b group
    1  32  0.9719442     b
    5  79 -1.0204179     b
    6  83 -1.7645829     b
    7  73  0.4261097     b
    10 44 -0.1160913     b
    
    $c
       a          b group
    3 77  0.2313654     c
    4 74 -0.8637770     c
    8 29  1.0046095     c
    
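
    Each element of that list is itself a data.frame, so you can work over the splits directly; for instance (colMeans is just an illustrative summary here):

    lapply(split(dat, dat$group), function(d) colMeans(d[c("a", "b")]))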

    Or, depending on what you really want to do with your "splits", you can convert your data.frame to a data.table and set its key to your new group column:

    library(data.table)
    dat <- data.table(dat, key="group")
    

    Now do your list lookup -- which will give you the same result as the split above:

     x <- lapply(unique(dat$group), function(g) dat[J(g),])
    

    But you probably want to "work over your splits", and you can do that inline, e.g.:

    ans <- dat[, {
      ## do some code over the data in each split
      ## and return a list of results, eg:
      list(nrow=length(a), mean.a=mean(a), mean.b=mean(b))
    }, by="group"]
    
    ans
         group nrow mean.a     mean.b
    [1,]     a    2   64.0 -0.7141803
    [2,]     b    5   62.2 -0.3006076
    [3,]     c    3   60.0  0.1240660
    

    You can do the last step in "a similar fashion" with plyr, eg:

    library(plyr)
    ddply(dat, "group", summarize, nrow=length(a), mean.a=mean(a),
          mean.b=mean(b))
      group nrow mean.a     mean.b
    1     a    2   64.0 -0.7141803
    2     b    5   62.2 -0.3006076
    3     c    3   60.0  0.1240660
    

    But since you mention your dataset is quite large, I think you'd like the speed boost data.table will provide.
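
    If you want to sanity-check that claim on data shaped like yours, here is a rough timing sketch (the sizes come from the question; big is a name introduced for this example, and timings will vary by machine):

    library(data.table)
    big <- data.table(a=runif(30000), g=sample(15000, 30000, replace=TRUE), key="g")
    system.time(big[, list(mean.a=mean(a)), by="g"])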

  • 2021-02-03 14:25

    Here's one attempt at a speedup. It hinges on the fact that it is faster to look up a row index than a row name, so it tries to build a mapping from row name to row number in dat.

    First create some data of the same size as yours and assign some numeric rownames:

    > dat <- data.frame(matrix(runif(30000*50),ncol=50))
    > rownames(dat) <- as.character(sample.int(nrow(dat)))
    > rownames(dat)[1:5]
    [1] "21889" "3050"  "22570" "28140" "9576" 
    

    Now generate a random rows list with 15000 elements, each containing up to 50 random numbers drawn from 1 to 30000 (these are row *names*, in this case):

    # 15000 groups of up to 50 rows each
    > rows <- sapply(1:15000, function(i) as.character(sample.int(30000,size=sample.int(50,size=1))))
    

    For timing purposes, try the method in your question (ouch!):

    # method 1
    > system.time((res1 <- lapply(rows,function(r) dat[r,])))
       user  system elapsed 
    182.306   0.877 188.362 
    

    Now, try to make a mapping from row name to row number. map[i] should give the row number with name i.

    FIRST, if your row names are a permutation of 1:nrow(dat), you're in luck! All you have to do is sort the row names and return the indices:

    > map <- sort(as.numeric(rownames(dat)), index.return=T)$ix
    # NOTE: map[ as.numeric(rowname) ] -> rownumber into dat for that rowname.
    

    Now look up row indices instead of row names:

    > system.time((res2 <- lapply(rows,function(r) dat[map[as.numeric(r)],])))
       user  system elapsed
     32.424   0.060  33.050
    

    Check we didn't screw anything up. Since res1 and res2 are lists of data frames, compare the row names element-wise (matching row names is sufficient because row names are unique in R):

    > all(mapply(function(a, b) identical(rownames(a), rownames(b)), res1, res2))
    [1] TRUE
    

    So, a ~6x speedup. Still not amazing though...

    SECOND, if you're unlucky and your row names are not at all related to nrow(dat), you can try the following, but only if max(as.numeric(rownames(dat))) is not too much bigger than nrow(dat). It builds map so that map[rowname] gives the row number, but since the row names are no longer necessarily contiguous, there can be heaps of gaps in map, which wastes a bit of memory:

    map <- rep(-1,max(as.numeric(rownames(dat))))
    obj <- sort(as.numeric(rownames(dat)), index.return=T)
    map[obj$x] <- obj$ix
    

    Then use map as before (dat[map[as.numeric(r)], ]).
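
    An alternative sketch that avoids the gaps entirely, assuming only that the row names are unique: store the mapping as a named integer vector and index it by name.

    map2 <- setNames(seq_len(nrow(dat)), rownames(dat))  # row name -> row number
    res3 <- lapply(rows, function(r) dat[map2[r], ])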

  • 2021-02-03 14:26

    One of the main issues is the matching of row names -- the default in [.data.frame is partial matching of row names, and you probably don't want that, so you're better off with match. To speed it up even further you can use fmatch from the fastmatch package. This is a minor modification with some speedup:

    # naive
    > system.time(res1 <- lapply(rows,function(r) dat[r,]))
       user  system elapsed 
     69.207   5.545  74.787 
    
    # match
    > rn <- rownames(dat)
    > system.time(res1 <- lapply(rows,function(r) dat[match(r,rn),]))
       user  system elapsed 
     36.810  10.003  47.082 
    
    # fastmatch
    > library(fastmatch)
    > rn <- rownames(dat)
    > system.time(res1 <- lapply(rows,function(r) dat[fmatch(r,rn),]))
       user  system elapsed 
     19.145   3.012  22.226 
    

    You can get a further speedup by not using [ (it is slow for data frames) but splitting the data frame with split, provided your rows are non-overlapping and cover all rows (so that each row of dat maps to exactly one entry in rows); see the sketch below.
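
    A minimal sketch, assuming the groups in rows are non-overlapping and together cover every row of dat:

    grp <- integer(nrow(dat))  # group id for each row of dat
    for (i in seq_along(rows)) grp[match(rows[[i]], rownames(dat))] <- i
    res <- split(dat, grp)     # one data.frame per group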

    Depending on your actual data, you may be better off with matrices, which have by far faster subsetting operators since they are native; a sketch follows.
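
    For example (assuming all 50 columns are numeric, so the matrix conversion is lossless):

    m <- as.matrix(dat)  # the matrix keeps dat's row names
    system.time(res_m <- lapply(rows, function(r) m[r, , drop=FALSE]))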

  • 2021-02-03 14:40

    I agree with mathematical coffee: I too get fast times for this.

    I don't know if it fits your use case, but by unlisting rows into a vector and then converting to numeric you can get a speed boost. (Note this flattens the grouping, so each lookup below returns a single row.)

    dat <- data.frame(matrix(rnorm(30000*50), 30000, 50 ))
    rows <- as.numeric(unlist(list(c("34", "36", "39"), c("45", "46"))))
    system.time(lapply(rows, function(r) {dat[r, ]}))
    
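
    For a baseline comparison (a sketch; rows_chr is a name introduced here), time the same lookups with character row names -- the gap grows with the number of groups:

    rows_chr <- as.character(rows)
    system.time(lapply(rows_chr, function(r) dat[r, ]))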

    EDIT: If the original row names carry information, keep them in a column and renumber the rows, so that numeric indices and row names coincide:

    dat$observ <- rownames(dat)   # preserve the original row names in a column
    rownames(dat) <- 1:nrow(dat)  # row name now equals row number
    
  • 2021-02-03 14:41

    You could try this modification:

    system.time(lapply(rows, function(r) dat[rownames(dat) %in% r, ]))
    
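
    One caveat: %in% returns matching rows in the order they appear in dat and collapses any duplicates in r, so if the order within each group matters, match preserves it (a sketch):

    lapply(rows, function(r) dat[match(r, rownames(dat)), ])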