fast subsetting in R

情书的邮戳 2021-02-03 14:00

I've got a data frame dat of size 30000 x 50. I also have a separate list that contains pointers to groupings of rows from this data frame, e.g.,

rows <- list(c(         


        
5 answers
  •  野趣味 (original poster)
    2021-02-03 14:26

    One of the main issues is the matching of row names: the default in [.data.frame is partial matching of row names, which you probably don't want, so you're better off using match to look up the row indices yourself. To speed it up even further, you can use fmatch from the fastmatch package. This is a minor modification that already gives some speedup:

    # naive
    > system.time(res1 <- lapply(rows,function(r) dat[r,]))
       user  system elapsed 
     69.207   5.545  74.787 
    
    # match
    > rn <- rownames(dat)
    > system.time(res1 <- lapply(rows,function(r) dat[match(r,rn),]))
       user  system elapsed 
     36.810  10.003  47.082 
    
    # fastmatch (fmatch is provided by the fastmatch package)
    > library(fastmatch)
    > rn <- rownames(dat)
    > system.time(res1 <- lapply(rows,function(r) dat[fmatch(r,rn),]))
       user  system elapsed 
     19.145   3.012  22.226 
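
    To illustrate the partial matching mentioned above, here is a minimal sketch on a made-up toy data frame (df and its row names are hypothetical, not the data from the question). It contrasts character indexing through [.data.frame with an exact lookup through match:

```r
# Hypothetical toy data frame; "al" is an unambiguous prefix of "alpha".
df <- data.frame(x = 1:3, row.names = c("alpha", "beta", "gamma"))

# [.data.frame matches character row indices partially,
# so the prefix "al" selects the row named "alpha".
partial <- df["al", , drop = FALSE]

# match() only accepts exact matches, so the prefix is not found.
exact <- match("al", rownames(df))

print(rownames(partial))
print(exact)
```

    Doing the lookup with match (or fmatch) first and then indexing by integer position skips the partial-matching step entirely.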
    

    You can get a further speedup by avoiding [ altogether (it is slow for data frames) and splitting the data frame with split instead. This only works if the entries of rows are non-overlapping and together cover all rows, so that each row maps to exactly one entry in rows.
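
    A rough sketch of the split idea, assuming rows is a list of character vectors of row names that are non-overlapping and cover every row of dat. The small dat and rows here are invented for illustration, and group is a hypothetical helper vector:

```r
# Invented example data: 6 rows, with row names used as keys.
dat <- data.frame(a = c(10, 20, 30, 40, 50, 60),
                  b = c(1, 2, 3, 4, 5, 6),
                  row.names = as.character(101:106))
rows <- list(c("101", "104"), c("102", "103", "106"), "105")

rn <- rownames(dat)

# group[k] holds the index of the entry of `rows` that contains row k of dat.
group <- integer(nrow(dat))
group[match(unlist(rows), rn)] <- rep(seq_along(rows), lengths(rows))

# One pass over the data frame instead of one `[` call per group.
res_split <- split(dat, group)

print(res_split[[2]])
```

    Note that split keeps the rows of each piece in the order they have in dat, not in the order listed in the corresponding entry of rows, so the result matches the lapply version only when those orders agree.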

    Depending on your actual data, you may be better off with matrices, whose subsetting operators are far faster because they are implemented natively.
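
    As a sketch of the matrix variant, assuming all columns are of one type (here numeric): as.matrix on a data frame with mixed column types would coerce everything to character, so this is not a drop-in replacement in general. The toy dat and rows are invented for illustration:

```r
# Invented all-numeric example data.
dat <- data.frame(a = c(1, 2, 3, 4),
                  b = c(5, 6, 7, 8),
                  row.names = c("r1", "r2", "r3", "r4"))
rows <- list(c("r1", "r3"), c("r2", "r4"))

# Convert once, then subset the matrix instead of the data frame.
m  <- as.matrix(dat)
rn <- rownames(m)

# drop = FALSE keeps each result a matrix even for single-row groups.
res_mat <- lapply(rows, function(r) m[match(r, rn), , drop = FALSE])

print(res_mat[[1]])
```

    If data frames are needed downstream, each piece can be converted back with as.data.frame, at the cost of some of the time saved.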
