fast subsetting in R

情书的邮戳 2021-02-03 14:00

I've got a dataframe dat of size 30000 x 50. I also have a separate list that contains pointers to groupings of rows from this dataframe, e.g.,

rows <- list(c("34", "36", "39"), c("45", "46"))

What is a fast way of subsetting dat by these groups of row names, i.e. faster than

lapply(rows, function(r) dat[r, ])

5 Answers
  • 2021-02-03 14:24

    Update

    My original post started with this erroneous statement:

    The problem with indexing via rownames and colnames is that you are running a vector/linear scan for each element, e.g. you are hunting through each row to see which is named "36", then starting from the beginning to do it again for "34".

    Simon pointed out in the comments here that R apparently uses a hash table for indexing. Sorry for the mistake.

    Original Answer

    Note that the suggestions in this answer assume that you have non-overlapping subsets of data.

    If you want to keep your list-lookup strategy, I'd suggest storing the actual row indices instead of string names, as sketched below.
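
    A minimal sketch of that idea (assuming rows holds character row names, as in the question): resolve each group to integer positions once, then subset by position from then on.

    # one-time conversion from row names to row numbers
    rows_idx <- lapply(rows, function(r) match(r, rownames(dat)))
    # integer subsetting avoids repeated name lookups
    res <- lapply(rows_idx, function(i) dat[i, ])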

    An alternative is to store your "group" information as another column of your data.frame, then split your data.frame on that group. E.g., let's say your recoded data.frame looks like this:

    dat <- data.frame(a=sample(100, 10),
                      b=rnorm(10),
                      group=sample(c('a', 'b', 'c'), 10, replace=TRUE))
    

    You could then do:

    split(dat, dat$group)
    $a
       a           b group
    2 66 -0.08721261     a
    9 62 -1.34114792     a
    
    $b
        a          b group
    1  32  0.9719442     b
    5  79 -1.0204179     b
    6  83 -1.7645829     b
    7  73  0.4261097     b
    10 44 -0.1160913     b
    
    $c
       a          b group
    3 77  0.2313654     c
    4 74 -0.8637770     c
    8 29  1.0046095     c
    
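
    Each element of that list is itself a data.frame, so you can work over the splits directly; for instance (colMeans is just an illustrative summary here):

    lapply(split(dat, dat$group), function(d) colMeans(d[c("a", "b")]))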

    Or, depending on what you really want to do with your "splits", you can convert your data.frame to a data.table and set its key to your new group column:

    library(data.table)
    dat <- data.table(dat, key="group")
    

    Now do your list lookup -- which will give you the same result as the split above:

     x <- lapply(unique(dat$group), function(g) dat[J(g),])
    

    But you probably want to "work over your splits", and you can do that inline, e.g.:

    ans <- dat[, {
      ## do some code over the data in each split
      ## and return a list of results, eg:
      list(nrow=length(a), mean.a=mean(a), mean.b=mean(b))
    }, by="group"]
    
    ans
         group nrow mean.a     mean.b
    [1,]     a    2   64.0 -0.7141803
    [2,]     b    5   62.2 -0.3006076
    [3,]     c    3   60.0  0.1240660
    

    You can do the last step in "a similar fashion" with plyr, eg:

    library(plyr)
    ddply(dat, "group", summarize, nrow=length(a), mean.a=mean(a),
          mean.b=mean(b))
      group nrow mean.a     mean.b
    1     a    2   64.0 -0.7141803
    2     b    5   62.2 -0.3006076
    3     c    3   60.0  0.1240660
    

    But since you mention your dataset is quite large, I think you'd like the speed boost data.table will provide.
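
    If you want to sanity-check that claim on data shaped like yours, here is a rough timing sketch (the sizes come from the question; big is a name introduced for this example, and timings will vary by machine):

    library(data.table)
    big <- data.table(a=runif(30000), g=sample(15000, 30000, replace=TRUE), key="g")
    system.time(big[, list(mean.a=mean(a)), by="g"])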

  • 2021-02-03 14:25

    Here's one attempt at a speedup. It hinges on the fact that it is faster to look up a row index than a row name, so it tries to build a mapping from row name to row number in dat.

    First create some data of the same size as yours and assign some numeric rownames:

    > dat <- data.frame(matrix(runif(30000*50),ncol=50))
    > rownames(dat) <- as.character(sample.int(nrow(dat)))
    > rownames(dat)[1:5]
    [1] "21889" "3050"  "22570" "28140" "9576" 
    

    Now generate a random rows list with 15000 elements, each containing up to 50 random numbers drawn from 1 to 30000 (these are row *names*, in this case):

    # 15000 groups of up to 50 rows each
    > rows <- sapply(1:15000, function(i) as.character(sample.int(30000,size=sample.int(50,size=1))))
    

    For timing purposes, try the method in your question (ouch!):

    # method 1
    > system.time((res1 <- lapply(rows,function(r) dat[r,])))
       user  system elapsed 
    182.306   0.877 188.362 
    

    Now, try to make a mapping from row name to row number. map[i] should give the row number with name i.

    FIRST, if your row names are a permutation of 1:nrow(dat), you're in luck! All you have to do is sort the row names and return the indices:

    > map <- sort(as.numeric(rownames(dat)), index.return=T)$ix
    # NOTE: map[ as.numeric(rowname) ] -> rownumber into dat for that rowname.
    

    Now look up row indices instead of row names:

    > system.time((res2 <- lapply(rows,function(r) dat[map[as.numeric(r)],])))
       user  system elapsed
     32.424   0.060  33.050
    

    Check we didn't screw anything up. Since res1 and res2 are lists of data frames, compare the row names element-wise (matching row names is sufficient because row names are unique in R):

    > all(mapply(function(a, b) identical(rownames(a), rownames(b)), res1, res2))
    [1] TRUE
    

    So, a ~6x speedup. Still not amazing though...

    SECOND, if you're unlucky and your row names are not at all related to nrow(dat), you can try the following, but only if max(as.numeric(rownames(dat))) is not too much bigger than nrow(dat). It builds map so that map[rowname] gives the row number, but since the row names are no longer necessarily contiguous, there can be heaps of gaps in map, which wastes a bit of memory:

    map <- rep(-1,max(as.numeric(rownames(dat))))
    obj <- sort(as.numeric(rownames(dat)), index.return=T)
    map[obj$x] <- obj$ix
    

    Then use map as before (dat[map[as.numeric(r)], ]).
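
    An alternative sketch that avoids the gaps entirely, assuming only that the row names are unique: store the mapping as a named integer vector and index it by name.

    map2 <- setNames(seq_len(nrow(dat)), rownames(dat))  # row name -> row number
    res3 <- lapply(rows, function(r) dat[map2[r], ])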

  • 2021-02-03 14:26

    One of the main issues is the matching of row names -- the default in [.data.frame is partial matching of row names, and you probably don't want that, so you're better off with match. To speed it up even further you can use fmatch from the fastmatch package. This is a minor modification with some speedup:

    # naive
    > system.time(res1 <- lapply(rows,function(r) dat[r,]))
       user  system elapsed 
     69.207   5.545  74.787 
    
    # match
    > rn <- rownames(dat)
    > system.time(res1 <- lapply(rows,function(r) dat[match(r,rn),]))
       user  system elapsed 
     36.810  10.003  47.082 
    
    # fastmatch
    > library(fastmatch)
    > rn <- rownames(dat)
    > system.time(res1 <- lapply(rows,function(r) dat[fmatch(r,rn),]))
       user  system elapsed 
     19.145   3.012  22.226 
    

    You can get a further speedup by not using [ (it is slow for data frames) but splitting the data frame with split, provided your rows are non-overlapping and cover all rows (so that each row of dat maps to exactly one entry in rows); see the sketch below.
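
    A minimal sketch, assuming the groups in rows are non-overlapping and together cover every row of dat:

    grp <- integer(nrow(dat))  # group id for each row of dat
    for (i in seq_along(rows)) grp[match(rows[[i]], rownames(dat))] <- i
    res <- split(dat, grp)     # one data.frame per group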

    Depending on your actual data, you may be better off with matrices, which have by far faster subsetting operators since they are native; a sketch follows.
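
    For example (assuming all 50 columns are numeric, so the matrix conversion is lossless):

    m <- as.matrix(dat)  # the matrix keeps dat's row names
    system.time(res_m <- lapply(rows, function(r) m[r, , drop=FALSE]))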

  • 2021-02-03 14:40

    I agree with mathematical coffee: I too get fast times for this.

    I don't know if it fits your use case, but by unlisting rows into a vector and then converting to numeric you can get a speed boost. (Note this flattens the grouping, so each lookup below returns a single row.)

    dat <- data.frame(matrix(rnorm(30000*50), 30000, 50 ))
    rows <- as.numeric(unlist(list(c("34", "36", "39"), c("45", "46"))))
    system.time(lapply(rows, function(r) {dat[r, ]}))
    
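
    For a baseline comparison (a sketch; rows_chr is a name introduced here), time the same lookups with character row names -- the gap grows with the number of groups:

    rows_chr <- as.character(rows)
    system.time(lapply(rows_chr, function(r) dat[r, ]))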

    EDIT: If the original row names carry information, keep them in a column and renumber the rows, so that numeric indices and row names coincide:

    dat$observ <- rownames(dat)   # preserve the original row names in a column
    rownames(dat) <- 1:nrow(dat)  # row name now equals row number
    
  • 2021-02-03 14:41

    You could try this modification:

    system.time(lapply(rows, function(r) dat[rownames(dat) %in% r, ]))
    
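
    One caveat: %in% returns matching rows in the order they appear in dat and collapses any duplicates in r, so if the order within each group matters, match preserves it (a sketch):

    lapply(rows, function(r) dat[match(r, rownames(dat)), ])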