fast subsetting in R

情书的邮戳 2021-02-03 14:00

I've got a data frame dat of size 30000 x 50. I also have a separate list that contains pointers to groupings of rows from this data frame, e.g.,

rows <- list(c(         


        
5 Answers
  •  -上瘾入骨i
    2021-02-03 14:25

    Here's one attempt at a speedup - it hinges on the fact that it is faster to look up a row index than to look up a row name, and so tries to make a mapping of rowname to rownumber in dat.

    First create some data of the same size as yours and assign some numeric rownames:

    > dat <- data.frame(matrix(runif(30000*50),ncol=50))
    > rownames(dat) <- as.character(sample.int(nrow(dat)))
    > rownames(dat)[1:5]
    [1] "21889" "3050"  "22570" "28140" "9576" 
    

    Now generate a random rows list with 15000 elements, each holding up to 50 random numbers from 1 to 30000 (these are row *names* in this case):

    # 15000 groups of up to 50 rows each
    > rows <- lapply(1:15000, function(i) as.character(sample.int(30000,size=sample.int(50,size=1))))
    

    For timing purposes, try the method in your question (ouch!):

    # method 1
    > system.time((res1 <- lapply(rows,function(r) dat[r,])))
       user  system elapsed 
    182.306   0.877 188.362 
    

    Now, try to make a mapping from row name to row number. map[i] should give the row number with name i.

    FIRST if your row names are a permutation of 1:nrow(dat) you're in luck! All you have to do is sort the rownames, and return the indices:

    > map <- sort(as.numeric(rownames(dat)), index.return=TRUE)$ix
    # NOTE: map[ as.numeric(rowname) ] -> rownumber into dat for that rowname.
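
    To see why the sort indices act as a name-to-position map, here is a minimal sketch on a five-row data frame. The name `toy` and its values are made up for illustration; only the `sort(..., index.return=TRUE)$ix` idiom comes from the step above.

    ```r
    # Toy data frame whose row names are a permutation of 1:5 (hypothetical data).
    toy <- data.frame(x = c(10, 20, 30, 40, 50))
    rownames(toy) <- c("3", "1", "4", "5", "2")

    # map[k] is the position in toy of the row named k.
    map <- sort(as.numeric(rownames(toy)), index.return = TRUE)$ix
    print(map)  # 2 5 1 3 4

    # Look up rows "5" and "1" by position instead of by name.
    r <- c("5", "1")
    by_name  <- toy[r, , drop = FALSE]
    by_index <- toy[map[as.numeric(r)], , drop = FALSE]

    # Both routes should select the same rows in the same order.
    stopifnot(identical(rownames(by_name), rownames(by_index)))
    stopifnot(identical(by_name$x, by_index$x))
    ```

    Because the sorted names are exactly 1:5, the k-th entry of the sort index is the position of the row named k, which is why no further bookkeeping is needed in the permutation case.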
    

    Now look up row indices instead of row names:

    > system.time((res2 <- lapply(rows,function(r) dat[map[as.numeric(r)],])))
       user  system elapsed
     32.424   0.060  33.050
    

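    Check we didn't screw anything up. Since res1 and res2 are lists of data frames, compare the rownames element by element (matching the rownames is sufficient since rownames are unique in R):

    > identical(lapply(res1, rownames), lapply(res2, rownames))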
    [1] TRUE
    

    So, a ~6x speedup (about 188 s down to 33 s elapsed). Still not amazing though...

    SECOND, if you're unlucky and your rownames are numeric but not a permutation of 1:nrow(dat), you could try this, but only if max(as.numeric(rownames(dat))) is not too much bigger than nrow(dat). It builds map so that map[rowname] gives the row number, but since the rownames are no longer necessarily contiguous, map can contain many unused slots, which wastes some memory:

    # One slot per possible row name; NA marks names that have no row
    map <- rep(NA_integer_, max(as.numeric(rownames(dat))))
    obj <- sort(as.numeric(rownames(dat)), index.return=TRUE)
    map[obj$x] <- obj$ix
    

    Then use map as before (dat[map[as.numeric(r)],]).
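
    To make the gap handling concrete, here is a small sketch with non-contiguous numeric row names. The data frame `toy` is made up for illustration. Unused slots are filled with `NA` in this sketch so that a name with no matching row shows up as an `NA` index instead of silently selecting something else. The cross-check against base R's `match()` is my addition, not part of the approach described above.

    ```r
    # Toy data frame with gappy numeric row names (hypothetical data).
    toy <- data.frame(x = c(10, 20, 30, 40))
    rownames(toy) <- c("7", "2", "9", "4")

    ids <- as.numeric(rownames(toy))

    # One slot per possible name up to max(ids); NA marks names with no row.
    map <- rep(NA_integer_, max(ids))
    obj <- sort(ids, index.return = TRUE)
    map[obj$x] <- obj$ix

    # Translate row names to row positions through the map.
    r <- c("9", "2")
    idx <- map[as.numeric(r)]

    # match() computes the same name-to-position lookup directly.
    stopifnot(all(idx == match(r, rownames(toy))))

    res <- toy[idx, , drop = FALSE]
    print(res)
    ```

    Here map has nine slots for four rows, which is the memory cost mentioned above; it only stays reasonable while the largest row name is not much bigger than the number of rows.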
