I've got a dataframe dat of size 30000 x 50. I also have a separate list that points to groupings of rows from this dataframe, e.g.,
rows <- list(c(
Here's one attempt at a speedup - it hinges on the fact that it is faster to look up a row index than to look up a row name, and so tries to make a mapping of rowname to rownumber in dat.
First create some data of the same size as yours and assign some numeric rownames:
> dat <- data.frame(matrix(runif(30000*50),ncol=50))
> rownames(dat) <- as.character(sample.int(nrow(dat)))
> rownames(dat)[1:5]
[1] "21889" "3050" "22570" "28140" "9576"
Now generate a random rows list with 15000 elements, each holding up to 50 random numbers from 1 to 30000 (these are row *names* in this case):
# 15000 groups of up to 50 rows each
> rows <- lapply(1:15000, function(i) as.character(sample.int(30000,size=sample.int(50,size=1))))
For timing purposes, try the method in your question (ouch!):
# method 1
> system.time((res1 <- lapply(rows,function(r) dat[r,])))
user system elapsed
182.306 0.877 188.362
Now, try to make a mapping from row name to row number. map[i] should give the row number of the row with name i.
FIRST, if your row names are a permutation of 1:nrow(dat), you're in luck! All you have to do is sort the rownames and return the indices:
> map <- sort(as.numeric(rownames(dat)), index.return=TRUE)$ix
# NOTE: map[ as.numeric(rowname) ] -> rownumber into dat for that rowname.
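To see why the sorted indices act as a name-to-number map, here is a minimal sketch on a toy data frame. The names toy, toy_map, by_number and by_name are made up for illustration and are not part of the session above.

```r
# Toy data frame whose row names are a permutation of 1:5
toy <- data.frame(x = c(10, 20, 30, 40, 50))
rownames(toy) <- c("3", "1", "5", "2", "4")

# toy_map[i] is the row number of the row named i
toy_map <- sort(as.numeric(rownames(toy)), index.return = TRUE)$ix

# Look up the rows named "5" and "2" by position instead of by name
r <- c("5", "2")
by_number <- toy[toy_map[as.numeric(r)], , drop = FALSE]
by_name   <- toy[r, , drop = FALSE]
```

Both lookups should return the same two rows, in the same order; only the way the rows are located differs.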
Now look up row indices instead of row names:
> system.time((res2 <- lapply(rows,function(r) dat[map[as.numeric(r)],])))
user system elapsed
32.424 0.060 33.050
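Check we didn't screw anything up (it is sufficient to compare the rownames, since rownames are unique in R). Note that res1 and res2 are lists of data frames, so the rownames have to be compared group by group; calling rownames() on the lists themselves returns NULL and the check would pass trivially:
> identical(lapply(res1, rownames), lapply(res2, rownames))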
[1] TRUE
So, a ~6x speedup. Still not amazing though...
SECOND, if you're unlucky and your rownames are not at all related to nrow(dat), you could try this, but only if max(as.numeric(rownames(dat))) is not too much bigger than nrow(dat). It basically makes map with map[rowname] giving the row number, but since the rownames are not necessarily contiguous any more there can be heaps of gaps in map, which wastes a bit of memory:
map <- rep(-1,max(as.numeric(rownames(dat))))
obj <- sort(as.numeric(rownames(dat)), index.return=T)
map[obj$x] <- obj$ix
Then use map as before (dat[map[as.numeric(r)],]).
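As a small illustration of the gapped map, here is a sketch on a toy data frame with non-contiguous row names (toy, toy_map, idx_map and idx_match are hypothetical names). It also shows base R's match() as an alternative that is not part of the approach above: it returns the same positions without allocating the gaps, and it does not require numeric row names. I have not timed it against the 30000 x 50 data.

```r
# Toy data frame with non-contiguous numeric row names
toy <- data.frame(x = c(10, 20, 30))
rownames(toy) <- c("7", "2", "11")

# Gapped map as described above: slot i holds the row number of the row
# named i, and -1 marks names that do not occur in the data frame
toy_map <- rep(-1, max(as.numeric(rownames(toy))))
obj <- sort(as.numeric(rownames(toy)), index.return = TRUE)
toy_map[obj$x] <- obj$ix

# Positions of the rows named "11" and "7" via the gapped map
r <- c("11", "7")
idx_map <- toy_map[as.numeric(r)]

# Alternative (an assumption, not from the text above): match() gives the
# positions of r within the row names directly
idx_match <- match(r, rownames(toy))

res <- toy[idx_map, , drop = FALSE]
```

One thing to watch with the -1 filler: a name that is missing from dat produces a negative index, which R treats as "drop this row" (or rejects when mixed with positive indices), so it may be worth checking any(idx_map < 0) before subsetting.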