I've got a dataframe dat of size 30000 x 50. I also have a separate list that contains pointers to groupings of rows from this dataframe, e.g.,
rows <- list(c("34", "36", "39"), c("45", "46"))
My original post started with this erroneous statement:
The problem with indexing via rownames and colnames is that you are running a vector/linear scan for each element, e.g. you are hunting through each row to see which is named "36", then starting from the beginning to do it again for "34".
Simon pointed out in the comments here that R apparently uses a hash table for indexing. Sorry for the mistake.
Note that the suggestions in this answer assume that you have non-overlapping subsets of data.
If you want to keep your list-lookup strategy, I'd suggest storing the actual row indices instead of string names.
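For example, a rough sketch of that conversion (idx and res are just illustrative names; this assumes rows holds character rownames as in your question):
idx <- lapply(rows, function(r) match(r, rownames(dat)))  # name -> integer position, done once
res <- lapply(idx, function(i) dat[i, ])                  # later lookups use positions only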
An alternative is to store your "group" information as another column in your data.frame, then split your data.frame on its group. For example, let's say your recoded data.frame looks like this:
dat <- data.frame(a=sample(100, 10),
b=rnorm(10),
group=sample(c('a', 'b', 'c'), 10, replace=TRUE))
You could then do:
split(dat, dat$group)
$a
a b group
2 66 -0.08721261 a
9 62 -1.34114792 a
$b
a b group
1 32 0.9719442 b
5 79 -1.0204179 b
6 83 -1.7645829 b
7 73 0.4261097 b
10 44 -0.1160913 b
$c
a b group
3 77 0.2313654 c
4 74 -0.8637770 c
8 29 1.0046095 c
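If all you need is to run some function over each chunk, you can lapply over that list; a small sketch (the summary computed here is only illustrative):
lapply(split(dat, dat$group), function(d) {
  c(nrow=nrow(d), mean.a=mean(d$a), mean.b=mean(d$b))
})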
Or, depending on what you really want to do with your "splits", you can convert your data.frame to a data.table and set its key to your new group column:
library(data.table)
dat <- data.table(dat, key="group")
Now do your list thing -- which will give you the same result as the split above:
x <- lapply(unique(dat$group), function(g) dat[J(g),])
But you probably want to "work over your splits", and you can do that inline, e.g.:
ans <- dat[, {
## do some code over the data in each split
## and return a list of results, eg:
list(nrow=length(a), mean.a=mean(a), mean.b=mean(b))
}, by="group"]
ans
group nrow mean.a mean.b
[1,] a 2 64.0 -0.7141803
[2,] b 5 62.2 -0.3006076
[3,] c 3 60.0 0.1240660
You can do the last step in "a similar fashion" with plyr, e.g.:
library(plyr)
ddply(dat, "group", summarize, nrow=length(a), mean.a=mean(a),
mean.b=mean(b))
group nrow mean.a mean.b
1 a 2 64.0 -0.7141803
2 b 5 62.2 -0.3006076
3 c 3 60.0 0.1240660
But since you mention your dataset is quite large, I think you'd like the speed boost data.table will provide.
Here's one attempt at a speedup - it hinges on the fact that it is faster to look up a row index than to look up a row name, and so tries to make a mapping of rowname to rownumber in dat.
First create some data of the same size as yours and assign some numeric rownames:
> dat <- data.frame(matrix(runif(30000*50),ncol=50))
> rownames(dat) <- as.character(sample.int(nrow(dat)))
> rownames(dat)[1:5]
[1] "21889" "3050" "22570" "28140" "9576"
Now generate a random rows list with 15000 elements, each containing up to 50 random numbers from 1 to 30000 (these being row *names* in this case):
# 15000 groups of up to 50 rows each
> rows <- sapply(1:15000, function(i) as.character(sample.int(30000,size=sample.int(50,size=1))))
For timing purposes, try the method in your question (ouch!):
# method 1
> system.time((res1 <- lapply(rows,function(r) dat[r,])))
user system elapsed
182.306 0.877 188.362
Now, try to make a mapping from row name to row number. map[i] should give the row number with name i.
FIRST if your row names are a permutation of 1:nrow(dat), you're in luck! All you have to do is sort the rownames, and return the indices:
> map <- sort(as.numeric(rownames(dat)), index.return=T)$ix
# NOTE: map[ as.numeric(rowname) ] -> rownumber into dat for that rowname.
Now look up row indices instead of row names:
> system.time((res2 <- lapply(rows,function(r) dat[map[as.numeric(r)],])))
user system elapsed
32.424 0.060 33.050
Check we didn't screw anything up (note it is sufficient to match the rownames since rownames are unique in R):
> all(mapply(function(a, b) all(rownames(a) == rownames(b)), res1, res2))
[1] TRUE
So, a ~6x speedup. Still not amazing though...
SECOND If you're unlucky and your rownames are not at all related to nrow(dat), you could try this, but only if max(as.numeric(rownames(dat))) is not too much bigger than nrow(dat). It basically makes map with map[rowname] giving the row number, but since the rownames are not necessarily contiguous any more there can be heaps of gaps in map, which wastes a bit of memory:
map <- rep(-1,max(as.numeric(rownames(dat))))
obj <- sort(as.numeric(rownames(dat)), index.return=T)
map[obj$x] <- obj$ix
Then use map as before (dat[map[as.numeric(r)], ]).
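Spelled out, the lookup through this gap-filled map is the same pattern as before (res3 is just an illustrative name):
res3 <- lapply(rows, function(r) dat[map[as.numeric(r)], ])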
One of the main issues is the matching of row names -- the default in [.data.frame is partial matching of row names and you probably don't want that, so you're better off with match. To speed it up even further you can use fmatch from fastmatch if you want. This is a minor modification with some speedup:
# naive
> system.time(res1 <- lapply(rows,function(r) dat[r,]))
user system elapsed
69.207 5.545 74.787
# match
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[match(r,rn),]))
user system elapsed
36.810 10.003 47.082
# fastmatch
> library(fastmatch)
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[fmatch(r,rn),]))
user system elapsed
19.145 3.012 22.226
You can get a further speed-up by not using [ (it is slow for data frames) but splitting the data frame (using split) if your rows are non-overlapping and cover all rows (and thus you can map each row to one entry in rows).
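A rough sketch of that idea (grp and res are illustrative names, and this assumes the groups in rows are non-overlapping and together cover every row of dat):
grp <- integer(nrow(dat))                     # group id for every row of dat
for (i in seq_along(rows))
  grp[match(rows[[i]], rownames(dat))] <- i   # tag each listed row with its group
res <- split(dat, grp)                        # one split instead of thousands of `[` calls
Note that within each piece the rows come back in dat's own order, not in the order they appear in rows[[i]].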
Depending on your actual data you may be better off with matrices, which have by far faster subsetting operators since they are native.
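For instance, a minimal sketch (m and res.m are illustrative names); this only pays off when all columns share one type, here numeric, since as.matrix() would otherwise coerce everything to character:
m <- as.matrix(dat)                        # rownames carry over from dat
res.m <- lapply(rows, function(r) m[r, ])  # character rownames still work for matrix subsetting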
I agree with mathematical.coffee -- I too get fast times for this.
I don't know if it's possible in your case, but by unlisting to a vector and then converting to numeric you can get a speed boost.
dat <- data.frame(matrix(rnorm(30000*50), 30000, 50 ))
rows <- as.numeric(unlist(list(c("34", "36", "39"), c("45", "46"))))
system.time(lapply(rows, function(r) {dat[r, ]}))
EDIT: If you want to keep the original row names around, you can move them into a column and renumber the rows first:
dat$observ <- rownames(dat)
rownames(dat) <- 1:nrow(dat)
You could try this modification:
system.time(lapply(rows, function(r) {dat[ rownames(dat) %in% r, ]}))