Recoding is a common practice for survey data, but the most obvious routes take more time than they should.
The fastest code that accomplishes the same task with the pr
Making factors is expensive; only doing it once is comparable with the commands using structure, and in my opinion, preferable as you don't have to depend on how factors happen to be constructed.
rc <- factor(re.codes, levels=re.codes)
dat5 <- as.data.frame(lapply(dat, function(d) rc[d]))
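To make the indexing trick concrete, here is a minimal sketch on a tiny made-up data frame (toy and toy.rc are hypothetical names, not from the question). The factor is built once, and each column of integer codes is then used to index into it.

```r
# Hypothetical toy data: two columns of integer codes in 1..5
re.codes <- c("This", "That", "And", "The", "Other")
toy <- data.frame(a = c(1L, 3L, 5L), b = c(2L, 2L, 4L))

# Build the factor once; element k of rc carries the k-th label
rc <- factor(re.codes, levels = re.codes)

# Indexing rc by a column of codes returns a factor with all five levels kept
toy.rc <- as.data.frame(lapply(toy, function(d) rc[d]))
str(toy.rc)
```

Because subsetting a factor keeps its levels, every column ends up with the same five levels in the same order, even when some codes never occur in that column.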
EDIT 2: Interestingly, this seems to be a case where lapply does speed things up. This for loop is substantially slower.
for(i in seq_along(dat)) {
dat[[i]] <- rc[dat[[i]]]
}
EDIT 1: You can also speed things up by being more precise with your types. Try any of the solutions (but especially your original one) creating your data as integers, as follows. For details, see a previous answer of mine here.
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000))
This is also a good idea because converting from floating point to integer, as all of the faster solutions here do, can give unexpected behavior; see this question.
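As a minimal sketch of the kind of surprise meant here (the value of x is made up for illustration): as.integer() truncates toward zero, so a double that prints as 3 but sits just below it picks up the wrong label.

```r
re.codes <- c("This", "That", "And", "The", "Other")

# Hypothetical code value: prints as 3 at default precision, but is slightly below 3
x <- 3 - 1e-9

truncated <- as.integer(x)         # truncation toward zero
rounded   <- as.integer(round(x))  # round to nearest first, then convert

re.codes[truncated]  # label picked after truncation
re.codes[rounded]    # label picked after rounding
```

If your codes come out of arithmetic rather than being typed in as integers, rounding before the conversion is the safer habit.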
Try this:
m <- as.matrix(dat)
dat <- data.frame( matrix( re.codes[m], nrow = nrow(m)))
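A small sketch of what those two lines do, on made-up data (toy and toy.m are hypothetical names). Indexing re.codes by the whole matrix gives one long character vector in column-major order, and matrix() folds it back into the original shape.

```r
re.codes <- c("This", "That", "And", "The", "Other")
toy <- data.frame(a = c(1L, 3L, 5L), b = c(2L, 2L, 4L))

m <- as.matrix(toy)
labels <- re.codes[m]  # character vector of length nrow * ncol, column by column

# stringsAsFactors = TRUE is spelled out here because the default is FALSE from R 4.0.0 on
toy.m <- data.frame(matrix(labels, nrow = nrow(m)), stringsAsFactors = TRUE)
str(toy.m)
```

Two things to watch with this route: the original column names are lost (you get X1, X2, ...), and the factor levels are derived from the labels present in each column rather than following the order of re.codes.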
Combining @DWin's answer, and my answer from Most efficient list to data.frame method?:
system.time({
dat3 <- list()
# define attributes once outside of loop
attrib <- list(class="factor", levels=re.codes)
for (i in names(dat)) { # loop over each column in 'dat'
dat3[[i]] <- as.integer(dat[[i]]) # convert column to integer
attributes(dat3[[i]]) <- attrib # assign factor attributes
}
# convert 'dat3' into a data.frame. We can do it like this because:
# 1) we know 'dat' and 'dat3' have the same number of rows and columns
# 2) we want 'dat3' to have the same colnames as 'dat'
# 3) we don't care if 'dat3' has different rownames than 'dat'
attributes(dat3) <- list(row.names=c(NA_integer_,nrow(dat)),
class="data.frame", names=names(dat))
})
identical(dat2, dat3) # 'dat2' is from @Dwin's answer
The help page for class() says that class<- is deprecated and to use as. methods. I haven't quite figured out why the earlier effort was reporting 0 observations when the data was obviously in the object, but this method results in a complete object:
system.time({
  dat2 <- vector(mode="list", length(dat))
  for (i in 1:length(dat)) {
    dat2[[i]] <- dat[[i]]
    storage.mode(dat2[[i]]) <- "integer"
    attributes(dat2[[i]]) <- list(class="factor", levels=re.codes)
  }
  names(dat2) <- names(dat)
  dat2 <- as.data.frame(dat2)
})
#--------------------------
user system elapsed
0.266 0.290 0.560
> str(dat2)
'data.frame': 250000 obs. of 36 variables:
$ V1 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
$ V2 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
$ V3 : Factor w/ 5 levels "This","That",..: 1 2 4 5 3 1 2 4 5 3 ...
$ V4 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
$ V5 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
$ V6 : Factor w/ 5 levels "This","That",..: 1 2 4 5 3 1 2 4 5 3 ...
$ V7 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
$ V8 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
snipped
All 36 columns are there.
A data.table answer for your consideration. We're just using setattr() from it, which works on data.frame, and columns of data.frame. No need to convert to data.table.
The test data again :
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000))
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)
dat <- as.data.frame(dat)
re.codes <- c("This","That","And","The","Other")
Now change the class and set the levels of each column directly, by reference :
require(data.table)
system.time(for (i in 1:ncol(dat)) {
  setattr(dat[[i]],"levels",re.codes)
  setattr(dat[[i]],"class","factor")
})
# user system elapsed
# 0 0 0
identical(dat, <result in question>)
# [1] TRUE
Does 0.00 win? As you increase the size of the data, this method stays at 0.00.
Ok, I admit, I changed the input data slightly to be integer for all columns (the question has double input data in a third of the columns). Those double columns have to be converted to integer because factor is only valid for integer vectors, as mentioned in the other answers.
So, strictly with the input data in the question, and including the double to integer conversion :
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1,2,4,5,3),50000))
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)
dat <- as.data.frame(dat)
re.codes <- c("This","That","And","The","Other")
system.time(for (i in 1:ncol(dat)) {
if (!is.integer(dat[[i]]))
set(dat,j=i,value=as.integer(dat[[i]]))
setattr(dat[[i]],"levels",re.codes)
setattr(dat[[i]],"class","factor")
})
# user system elapsed
# 0.06 0.01 0.08 # on my slow netbook
identical(dat, <result in question>)
# [1] TRUE
Note that set works on data.frame, too. You don't have to convert to data.table to use it.
These are very small times, clearly, since it's only a small input dataset :
dim(dat)
# [1] 250000 36
object.size(dat)
# 68.7 Mb
Scaling up from this should reveal larger differences. But even so I think it should be (just about) measurably fastest. At this size, though, it's not a difference that anyone would mind about.
The setattr function is also in the bit package, btw. So the 0.00 method can be done with either data.table or bit. To do the type conversion by reference (if required) either set or := (both in data.table) is needed, afaik.
My computer is obviously much slower, but structure is a pretty fast way to do this:
> system.time({
+ dat1 <- dat
+ for(x in 1:ncol(dat)) {
+ dat1[,x] <- factor(dat1[,x], labels=re.codes)
+ }
+ })
user system elapsed
11.965 3.172 15.164
>
> system.time({
+ m <- as.matrix(dat)
+ dat2 <- data.frame( matrix( re.codes[m], nrow = nrow(m)))
+ })
user system elapsed
2.100 0.516 2.621
>
> system.time(dat3 <- data.frame(lapply(dat, structure, class='factor', levels=re.codes)))
user system elapsed
0.484 0.332 0.820
# dat1 and dat2 don't compare as equal because the levels get re-ordered
> all.equal(dat1, dat2)
> all.equal(dat1, dat3)
[1] TRUE
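For anyone who wants to check the structure route without the full-size data, here is a minimal sketch on a made-up data frame (toy, via.structure and via.factor are hypothetical names). It assumes the columns are already integer, since, as noted in the other answers, a factor is only valid on integer vectors.

```r
re.codes <- c("This", "That", "And", "The", "Other")
toy <- data.frame(a = c(1L, 3L, 5L), b = c(2L, 2L, 4L))

# Stamp the factor attributes straight onto the integer codes
via.structure <- data.frame(lapply(toy, structure, class = "factor", levels = re.codes))

# Reference result built the slow way with factor()
via.factor <- data.frame(lapply(toy, factor, levels = 1:5, labels = re.codes))

all.equal(via.structure, via.factor)
```

The speed comes from skipping the work factor() does to match values against levels; the trade-off is that nothing checks whether the codes actually fall within 1..5.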