Recoding is a common practice for survey data, but the most obvious routes take more time than they should.
The fastest code that accomplishes the same task with the pr
A data.table answer for your consideration. We're just using setattr() from it, which works on data.frame, and on columns of data.frame. No need to convert to data.table.
The test data again:
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000))
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)
dat <- as.data.frame(dat)
re.codes <- c("This","That","And","The","Other")
Now change the class and set the levels of each column directly, by reference:
require(data.table)
system.time(for (i in 1:ncol(dat)) {
setattr(dat[[i]],"levels",re.codes)
setattr(dat[[i]],"class","factor")
})
# user system elapsed
# 0 0 0
identical(dat, )
# [1] TRUE
Does 0.00 win? As you increase the size of the data, this method stays at 0.00.
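To make the by-reference behaviour concrete, here is a minimal sketch on a tiny data.frame. The object name df, its column a and the level labels are made up for illustration; only setattr() comes from data.table. Note that nothing is assigned back to df, yet the column changes.

```r
library(data.table)

# Hypothetical toy data: integer codes 1..3, repeated twice
df <- data.frame(a = rep(c(1L, 2L, 3L), 2))

# Set the attributes on the column in place; no `df$a <- ...` assignment
setattr(df[["a"]], "levels", c("low", "mid", "high"))
setattr(df[["a"]], "class", "factor")

is.factor(df$a)      # the column inside df is now a factor
as.character(df$a)   # the integer codes are shown through their labels
```

Because only two attributes are attached and the integer codes are left untouched, the work done should not depend on the number of rows, which would be consistent with the timing staying at 0.00.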
Ok, I admit, I changed the input data slightly to be integer for all columns (the question has double input data in a third of the columns). Those double columns have to be converted to integer because factor is only valid for integer vectors, as mentioned in the other answers.
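The point about integer storage can be checked in base R alone. The sketch below is only an illustration, with the variable names x, y, same and res chosen here: it builds a factor by hand from integer codes, compares it with the result of factor(), and then tries the same steps on a double vector, where the class assignment is expected to fail.

```r
re.codes <- c("This", "That", "And", "The", "Other")

# Integer codes plus "levels" and "class" attributes make a factor
x <- c(1L, 2L, 4L, 5L, 3L)
attr(x, "levels") <- re.codes
class(x) <- "factor"

# Compare with the conventional constructor
same <- identical(x, factor(c(1, 2, 4, 5, 3), levels = 1:5, labels = re.codes))

# The same steps on a double vector: R refuses to add class "factor"
y <- c(1, 2, 4, 5, 3)
attr(y, "levels") <- re.codes
res <- tryCatch({ class(y) <- "factor"; "ok" }, error = function(e) "error")
```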
So, strictly with the input data in the question, and including the double to integer conversion:
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1,2,4,5,3),50000))
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)
dat <- as.data.frame(dat)
re.codes <- c("This","That","And","The","Other")
system.time(for (i in 1:ncol(dat)) {
if (!is.integer(dat[[i]]))
set(dat,j=i,value=as.integer(dat[[i]]))
setattr(dat[[i]],"levels",re.codes)
setattr(dat[[i]],"class","factor")
})
# user system elapsed
# 0.06 0.01 0.08 # on my slow netbook
identical(dat, )
# [1] TRUE
Note that set works on data.frame, too. You don't have to convert to data.table to use it.
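As a small sketch of set() on a plain data.frame (the toy object df and its columns a and b are assumptions for illustration), the loop below converts only the non-integer columns and leaves the object a data.frame throughout:

```r
library(data.table)

# Hypothetical toy data: a is double, b is already integer
df <- data.frame(a = c(1, 2, 3), b = c(4L, 5L, 6L))

for (j in seq_along(df)) {
  # Replace the column in place only when a conversion is needed
  if (!is.integer(df[[j]])) set(df, j = j, value = as.integer(df[[j]]))
}

sapply(df, class)   # inspect the column types after the loop
class(df)           # still a plain data.frame
```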
These are very small times, clearly, since it's only a small input dataset:
dim(dat)
# [1] 250000 36
object.size(dat)
# 68.7 Mb
Scaling up from this should reveal larger differences. Even at this size, though, I think this method should be (just about) measurably the fastest, although the difference isn't one that anyone would mind about.
The setattr function is also in the bit package, btw. So the 0.00 method can be done with either data.table or bit. To do the type conversion by reference (if required), either set or := (both in data.table) is needed, afaik.
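For completeness, a sketch of the := route. Unlike set, this one does need a data.table; the object DT, its columns and the helper variable cols are made up for illustration.

```r
library(data.table)

# Hypothetical toy data.table with two double columns
DT <- data.table(a = c(1, 2, 3), b = c(5, 4, 3))

# Names of the columns that are not yet integer
cols <- names(DT)[!sapply(DT, is.integer)]

# Convert those columns in place, without copying the whole table
DT[, (cols) := lapply(.SD, as.integer), .SDcols = cols]

sapply(DT, class)   # inspect the column types after the update
```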