Challenge: recoding a data.frame() — make it faster


借酒劲吻你 2021-02-04 06:43

Recoding is a common practice for survey data, but the most obvious routes take more time than they should.

The fastest code that accomplishes the same task with the pr

6 Answers

慢半拍i (original poster)

2021-02-04 06:54
A data.table answer for your consideration. It only uses setattr() from that package, which works on a data.frame and on the columns of a data.frame, so there is no need to convert to data.table.

The test data again :
```
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000)) 
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat) 
dat <- as.data.frame(dat) 
re.codes <- c("This","That","And","The","Other") 
```
Now change the class and set the levels of each column directly, by reference :
```
require(data.table)
system.time(for (i in 1:ncol(dat)) {
  setattr(dat[[i]],"levels",re.codes)
  setattr(dat[[i]],"class","factor")
})
# user  system elapsed 
#   0       0       0 

identical(dat, )
# [1] TRUE
```
Does 0.00 win? As you increase the size of the data, this method stays at 0.00.
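The reason the timing stays flat is that setattr() attaches attributes to the existing vector instead of building a new one. A minimal sketch of that, assuming data.table is installed; `x`, `before` and `after` are illustrative names that are not in the answer, and address() is data.table's helper for reading an object's memory address:

```r
library(data.table)

re.codes <- c("This", "That", "And", "The", "Other")
x <- rep(1:5, 2)                 # illustrative integer vector, not the test data

before <- address(x)             # memory address before the change
setattr(x, "levels", re.codes)   # attach the labels by reference
setattr(x, "class", "factor")    # mark the vector as a factor by reference
after <- address(x)              # memory address after the change

identical(before, after)         # same address means no copy was made
is.factor(x)                     # the vector is now treated as a factor
```

Because only two attributes are written per column, the cost depends on the number of columns and not on the number of rows.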

Ok, I admit, I changed the input data slightly so that every column is integer (in the question, a third of the columns are built from double values). Double columns have to be converted to integer first, because a factor is only valid on an integer vector, as mentioned in the other answers.
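To see why the integer type matters, here is a small base R sketch of what a factor is underneath: integer codes plus a "levels" attribute and a "factor" class. The vectors `f`, `g`, `h` and `codes_dbl` are illustrative and not part of the answer:

```r
re.codes <- c("This", "That", "And", "The", "Other")

# A factor built the usual way
f <- factor(c("This", "Other", "That"), levels = re.codes)
typeof(f)     # "integer": the underlying storage type
unclass(f)    # the integer codes, still carrying the "levels" attribute

# The same factor assembled by hand from integer codes
g <- structure(c(1L, 5L, 2L), levels = re.codes, class = "factor")
identical(f, g)

# Double codes need the as.integer() step before they can back a factor
codes_dbl <- c(1, 5, 2)
h <- structure(as.integer(codes_dbl), levels = re.codes, class = "factor")
identical(f, h)
```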

So, strictly with the input data in the question, and including the double to integer conversion :
```
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1,2,4,5,3),50000))             
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)               
dat <- as.data.frame(dat)               
re.codes <- c("This","That","And","The","Other")           

system.time(for (i in 1:ncol(dat)) {
  if (!is.integer(dat[[i]]))
      set(dat,j=i,value=as.integer(dat[[i]]))
  setattr(dat[[i]],"levels",re.codes)
  setattr(dat[[i]],"class","factor")
})
#  user  system elapsed
#  0.06    0.01    0.08      # on my slow netbook

identical(dat, )
# [1] TRUE
```
Note that set() works on a data.frame, too. You don't have to convert to data.table to use it.
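A stripped-down sketch of that conversion step on its own, assuming data.table is installed; the two-column `df` is a made-up example, not the test data:

```r
library(data.table)

# Illustrative plain data.frame with double columns
df <- data.frame(a = c(1, 2, 3), b = c(3, 2, 1))

# Replace each non-integer column with its integer version via set()
for (j in seq_along(df)) {
  if (!is.integer(df[[j]]))
    set(df, j = j, value = as.integer(df[[j]]))
}

class(df)                # the object is still a plain data.frame
sapply(df, is.integer)   # check the column types after conversion
```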

These are very small times, clearly, since it's only a small input dataset :
```
dim(dat)
# [1] 250000     36 
object.size(dat)
# 68.7 Mb
```
Scaling up from this should reveal larger differences. Even at this size I think it should be (just about) measurably the fastest, though the difference is not large enough for anyone to care about.

The setattr function is also in the bit package, btw. So the 0.00 method can be done with either data.table or bit. To do the type conversion by reference (if required) either set or := (both in data.table) is needed, afaik.
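For completeness, a sketch of the := route for the type conversion. Unlike set(), this one assumes the data is already a data.table; `dt` and `cols` are illustrative names:

```r
library(data.table)

# Illustrative data.table with double columns
dt <- data.table(V1 = c(1, 2, 3), V2 = c(5, 4, 3))

cols <- names(dt)
# Convert every listed column to integer, assigning by reference with :=
dt[, (cols) := lapply(.SD, as.integer), .SDcols = cols]

sapply(dt, is.integer)   # check the column types after conversion
```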