Challenge: recoding a data.frame() — make it faster

后端未结

关注

 6  1691

借酒劲吻你

Recoding is a common practice for survey data, but the most obvious routes take more time than they should.

The fastest code that accomplishes the same task with the pr

相关标签:

6条回答

不知归路

2021-02-04 06:45
Making factors is expensive; only doing it once is comparable with the commands using structure, and in my opinion, preferable as you don't have to depend on how factors happen to be constructed.
```
rc <- factor(re.codes, levels=re.codes)
dat5 <- as.data.frame(lapply(dat, function(d) rc[d]))
```
EDIT 2: Interestingly, this seems to be a case where lapply does speed things up. This for loop is substantially slower.
```
for(i in seq_along(dat)) {
  dat[[i]] <- rc[dat[[i]]]
}
```
EDIT 1: You can also speed things up by being more precise with your types. Try any of the solutions (but especially your original one) creating your data as integers, as follows. For details, see a previous answer of mine here.
```
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000))
```
This is also a good idea as converting to integers from floating points, as is being done in all of the faster solutions here, can give unexpected behavior, see this question.
0 讨论(0)
发布评论:

提交评论
- 加载中...
-上瘾入骨i

2021-02-04 06:46
Try this:
```
m <- as.matrix(dat)

dat <- data.frame( matrix( re.codes[m], nrow = nrow(m)))
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

醉梦人生

2021-02-04 06:50

Combining @DWin's answer, and my answer from Most efficient list to data.frame method?:

system.time({
  dat3 <- list()
  # define attributes once outside of loop
  attrib <- list(class="factor", levels=re.codes)
  for (i in names(dat)) {              # loop over each column in 'dat'
    dat3[[i]] <- as.integer(dat[[i]])  # convert column to integer
    attributes(dat3[[i]]) <- attrib    # assign factor attributes
  }
  # convert 'dat3' into a data.frame. We can do it like this because:
  # 1) we know 'dat' and 'dat3' have the same number of rows and columns
  # 2) we want 'dat3' to have the same colnames as 'dat'
  # 3) we don't care if 'dat3' has different rownames than 'dat'
  attributes(dat3) <- list(row.names=c(NA_integer_,nrow(dat)),
    class="data.frame", names=names(dat))
})
identical(dat2, dat3)  # 'dat2' is from @Dwin's answer

0 讨论(0)

一整个雨季

2021-02-04 06:52

The help page for class() says that class<- is deprecated and to use as. methods. I haven't quite figured out why the earlier effort was reporting 0 observations when the data was obviously in the object, but this method results in a complete object:

    system.time({ dat2 <- vector(mode="list", length(dat))
      for (i in 1:length(dat) ){ dat2[[i]] <- dat[[i]]
        storage.mode(dat2[[i]]) <- "integer"
               attributes(dat2[[i]]) <- list(class="factor", levels=re.codes)}
  names(dat2) <- names(dat)
  dat2 <- as.data.frame(dat2)})
#--------------------------  
  user  system elapsed 
  0.266   0.290   0.560 
> str(dat2)
'data.frame':   250000 obs. of  36 variables:
 $ V1 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
 $ V2 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
 $ V3 : Factor w/ 5 levels "This","That",..: 1 2 4 5 3 1 2 4 5 3 ...
 $ V4 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
 $ V5 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
 $ V6 : Factor w/ 5 levels "This","That",..: 1 2 4 5 3 1 2 4 5 3 ...
 $ V7 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
 $ V8 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
 snipped

All 36 columns are there.

0 讨论(0)

慢半拍i

2021-02-04 06:54
A data.table answer for your consideration. We're just using setattr() from it, which works on data.frame, and columns of data.frame. No need to convert to data.table.

The test data again :
```
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000)) 
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat) 
dat <- as.data.frame(dat) 
re.codes <- c("This","That","And","The","Other") 
```
Now change the class and set the levels of each column directly, by reference :
```
require(data.table)
system.time(for (i in 1:ncol(dat)) {
  setattr(dat[[i]],"levels",re.codes)
  setattr(dat[[i]],"class","factor")
}
# user  system elapsed 
#   0       0       0 

identical(dat, <result in question>)
# [1] TRUE
```
Does 0.00 win? As you increase the size of the data, this method stays at 0.00.

Ok, I admit, I changed the input data slightly to be integer for all columns (the question has double input data in a third of the columns). Those double columns have to be converted to integer because factor is only valid for integer vectors. As mentioned in the other answers.

So, strictly with the input data in the question, and including the double to integer conversion :
```
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1,2,4,5,3),50000))             
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)               
dat <- as.data.frame(dat)               
re.codes <- c("This","That","And","The","Other")           

system.time(for (i in 1:ncol(dat)) {
  if (!is.integer(dat[[i]]))
      set(dat,j=i,value=as.integer(dat[[i]]))
  setattr(dat[[i]],"levels",re.codes)
  setattr(dat[[i]],"class","factor")
})
#  user  system elapsed
#  0.06    0.01    0.08      # on my slow netbook

identical(dat, <result in question>)
# [1] TRUE
```
Note that set also works on data.frame, too. You don't have to convert to data.table to use it.

These are very small times, clearly. Since it's only a small input dataset :
```
dim(dat)
# [1] 250000     36 
object.size(dat)
# 68.7 Mb
```
Scaling up from this should reveal larger differences. But even so I think it should be (just about) measurably fastest. Not a significant difference that anyone minds about, at this size, though.

The setattr function is also in the bit package, btw. So the 0.00 method can be done with either data.table or bit. To do the type conversion by reference (if required) either set or := (both in data.table) is needed, afaik.
0 讨论(0)
发布评论:

提交评论
- 加载中...

-上瘾入骨i

2021-02-04 06:56

My computer is obviously much slower, but structure is a pretty fast way to do this:

> system.time({
+ dat1 <- dat
+ for(x in 1:ncol(dat)) {
+   dat1[,x] <- factor(dat1[,x], labels=re.codes)
+   }
+ })
   user  system elapsed 
 11.965   3.172  15.164 
> 
> system.time({
+ m <- as.matrix(dat)
+ dat2 <- data.frame( matrix( re.codes[m], nrow = nrow(m)))
+ })
   user  system elapsed 
  2.100   0.516   2.621 
> 
> system.time(dat3 <- data.frame(lapply(dat, structure, class='factor', levels=re.codes)))
   user  system elapsed 
  0.484   0.332   0.820 

# this isn't because the levels get re-ordered
> all.equal(dat1, dat2)

> all.equal(dat1, dat3)
[1] TRUE

0 讨论(0)