Challenge: recoding a data.frame() — make it faster

借酒劲吻你 2021-02-04 06:43

Recoding is a common practice for survey data, but the most obvious routes take more time than they should.

The fastest code that accomplishes the same task with the pr

6 answers
  •  慢半拍i (original poster)
    2021-02-04 06:54

    A data.table answer for your consideration. It uses only setattr() from that package, which works on a data.frame and on the columns of a data.frame, so there is no need to convert to data.table.

    The test data again:

    dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000)) 
    dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat) 
    dat <- as.data.frame(dat) 
    re.codes <- c("This","That","And","The","Other") 
    

    Now set the levels and change the class of each column directly, by reference:

    library(data.table)
    system.time(for (i in seq_len(ncol(dat))) {
      # Attach the labels and the class in place; the integer codes are not copied
      setattr(dat[[i]], "levels", re.codes)
      setattr(dat[[i]], "class", "factor")
    })
    # user  system elapsed 
    #   0       0       0 
    
    identical(dat, )
    # [1] TRUE
    

    Does 0.00 win? As you increase the size of the data, this method stays at 0.00.
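
    One way to see why the timing stays flat is to try the same two calls on a single small vector. The sketch below is illustrative only: the vector x and the helper variable addr_before are made up for the example, and address() is data.table's helper for inspecting where an object lives in memory. It checks that the result matches a conventional factor() recode and that the underlying data was not moved.

    ```r
    library(data.table)

    re.codes <- c("This", "That", "And", "The", "Other")
    x <- rep(1:5, 2)                     # small integer column with codes 1..5 (illustrative)

    # Conventional recode: factor() builds a new vector from the codes
    expected <- factor(x, levels = 1:5, labels = re.codes)

    addr_before <- address(x)            # where the integer data lives before the change
    setattr(x, "levels", re.codes)       # attach the labels in place
    setattr(x, "class", "factor")        # mark the vector as a factor in place

    same_result  <- identical(x, expected)            # matches the conventional recode?
    same_address <- identical(addr_before, address(x)) # data stayed where it was?
    ```

    Only two attributes are attached to the existing vector, so the work does not depend on how many rows the column has.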

    OK, I admit I changed the input data slightly so that every column is integer. In the question, one of the three source vectors is double, and because cbind() returns a single matrix of one type, that is enough to make the columns double. Double columns have to be converted to integer first, because a factor is only valid on an integer vector, as mentioned in the other answers.
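
    A quick base R check of the types involved may help here. This is only a sketch on five-row inputs; the names int_only, mixed, col_types and f are made up for the example. It relies on cbind() returning one matrix with a single storage type, so a single double vector is enough to make the whole matrix, and every column derived from it, double.

    ```r
    re.codes <- c("This", "That", "And", "The", "Other")

    int_only <- cbind(1:5, 5:1, c(1L, 2L, 4L, 5L, 3L))   # every input vector is integer
    mixed    <- cbind(1:5, 5:1, c(1, 2, 4, 5, 3))        # third input vector is double

    typeof(int_only)                                     # storage type of the integer-only matrix
    typeof(mixed)                                        # storage type once a double is mixed in

    # Type of each column after the matrix becomes a data.frame
    col_types <- sapply(as.data.frame(mixed), typeof)

    # A factor built the conventional way is stored as integer codes
    f <- factor(c(1, 2, 4, 5, 3), levels = 1:5, labels = re.codes)
    typeof(f)

    # The conversion that the by-reference method needs first
    converted <- as.integer(mixed[, 3])
    ```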

    So, strictly with the input data in the question, and including the double-to-integer conversion:

    dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1,2,4,5,3),50000))             
    dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)               
    dat <- as.data.frame(dat)               
    re.codes <- c("This","That","And","The","Other")           
    
    system.time(for (i in 1:ncol(dat)) {
      if (!is.integer(dat[[i]]))
          set(dat,j=i,value=as.integer(dat[[i]]))
      setattr(dat[[i]],"levels",re.codes)
      setattr(dat[[i]],"class","factor")
    })
    #  user  system elapsed
    #  0.06    0.01    0.08      # on my slow netbook
    
    identical(dat, )
    # [1] TRUE
    

    Note that set() works on a data.frame too. You don't have to convert to data.table to use it.
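
    As a minimal sketch of that point, here is set() replacing one column of a plain data.frame. The data.frame df and its column names are invented for the example.

    ```r
    library(data.table)

    df <- data.frame(a = c(1, 2, 3), b = 4:6)        # column a is double, column b is integer

    # Replace column 1 with its integer version; no conversion to data.table first
    set(df, j = 1L, value = as.integer(df[[1L]]))

    is.integer(df$a)                                 # has the column type changed?
    class(df)                                        # is it still a plain data.frame?
    ```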

    These are clearly very small times, since this is only a small input dataset:

    dim(dat)
    # [1] 250000     36 
    object.size(dat)
    # 68.7 Mb
    

    Scaling up from this should reveal larger differences. Even at this size I think it should be (just about) measurably the fastest, though the difference is not one that anyone would mind about.
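
    For anyone who wants to try the scaling themselves, a rough harness might look like the following. It is a sketch under assumptions, not a definitive benchmark: make_dat() and n are hypothetical names, the comparison method (lapply() with factor()) is just one conventional base R route and not necessarily the one from the question, and timings will vary by machine, so no figures are quoted.

    ```r
    library(data.table)

    re.codes <- c("This", "That", "And", "The", "Other")
    n <- 50000                                   # repetitions per column; raise this to scale up

    # Hypothetical helper: three integer columns of codes 1..5
    make_dat <- function(n) {
      as.data.frame(cbind(rep(1:5, n), rep(5:1, n), rep(c(1L, 2L, 4L, 5L, 3L), n)))
    }

    # Conventional route: build a new factor for every column
    base_dat <- make_dat(n)
    t_base <- system.time(
      base_dat[] <- lapply(base_dat, factor, levels = 1:5, labels = re.codes)
    )

    # By-reference route: attach levels and class to the existing columns
    ref_dat <- make_dat(n)
    t_ref <- system.time(for (i in seq_len(ncol(ref_dat))) {
      setattr(ref_dat[[i]], "levels", re.codes)
      setattr(ref_dat[[i]], "class", "factor")
    })

    # Both routes should agree on the result
    stopifnot(isTRUE(all.equal(base_dat, ref_dat)))
    print(rbind(base = t_base, by_reference = t_ref))
    ```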

    The setattr() function is also in the bit package, btw, so the 0.00 method can be done with either data.table or bit. To do the type conversion by reference (if required), either set() or := (both in data.table) is needed, afaik.
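
    For completeness, the := form of the same conversion could look like this on an object that is already a data.table. The table DT and its column V1 are made up for the example.

    ```r
    library(data.table)

    DT <- data.table(V1 = c(1, 2, 4, 5, 3))      # illustrative one-column table with double codes

    # Replace the column with its integer version without copying the whole table
    DT[, V1 := as.integer(V1)]

    is.integer(DT$V1)                            # has the column type changed?
    ```

    Unlike set(), := is used inside the [ ] of a data.table, which is why set() is the more convenient choice when the object is a plain data.frame.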
