Challenge: recoding a data.frame() — make it faster

借酒劲吻你 2021-02-04 06:43

Recoding is a common practice for survey data, but the most obvious routes take more time than they should.

The fastest code that accomplishes the same task with the pr

6 answers
  • 2021-02-04 06:45

    Making factors is expensive. Building the factor only once and then indexing into it is comparable in speed with the commands that use structure and, in my opinion, preferable, because you don't have to depend on how factors happen to be constructed.

    rc <- factor(re.codes, levels=re.codes)
    dat5 <- as.data.frame(lapply(dat, function(d) rc[d]))
    

    EDIT 2: Interestingly, this seems to be a case where lapply does speed things up. This for loop is substantially slower.

    for(i in seq_along(dat)) {
      dat[[i]] <- rc[dat[[i]]]
    }
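
    A small sketch of why the lookup in both snippets above works, on toy data that is not from the question: subsetting a factor with integer positions returns a factor that keeps the full set of levels, so rc[d] recodes a whole column in one vectorised step. The vector d is a made-up example column.

```r
re.codes <- c("This", "That", "And", "The", "Other")

# Build the factor once; element i has integer code i and label re.codes[i].
rc <- factor(re.codes, levels = re.codes)

# Hypothetical column of survey codes (toy data).
d <- c(1L, 5L, 3L, 3L, 2L)

# Indexing the factor by position performs the recode.
recoded <- rc[d]

as.character(recoded)  # "This" "Other" "And" "And" "That"
levels(recoded)        # all five levels are kept, in the original order
```

    Because the levels come from rc rather than from the data, every recoded column gets the same levels even if some codes never occur in it.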
    

    EDIT 1: You can also speed things up by being more precise with your types. Try any of the solutions (but especially your original one) creating your data as integers, as follows. For details, see a previous answer of mine here.

    dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000))
    

    This is also a good idea because converting from floating point to integer, as all of the faster solutions here do, can give unexpected behavior; see this question.
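
    As a rough illustration of that kind of surprise (the numbers are made up here, not taken from the linked question): as.integer() truncates toward zero rather than rounding, and a double used as an index is truncated the same way, so a value that prints as 8 but is stored slightly below 8 ends up as 7.

```r
x <- (0.1 + 0.7) * 10   # stored as a double slightly below 8

print(x)                # prints 8 at the default 7 significant digits
x == 8                  # FALSE
as.integer(x)           # 7, because as.integer() truncates toward zero

# The same truncation happens when a double is used as an index.
letters[x]              # "g" (element 7), not "h" (element 8)

# Rounding before converting avoids the surprise.
as.integer(round(x))    # 8
```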

  • 2021-02-04 06:46

    Try this:

    m <- as.matrix(dat)
    
    dat <- data.frame( matrix( re.codes[m], nrow = nrow(m)))
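
    To make the mechanics of this one-liner explicit, here is a small sketch on toy data (the 3 x 2 matrix is made up). Indexing a character vector with a matrix of codes returns a plain vector in column-major order and drops the dimensions, which is why the result is passed back through matrix(). Whether data.frame() then turns the character columns into factors depends on stringsAsFactors, whose default changed in R 4.0.0, so it is set explicitly here.

```r
re.codes <- c("This", "That", "And", "The", "Other")

# Toy matrix of integer codes (made-up data).
m <- matrix(c(1L, 2L, 3L,
              5L, 4L, 3L), nrow = 3)

# Indexing with a matrix of codes gives a plain character vector.
looked.up <- re.codes[m]
is.null(dim(looked.up))   # TRUE: the dimensions are dropped

# matrix() restores the shape; values are filled column by column.
recoded <- matrix(looked.up, nrow = nrow(m))

# Ask for factors explicitly rather than relying on the default.
dat <- data.frame(recoded, stringsAsFactors = TRUE)
levels(dat[[1]])          # sorted alphabetically, not in re.codes order
```

    The alphabetical level order is the main difference from the factor-based answers, which keep the levels in the order given by re.codes.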
    
  • 2021-02-04 06:50

    Combining @DWin's answer, and my answer from Most efficient list to data.frame method?:

    system.time({
      dat3 <- list()
      # define attributes once outside of loop
      attrib <- list(class="factor", levels=re.codes)
      for (i in names(dat)) {              # loop over each column in 'dat'
        dat3[[i]] <- as.integer(dat[[i]])  # convert column to integer
        attributes(dat3[[i]]) <- attrib    # assign factor attributes
      }
      # convert 'dat3' into a data.frame. We can do it like this because:
      # 1) we know 'dat' and 'dat3' have the same number of rows and columns
      # 2) we want 'dat3' to have the same colnames as 'dat'
      # 3) we don't care if 'dat3' has different rownames than 'dat'
      attributes(dat3) <- list(row.names=c(NA_integer_,nrow(dat)),
        class="data.frame", names=names(dat))
    })
    identical(dat2, dat3)  # 'dat2' is from @Dwin's answer
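
    The final step above can be tried in isolation. The sketch below (toy data, hypothetical names) turns a plain list of equal-length columns into a data.frame by setting three attributes directly, skipping the checks that as.data.frame() performs. It writes the row names as c(NA_integer_, -n), the compact form base R uses for automatic row names; that differs slightly from the positive n in the code above and is a choice made for this illustration only.

```r
# Toy columns as a plain list (made-up data).
cols <- list(a = c(1L, 2L, 3L), b = c(5L, 4L, 3L))
n <- length(cols[[1]])

# Set the attributes that make a list a data.frame.
# This skips the validity checks of as.data.frame(), so the caller has to
# guarantee that every column has the same length.
attributes(cols) <- list(
  names     = c("a", "b"),
  row.names = c(NA_integer_, -n),  # compact form for automatic row names 1..n
  class     = "data.frame"
)

is.data.frame(cols)   # TRUE
dim(cols)             # 3 2
all.equal(cols, data.frame(a = c(1L, 2L, 3L), b = c(5L, 4L, 3L)))
```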
    
  • 2021-02-04 06:52

    The help page for class() says that class<- is deprecated and recommends the as. methods instead. I haven't quite figured out why the earlier effort was reporting 0 observations when the data was obviously in the object, but this method results in a complete object:

    system.time({
      dat2 <- vector(mode="list", length(dat))
      for (i in 1:length(dat)) {
        dat2[[i]] <- dat[[i]]
        storage.mode(dat2[[i]]) <- "integer"
        attributes(dat2[[i]]) <- list(class="factor", levels=re.codes)
      }
      names(dat2) <- names(dat)
      dat2 <- as.data.frame(dat2)
    })
    #--------------------------  
      user  system elapsed 
      0.266   0.290   0.560 
    > str(dat2)
    'data.frame':   250000 obs. of  36 variables:
     $ V1 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
     $ V2 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
     $ V3 : Factor w/ 5 levels "This","That",..: 1 2 4 5 3 1 2 4 5 3 ...
     $ V4 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
     $ V5 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
     $ V6 : Factor w/ 5 levels "This","That",..: 1 2 4 5 3 1 2 4 5 3 ...
     $ V7 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
     $ V8 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
     snipped
    

    All 36 columns are there.

  • 2021-02-04 06:54

    A data.table answer for your consideration. We're just using setattr() from it, which works on data.frame, and columns of data.frame. No need to convert to data.table.

    The test data again:

    dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000)) 
    dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat) 
    dat <- as.data.frame(dat) 
    re.codes <- c("This","That","And","The","Other") 
    

    Now change the class and set the levels of each column directly, by reference:

    require(data.table)
    system.time(for (i in 1:ncol(dat)) {
      setattr(dat[[i]],"levels",re.codes)
      setattr(dat[[i]],"class","factor")
    })
    # user  system elapsed 
    #   0       0       0 
    
    identical(dat, <result in question>)
    # [1] TRUE
    

    Does 0.00 win? As you increase the size of the data, this method stays at 0.00.

    Ok, I admit, I changed the input data slightly so that every column is integer. In the question, one of the three source vectors is double, and cbind() coerces the whole matrix to double, so the columns have to be converted to integer first: as mentioned in the other answers, a factor is only valid on top of an integer vector.

    So, strictly with the input data in the question, and including the double-to-integer conversion:

    dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1,2,4,5,3),50000))             
    dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)               
    dat <- as.data.frame(dat)               
    re.codes <- c("This","That","And","The","Other")           
    
    system.time(for (i in 1:ncol(dat)) {
      if (!is.integer(dat[[i]]))
          set(dat,j=i,value=as.integer(dat[[i]]))
      setattr(dat[[i]],"levels",re.codes)
      setattr(dat[[i]],"class","factor")
    })
    #  user  system elapsed
    #  0.06    0.01    0.08      # on my slow netbook
    
    identical(dat, <result in question>)
    # [1] TRUE
    

    Note that set() also works on a data.frame. You don't have to convert to data.table to use it.
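
    For anyone who wants to try the by-reference approach on something small, here is a sketch on toy data. It assumes the data.table package is installed (the loop is skipped otherwise) and uses only set() and setattr(), both of which appear in the code above; the data frame toy is made up.

```r
re.codes <- c("This", "That", "And", "The", "Other")

# Toy data: one integer column and one double column (made-up values).
toy <- data.frame(V1 = c(1L, 2L, 5L), V2 = c(3, 4, 1))

if (requireNamespace("data.table", quietly = TRUE)) {
  for (i in seq_along(toy)) {
    # Factor codes must be integers, so convert double columns first.
    if (!is.integer(toy[[i]]))
      data.table::set(toy, j = i, value = as.integer(toy[[i]]))
    # Attach the levels first, then the class, modifying the column in place.
    data.table::setattr(toy[[i]], "levels", re.codes)
    data.table::setattr(toy[[i]], "class", "factor")
  }
  str(toy)
}
```

    The levels are attached before the class so that the column is already a well-formed factor at the moment it is labelled as one.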

    These are very small times, clearly, since it's only a small input dataset:

    dim(dat)
    # [1] 250000     36 
    object.size(dat)
    # 68.7 Mb
    

    Scaling up from this should reveal larger differences. Even so, I think this method should be (just about) measurably fastest, although at this size the difference is not big enough for anyone to mind.

    The setattr function is also in the bit package, btw. So the 0.00 method can be done with either data.table or bit. To do the type conversion by reference (if required) either set or := (both in data.table) is needed, afaik.

  • 2021-02-04 06:56

    My computer is obviously much slower, but structure is a pretty fast way to do this:

    > system.time({
    + dat1 <- dat
    + for(x in 1:ncol(dat)) {
    +   dat1[,x] <- factor(dat1[,x], labels=re.codes)
    +   }
    + })
       user  system elapsed 
     11.965   3.172  15.164 
    > 
    > system.time({
    + m <- as.matrix(dat)
    + dat2 <- data.frame( matrix( re.codes[m], nrow = nrow(m)))
    + })
       user  system elapsed 
      2.100   0.516   2.621 
    > 
    > system.time(dat3 <- data.frame(lapply(dat, structure, class='factor', levels=re.codes)))
       user  system elapsed 
      0.484   0.332   0.820 
    
    # dat1 and dat2 are not equal, because the levels get re-ordered
    > all.equal(dat1, dat2)
    
    > all.equal(dat1, dat3)
    [1] TRUE
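
    A likely reason structure() is so much cheaper is that it only attaches attributes to integer codes that already exist, while factor() has to work out the levels and match every value against them. Here is a small sketch on toy data (the vector codes is made up) showing that the two give the same result when the codes are integers in 1..k. It assumes integer input, since a factor's underlying codes must be of integer type.

```r
re.codes <- c("This", "That", "And", "The", "Other")

# Toy integer codes in 1..5 (made-up data).
codes <- c(1L, 5L, 3L, 2L, 4L, 1L)

# Slow route: factor() derives the levels and matches every value.
f1 <- factor(codes, levels = 1:5, labels = re.codes)

# Fast route: attach the factor attributes to the existing integer codes.
f2 <- structure(codes, class = "factor", levels = re.codes)

all.equal(f1, f2)   # TRUE
as.character(f2)    # "This" "Other" "And" "That" "The" "This"
```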
    