Why is `[` better than `subset`?

暗喜 2020-11-21 07:46

When I need to filter a data.frame, i.e., extract rows that meet certain conditions, I prefer to use the subset function:

subset(airquality, Mont         

2 Answers
  • 2020-11-21 08:06

    Also, `[` is faster:

    library(microbenchmark)
    microbenchmark(
      subset(airquality, Month == 8 & Temp > 90),
      airquality[airquality$Month == 8 & airquality$Temp > 90, ]
    )

    Unit: microseconds
                                                           expr     min       lq   median       uq     max neval
                     subset(airquality, Month == 8 & Temp > 90) 301.994 312.1565 317.3600 349.4170 500.903   100
     airquality[airquality$Month == 8 & airquality$Temp > 90, ] 234.807 239.3125 244.2715 271.7885 340.058   100
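
    The timing comparison assumes both expressions select the same rows. A quick check, as a sketch (the names by_subset and by_bracket are made up for illustration):

    ```r
    # Both forms should pick the same rows of the built-in airquality data
    by_subset  <- subset(airquality, Month == 8 & Temp > 90)
    by_bracket <- airquality[airquality$Month == 8 & airquality$Temp > 90, ]

    # Compare the selected row names rather than relying on printed output
    identical(rownames(by_subset), rownames(by_bracket))
    ```

    Note that the two forms treat missing values differently: subset drops rows where the condition is NA, while `[` returns them as NA rows. Month and Temp contain no missing values in airquality, so this does not affect the comparison here.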
    
  • 2020-11-21 08:15

    This question was answered well in the comments by @James, who pointed to an excellent explanation by Hadley Wickham of the dangers of subset (and functions like it) [here]. Go read it!

    It's a somewhat long read, so it may be helpful to record here the example Hadley uses that most directly addresses the question of "what can go wrong?"

    Suppose we want to subset and then reorder a data frame using the following functions:

    scramble <- function(x) x[sample(nrow(x)), ]
    
    subscramble <- function(x, condition) {
      scramble(subset(x, condition))
    }
    
    subscramble(mtcars, cyl == 4)
    

    This returns the error:

    Error in eval(expr, envir, enclos) : object 'cyl' not found

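    because R no longer "knows" where to find the object called 'cyl'. subset() captures its condition unevaluated and evaluates it inside the data frame. Called from within subscramble(), the expression it captures is just the name condition, and forcing that argument evaluates cyl == 4 in the environment subscramble() was called from (here the global environment), where no cyl exists. He also points out the truly bizarre stuff that can happen if by chance there is an object called 'cyl' in the global environment: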

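    # A length-1 global cyl makes the condition TRUE for every row
    cyl <- 4
    subscramble(mtcars, cyl == 4)

    # A length-100 global cyl gives a logical index longer than the data frame
    cyl <- sample(10, 100, replace = TRUE)
    subscramble(mtcars, cyl == 4)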
    

    (Run them and see for yourself, it's pretty crazy.)
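
    The failure comes down to what subset sees when it captures its argument. A minimal sketch of that capture step using base R's substitute(); show_expr and wrapper are hypothetical helpers, not part of Hadley's example:

    ```r
    # show_expr returns the unevaluated expression supplied as `condition`
    show_expr <- function(x, condition) substitute(condition)

    # wrapper passes its own argument along, as subscramble does with subset
    wrapper <- function(x, condition) show_expr(x, condition)

    direct  <- show_expr(mtcars, cyl == 4)  # the call cyl == 4
    wrapped <- wrapper(mtcars, cyl == 4)    # only the symbol `condition`

    print(direct)
    print(wrapped)
    ```

    Called directly, the function captures cyl == 4. Called through the wrapper, it captures only the name condition, so the original expression is no longer available to be evaluated inside the data frame.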

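    For use inside functions, one way around the problem is to have the caller compute the logical index and pass it to `[`, which uses standard evaluation. A sketch under that assumption; subscramble_rows is a hypothetical name, not from the linked explanation:

    ```r
    scramble <- function(x) x[sample(nrow(x)), ]

    # rows: a logical vector with one entry per row of x, computed by the caller
    subscramble_rows <- function(x, rows) {
      stopifnot(is.logical(rows), length(rows) == nrow(x))
      keep <- rows & !is.na(rows)  # treat NA as FALSE, as subset does
      scramble(x[keep, , drop = FALSE])
    }

    res <- subscramble_rows(mtcars, mtcars$cyl == 4)
    ```

    Because mtcars$cyl == 4 is evaluated before the call, a global object named cyl cannot change the result, and an index of the wrong length is rejected up front.
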