When I need to filter a data.frame, i.e., extract rows that meet certain conditions, I prefer to use the subset
function:
subset(airquality, Mont
Also [
is faster:
require(microbenchmark)
microbenchmark(subset(airquality, Month == 8 & Temp > 90),airquality[airquality$Month == 8 & airquality$Temp > 90,])
Unit: microseconds
expr min lq median uq max neval
subset(airquality, Month == 8 & Temp > 90) 301.994 312.1565 317.3600 349.4170 500.903 100
airquality[airquality$Month == 8 & airquality$Temp > 90, ] 234.807 239.3125 244.2715 271.7885 340.058 100
This question was answered in well in the comments by @James, pointing to an excellent explanation by Hadley Wickham of the dangers of subset
(and functions like it) [here]. Go read it!
It's a somewhat long read, so it may be helpful to record here the example that Hadley uses that most directly addresses the question of "what can go wrong?":
Hadley suggests the following example: suppose we want to subset and then reorder a data frame using the following functions:
scramble <- function(x) x[sample(nrow(x)), ]
subscramble <- function(x, condition) {
scramble(subset(x, condition))
}
subscramble(mtcars, cyl == 4)
This returns the error:
Error in eval(expr, envir, enclos) : object 'cyl' not found
because R no longer "knows" where to find the object called 'cyl'. He also points out the truly bizarre stuff that can happen if by chance there is an object called 'cyl' in the global environment:
cyl <- 4
subscramble(mtcars, cyl == 4)
cyl <- sample(10, 100, rep = T)
subscramble(mtcars, cyl == 4)
(Run them and see for yourself, it's pretty crazy.)