I read this article http://www.r-bloggers.com/comparing-hist-and-cut-r-functions/ and found hist() to be about 4 times faster than cut() on my PC. My sc
Here is an implementation based on your findInterval suggestion, which is 5-6 times faster than the classical cut:
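cut2 <- function(x, breaks) {
  # Build "(a,b]" labels from consecutive pairs of breaks
  labels <- paste0("(", breaks[-length(breaks)], ",", breaks[-1L], "]")
  # findInterval() gives the interval index of each x; note that it bins as
  # [a,b) by default, so a value exactly equal to a break is treated differently from cut()
  factor(labels[findInterval(x, breaks)], levels = labels)
}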
library(microbenchmark)
set.seed(1)
data <- rnorm(1e4, mean=0, sd=1)
# my_breaks is assumed to be a numeric break vector defined earlier (not shown here)
microbenchmark(cut.default(data, my_breaks), cut2(data, my_breaks))
# Unit: microseconds
#                          expr      min        lq    median        uq      max neval
#  cut.default(data, my_breaks) 3011.932 3031.1705 3046.5245 3075.3085 4119.147   100
#         cut2(data, my_breaks)  453.761  459.8045  464.0755  469.4605 1462.020   100
identical(cut(data, my_breaks), cut2(data, my_breaks))
# TRUE
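One caveat is worth sketching. By default findInterval bins values as [a,b), while cut uses (a,b], and values outside the breaks are not turned into NA. The variant below is only a minimal sketch and not part of the answer above: the name cut2_right is hypothetical, and it assumes R >= 3.3.0, where findInterval gained the left.open argument.

```r
# Hypothetical variant of cut2 (name chosen for illustration only).
# Assumes R >= 3.3.0 for the left.open argument of findInterval().
cut2_right <- function(x, breaks) {
  labels <- paste0("(", breaks[-length(breaks)], ",", breaks[-1L], "]")
  # left.open = TRUE makes the intervals (a,b], matching cut()'s default
  idx <- findInterval(x, breaks, left.open = TRUE)
  # 0 means x <= first break; length(breaks) means x > last break
  idx[idx == 0L | idx == length(breaks)] <- NA_integer_
  factor(labels[idx], levels = labels)
}

breaks <- c(0, 1, 2, 3)
x <- c(-1, 0, 0.5, 1, 1.5, 3, 4)
res <- cut2_right(x, breaks)
ref <- cut(x, breaks)

# Compare the integer codes and the levels with cut()
identical(as.integer(res), as.integer(ref))
identical(levels(res), levels(ref))
```

Whether the extra handling is worth the small overhead depends on the data: for continuous values that always fall strictly inside the breaks, the simpler cut2 behaves the same.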
The hist function creates counts by bins in a similar way to a combination of table and cut. For example,
set.seed(1)
x <- rnorm(100)
hist(x, plot = FALSE)
## $breaks
## [1] -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5
##
## $counts
## [1] 1 3 7 14 21 20 19 9 4 2
table(cut(x, seq.int(-2.5, 2.5, 0.5)))
## (-2.5,-2] (-2,-1.5] (-1.5,-1] (-1,-0.5]  (-0.5,0]   (0,0.5]   (0.5,1]
##         1         3         7        14        21        20        19
##   (1,1.5]   (1.5,2]   (2,2.5]
##         9         4         2
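The equivalence can also be checked programmatically by comparing the counts from hist with those from table and cut, reusing the breaks that hist chose. This is a small sketch rather than a general proof; hist uses include.lowest = TRUE by default, so the same option is passed to cut here.

```r
set.seed(1)
x <- rnorm(100)

# Let hist() choose the breaks, without drawing anything
h <- hist(x, plot = FALSE)

# Bin the same data with cut(), reusing hist()'s breaks;
# include.lowest = TRUE mirrors hist()'s default
binned <- cut(x, h$breaks, include.lowest = TRUE)
counts_from_cut <- as.vector(table(binned))

# Both approaches should give the same count per bin
all(h$counts == counts_from_cut)
```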
If you want the raw output from cut, you can't use hist.
However, if the speed of cut is a problem (and you might want to double check that it really is the slow part of your analysis; remember that "premature optimization is the root of all evil"), then you can use the lower-level .bincode. This skips the input checking and label creation that cut performs.
.bincode(x, seq.int(-2.5, 2.5, 0.5))
## [1] 4 6 4 9 6 4 6 7 7 5 9 6 ...
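Since .bincode returns only the integer bin codes, labels or counts have to be derived separately if they are needed. A minimal sketch of how that might look follows; the label format is built by hand here and is an assumption, not something .bincode provides.

```r
set.seed(1)
x <- rnorm(100)
breaks <- seq.int(-2.5, 2.5, 0.5)

# Integer bin codes only: no labels, no factor
codes <- .bincode(x, breaks)

# The codes agree with the integer codes underlying cut()'s factor
identical(codes, as.integer(cut(x, breaks)))

# Counts per bin, comparable to hist(x, plot = FALSE)$counts
bin_counts <- tabulate(codes, nbins = length(breaks) - 1L)
bin_counts

# Labels can be attached afterwards if needed (hand-built here)
labels <- paste0("(", breaks[-length(breaks)], ",", breaks[-1L], "]")
head(labels[codes])
```

Keeping the codes as plain integers and only building labels at the end avoids most of the overhead that cut spends on formatting.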