Fastest way to select a valid range for raster data


Question


Using R, I need to select the valid range for a given raster (from package raster) in the fastest possible way. I tried this:

library(raster)
library(microbenchmark)
library(ggplot2)
library(compiler)

r <- raster(ncol=100, nrow=100)
r[] <- runif(ncell(r))

#Let's see if precompiling helps speed...
f <- function(x, min, max) reclassify(x, c(-Inf, min, NA, max, Inf, NA))
g <- cmpfun(f)

#Benchmark!
compare <- microbenchmark(
    calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}), 
    reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
    g(r, 0.2, 0.8),
    times=100)
autoplot(compare) #Reclassify is much faster, precompiling doesn't help much.

#Check they are the same...
identical(
          calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}),
          reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA))
) #TRUE
identical(
          reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
          g(r, 0.2, 0.8)
) #TRUE

The reclassify method is much faster, but I'm sure that it can be sped up more. How can I do so?


Answer 1:


Here is one more way:

h <- function(r, min, max) {
  rr <- r[]                        # extract all cell values as a plain vector
  rr[rr < min | rr > max] <- NA    # set out-of-range values to NA
  r[] <- rr                        # write the values back into the raster
  r
}

i <- cmpfun(h)

identical(
  i(r, 0.2, 0.8),
  g(r, 0.2, 0.8)
)



#Benchmark!
compare <- microbenchmark(
  calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}), 
  reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
  g(r, 0.2, 0.8),
  h(r, 0.2, 0.8),
  i(r, 0.2, 0.8),
  times=100)
autoplot(compare) 

Compiling doesn't help much in this instance.

You could gain some further speedup by accessing the slots of the raster object directly with @ (although this is usually discouraged).

j <- function(r, min, max) {
  v <- r@data@values               # read the values slot directly
  v[v < min | v > max] <- NA
  r@data@values <- v               # write back, bypassing raster's accessors
  r
}

k <- cmpfun(j)

identical(
  j(r, 0.2, 0.8)[],
  g(r, 0.2, 0.8)[]
)




Answer 2:


While the accepted answer holds for the example raster, it is important to note that the fastest safe function depends strongly on raster size: the functions h and i presented by @rengis are only faster with relatively small rasters (and relatively simple reclassifications). Increasing the size of the raster r in the OP's example by an order of magnitude already makes reclassify quicker:

# Code from OP @AF7
library(raster)
library(microbenchmark)
library(ggplot2)
library(compiler)

#Let's see if precompiling helps speed...
f <- function(x, min, max) reclassify(x, c(-Inf, min, NA, max, Inf, NA))
g <- cmpfun(f)

# Functions from @rengis
h <- function(r, min, max) {
  rr <- r[]
  rr[rr < min | rr > max] <- NA
  r[] <- rr
  r
}

i <- cmpfun(h)

# Benchmark with larger raster (100k cells, vs 10k originally)
r <- raster(ncol = 1000, nrow = 100)
r[] <- runif(ncell(r))

compare <- microbenchmark(
  calc(r, fun=function(x){ x[x < 0.2] <- NA; x[x > 0.8] <- NA; return(x)}), 
  reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
  g(r, 0.2, 0.8),
  h(r, 0.2, 0.8),
  i(r, 0.2, 0.8),
  times=100)
autoplot(compare) 

The exact point at which reclassify becomes quicker depends both on the number of cells in the raster and on the complexity of the reclassification, but in this case the cross-over point is at about 50,000 cells (see below).

As the raster becomes even larger (or the calculation more complex), another way to speed up reclassification is using multi-threading, e.g. with the snow package:

# Reclassify, using clusterR to split into two threads
library(snow)
tryCatch({
      beginCluster(n = 2)
      clusterR(r, reclassify, args = list(rcl = c(-Inf, 0.2, NA, 0.8, Inf, NA)))
    }, finally = endCluster())

Multi-threading involves even more set-up overhead, so it only makes sense with very large rasters and/or more complex calculations (in fact, I was surprised to note that it didn't come out as the best option under any of the conditions I tested below; perhaps it would with a more complex reclassification).
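For reference, a more complex reclassification can be passed to clusterR in the same way. The multi-class breakpoints below are arbitrary and purely illustrative (a sketch, not the reclassification used in the benchmarks below):

# Sketch: a hypothetical multi-class reclassification run through clusterR.
# The breakpoints are arbitrary, chosen only for illustration.
rcl <- matrix(c(-Inf, 0.2, 1,
                 0.2, 0.4, 2,
                 0.4, 0.6, 3,
                 0.6, 0.8, 4,
                 0.8, Inf, 5),
              ncol = 3, byrow = TRUE)

tryCatch({
      beginCluster(n = 2)
      clusterR(r, reclassify, args = list(rcl = rcl))
    }, finally = endCluster())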

To illustrate, I plotted microbenchmark results using the OP's setup at raster sizes up to 10 million cells (10 runs each); the figure is not reproduced here.
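A minimal sketch of how such a size sweep could be set up (the cell counts are placeholders, not the exact grids behind the plot):

# Sketch: time the competing approaches over a range of raster sizes.
# The cell counts are placeholders, not the exact grids used for the plot.
sizes <- c(1e4, 1e5, 1e6, 1e7)

results <- lapply(sizes, function(n) {
  r <- raster(ncol = n / 100, nrow = 100)
  r[] <- runif(ncell(r))
  microbenchmark(
    reclassify = reclassify(r, c(-Inf, 0.2, NA, 0.8, Inf, NA)),
    h          = h(r, 0.2, 0.8),
    times = 10)
})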

As a final note, compiling didn't make a difference at any of the tested sizes.




Answer 3:


The raster package has a function for that: clamp. It is faster than g but slower than h and i because it has some overhead (safety) built in.

compare <- microbenchmark(
  h(r, 0.2, 0.8),
  i(r, 0.2, 0.8),
  clamp(r, 0.2, 0.8, useValues=FALSE),
  g(r, 0.2, 0.8),
  times=100)
autoplot(compare) 
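Note that useValues = FALSE is needed here: it makes clamp set out-of-range cells to NA, whereas with the default useValues = TRUE they are set to the limits instead, which would not match the other functions. A quick equivalence check, reusing the objects defined above (a sketch):

# With useValues = FALSE, cells outside [0.2, 0.8] become NA, matching g.
identical(
  clamp(r, 0.2, 0.8, useValues = FALSE)[],
  g(r, 0.2, 0.8)[]
)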


Source: https://stackoverflow.com/questions/34064738/fastest-way-to-select-a-valid-range-for-raster-data
