问题
I have a gridded dataset, with data available at the following locations:
lon <- seq(-179.75,179.75, by = 0.5)
lat <- seq(-89.75,89.75, by = 0.5)
I would like to find all of the data points that are within 500 km of the location:
mylat <- 47.9625
mylon <- -87.0431
I aim to use the geosphere package in R, but the method I\'ve currently written does not seem very efficient:
require(geosphere)
dd2 <- array(dim = c(length(lon),length(lat)))
for(i in 1:length(lon)){
for(ii in 1:length(lat)){
clon <- lon[i]
clat <- lat[ii]
dd <- as.numeric(distm(c(mylon, mylat), c(clon, clat), fun = distHaversine))
dd2[i,ii] <- dd <= 500000
}
}
Here, I loop through each grid in the data and find if the distance is less than 500 km. I then store a variable with either TRUE or FALSE, which I can then use to average the data (other variable). From this method, I want a matrix with TRUE or FALSE for the locations within 500 km from the lat and lon shown. Is there a more efficient method for doing this?
回答1:
Timings:
Comparing @nicola's and my version gives:
Unit: milliseconds
min lq mean median uq max neval
nicola1 184.217002 219.924647 297.60867 299.181854 322.635960 898.52393 100
floo01 61.341560 72.063197 97.20617 80.247810 93.292233 286.99343 100
nicola2 3.992343 4.485847 5.44909 4.870101 5.371644 27.25858 100
My original solution: (IMHO nicola's second version is much cleaner and faster.)
You can do the following (explanation below)
require(geosphere)
my_coord <- c(mylon, mylat)
dd2 <- matrix(FALSE, nrow=length(lon), ncol=length(lat))
outer_loop_state <- 0
for(i in 1:length(lon)){
coods <- cbind(lon[i], lat)
dd <- as.numeric(distHaversine(my_coord, coods))
dd2[i, ] <- dd <= 500000
if(any(dd2[i, ])){
outer_loop_state <- 1
} else {
if(outer_loop_state == 1){
break
}
}
}
Explanation:
For the loop i apply the following logic:
outer_loop_state
is initialized with 0. If a row with at least one raster-point inside the circle is found outer_loop_state
is set to 1. Once there are no more points within the circle for a given row i
break.
The distm
call in @nicola version basically does the same without this trick. So it calculates all rows.
Code for timings:
microbenchmark::microbenchmark(
{allCoords<-cbind(lon,rep(lat,each=length(lon)))
res<-matrix(distm(cbind(mylon,mylat),allCoords,fun=distHaversine)<=500000,nrow=length(lon))},
{my_coord <- c(mylon, mylat)
dd2 <- matrix(FALSE, nrow=length(lon), ncol=length(lat))
outer_loop_state <- 0
for(i in 1:length(lon)){
coods <- cbind(lon[i], lat)
dd <- as.numeric(distHaversine(my_coord, coods))
dd2[i, ] <- dd <= 500000
if(any(dd2[i, ])){
outer_loop_state <- 1
} else {
if(outer_loop_state == 1){
break
}
}
}},
{#intitialize the return
res<-matrix(FALSE,nrow=length(lon),ncol=length(lat))
#we find the possible value of longitude that can be closer than 500000
#How? We calculate the distance between us and points with our same lat
longood<-which(distm(c(mylon,mylat),cbind(lon,mylat))<500000)
#Same for latitude
latgood<-which(distm(c(mylon,mylat),cbind(mylon,lat))<500000)
#we build the matrix with only those values to exploit the vectorized
#nature of distm
allCoords<-cbind(lon[longood],rep(lat[latgood],each=length(longood)))
res[longood,latgood]<-distm(c(mylon,mylat),allCoords)<=500000}
)
回答2:
The dist*
functions of the geosphere
package are vectorized, so you only need to prepare better your inputs. Try this:
#prepare a matrix with coordinates of every position
allCoords<-cbind(lon,rep(lat,each=length(lon)))
#call the dist function and put the result in a matrix
res<-matrix(distm(cbind(mylon,mylat),allCoords,fun=distHaversine)<=500000,nrow=length(lon))
#check the result
identical(res,dd2)
#[1] TRUE
As the @Floo0 answer showed, there is a lot of unnecessary calculations. We can follow another strategy: we first determine the lon and lat range that can be closer than the threshold and then we use only them to calculate the distance:
#initialize the return
res<-matrix(FALSE,nrow=length(lon),ncol=length(lat))
#we find the possible values of longitude that can be closer than 500000
#How? We calculate the distances between us and points with our same lon
longood<-which(distm(c(mylon,mylat),cbind(lon,mylat))<=500000)
#Same for latitude
latgood<-which(distm(c(mylon,mylat),cbind(mylon,lat))<=500000)
#we build the matrix with only those values to exploit the vectorized
#nature of distm
allCoords<-cbind(lon[longood],rep(lat[latgood],each=length(longood)))
res[longood,latgood]<-distm(c(mylon,mylat),allCoords)<=500000
In this way, you calculate just lg+ln+lg*ln
(lg
and ln
are the length of latgood
and longood
), i.e. 531 distances, opposed to the 259200 with my previous method.
回答3:
I add below a solution using the spatialrisk package. The key functions in this package are written in C++ (Rcpp), and are therefore very fast.
First, load the data:
mylat <- 47.9625
mylon <- -87.0431
lon <- seq(-179.75,179.75, by = 0.5)
lat <- seq(-89.75,89.75, by = 0.5)
df <- expand.grid(lon = lon, lat = lat)
The function spatialrisk::points_in_circle() calculates the observations within radius from a center point. Note that distances are calculated using the Haversine formula.
Timings for the spatialrisk approach compared to the @Hugh version:
spatialrisk::points_in_circle(df, mylon, mylat, radius = 5e5)
Unit: milliseconds
expr min lq mean median uq max neval cld
spatialrisk 3.071897 3.366256 5.224479 4.068124 4.809626 17.24378 100 a
hutils 17.507311 20.788525 29.470707 25.061943 31.066139 268.29375 100 b
The result can be easily converted to a matrix.
Have a look at the excellent answer by @philcolbourn on how to test if a point is inside a circle. See: https://stackoverflow.com/a/7227057/5440749
回答4:
Just use hutils::haversine_distance(lat, lon, mylat, mylon) < 500
directly.
If the points are assumed to be a cross join of the given lat
and lon
, use a cross-join first to obtain them:
library(data.table)
library(hutils)
lon <- seq(-179.75,179.75, by = 0.5)
lat <- seq(-89.75,89.75, by = 0.5)
mylat <- 47.9625
mylon <- -87.0431
Points <- CJ(lon = lon,
lat = lat)
Points[, dist := haversine_distance(lat, lon, mylat, mylon)]
Points[, sum(dist < 500)]
#> [1] 379
Created on 2019-10-24 by the reprex package (v0.3.0)
It improves on the existing answers by its speed and robustness. In particular, it does not rely on the gridded nature of the data and will work with long vectors of coordinates. Below are timings for 100,000 points
# A tibble: 2 x 14
expression min mean median max `itr/sec` mem_alloc n_gc n_itr total_time
<chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <bch:tm>
1 nicola2 39891.120ms 39891.120ms 39891.120ms 39891.120ms 0.0251 8808.632MB 0 1 39891.120ms
2 hutils 15.492ms 15.591ms 15.578ms 15.728ms 64.1 5.722MB 0 33 514.497ms
来源:https://stackoverflow.com/questions/39222302/find-locations-within-certain-lat-lon-distance-in-r