Analyzing spatial data between two points in R using a very large data set

问题

This is my first time writing code in R from scratch and I'm struggling with how to approach it. I'm looking at turtle nests and their proximity to light sources (i.e. houses, light poles, etc.) to determine how often a light source is within a given radius of a nest.

These are both very large data sets (hundreds of thousands of rows) so the code will likely need to run a loop for each nest position. GPS coordinates for both data sets are in decimal degrees.

The nest data is essentially latitude, longitude, date observed, and species (if known)

The light source data is latitude, longitude, type, and several other light-related parameters I'd like to keep in the data set.

Any suggestions on how to loop through the nest coordinates to determine light sources within radius, r, would be greatly appreciated! For each light source within r for a nest, I'd like for the end result to spit out the entire row of light source data (type, location, additional light-related parameters, etc.) if that is possible rather than just say how many values were T vs. F for being inside r. Thanks!

> Nest <- read.csv("Nest.csv", header=T)
> Lights <- read.csv("Lights.csv", header=T)
> #Nest
> dput(droplevels(Nest[1:10, ]))
structure(list(LAT = c(34.146535, 34.194585, 34.216854, 34.269901, 
34.358718, 34.37268, 34.380848, 34.394183, 34.410384, 34.415077
), LONG = c(-77.839787, -77.804013, -77.787032, -77.742722, -77.63655, 
-77.619872, -77.609373, -77.591654, -77.568456, -77.561256), 
    DATE = structure(c(2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 1L, 7L
    ), .Label = c("2016-05-19T03:12", "2016-05-21T07:23", "2016-05-23T08:14", 
    "2016-05-24T04:21", "2016-05-25T11:15", "2016-05-27T05:12", 
    "2016-05-27T09:45", "2016-05-28T09:42", "2016-05-28T10:18", 
    "2016-05-29T02:26"), class = "factor"), SPECIES = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Cc", class = "factor")), row.names = c(NA, 
10L), class = "data.frame")
> #Lights
> dput(droplevels(Lights[1:10, ]))
structure(list(LAT = c(34.410925, 34.410803, 34.410686, 34.410476, 
34.410361, 34.410237, 34.410151, 34.410016, 34.409821, 34.409671
), LONG = c(-77.568183, -77.568296, -77.568478, -77.568757, -77.568915, 
-77.569135, -77.569355, -77.569527, -77.569707, -77.569905), 
    DATE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
    ), .Label = "5/19/2016", class = "factor"), TYPE = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "R", class = "factor"), 
    WATTS = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 
10L), class = "data.frame")

回答1:

As you stated that your data sets were large, the proposed solution tries to avoid a full cartesian product between all Nest vs all Lamps.
For this, we use the non equi join possibilities of data.table which only allows simple operators like > or <.
This allows to make a first filter of the Lamps in a box around a Nest.
This box should be large enough to contain the circle of the max distance to Nest.
In a second step, we calculate the distance on the filtered data (much less calculation than a cartesian product of all data) :

library(data.table)
library(geosphere)

#To data.table
setDT(Nest)
setDT(Lights)

# Define a box around each nest
dlon<- 0.001
dlat <- 0.001

Nest[,c("LATNest","LONGNest","latmin","latmax","longmin","longmax"):=.(LAT,LONG,LAT-dlat, LAT+dlat,LONG-dlon,LONG+dlon)]
Nest[,c("LAT","LONG") :=.(NULL,NULL)]

# Search lights in box
LightNearNest <- Nest[Lights, .(LATNest,LONGNest, LATLight = LAT, LONGLight = LONG), on = .(latmin<LAT , latmax>LAT,longmin<LONG,longmax>LONG),nomatch=0,allow.cartesian=T]     


# Calculate distance 
LightNearNest[,dist:= geosphere::distHaversine(cbind(LONGNest,LATNest),cbind(LONGLight,LATNest))]
LightNearNest

    LATNest  LONGNest LATLight LONGLight      dist
1: 34.41038 -77.56846 34.41092 -77.56818 25.072269
2: 34.41038 -77.56846 34.41080 -77.56830 14.694370
3: 34.41038 -77.56846 34.41069 -77.56848  2.020476
4: 34.41038 -77.56846 34.41048 -77.56876 27.643784
5: 34.41038 -77.56846 34.41036 -77.56892 42.154475
6: 34.41038 -77.56846 34.41024 -77.56914 62.359234
7: 34.41038 -77.56846 34.41015 -77.56936 82.563993

来源：https://stackoverflow.com/questions/62462970/analyzing-spatial-data-between-two-points-in-r-using-a-very-large-data-set

标签

loops

Spatial