I have 2 sets of points, set1 and set2. Both sets of points have a date associated with each point. Points in set1 are "ephemeral" and exist only on the given date. Points in set2 are "permanent": they are constructed on a given date and exist forever after that date.
set.seed(1)
dates <- seq(as.Date('2011-01-01'),as.Date('2011-12-31'),by='days')
set1 <- data.frame(lat=40+runif(10000),
lon=-70+runif(10000),date=sample(dates,10000,replace=TRUE))
set2 <- data.frame(lat=40+runif(100),
lon=-70+runif(100),date=sample(dates,100,replace=TRUE))
Here's my problem: for each point in set1 (ephemeral), find the distance to the closest point in set2 (permanent) that was constructed BEFORE the event in set1 occurred. For example, the 1st point in set1 occurred on 2011-03-18:
> set1[1,]
lat lon date
1 40.26551 -69.93529 2011-03-18
So I want to find the closest point in set2 that was constructed before 2011-03-18:
> head(set2[set2$date<=as.Date('2011-03-18'),])
lat lon date
1 40.41531 -69.25765 2011-02-18
7 40.24690 -69.29812 2011-02-19
13 40.10250 -69.52515 2011-02-12
14 40.53675 -69.28134 2011-02-27
17 40.66236 -69.07396 2011-02-17
20 40.67351 -69.88217 2011-01-04
The additional wrinkle is that these are latitude/longitude points, so I have to calculate distances along the surface of the earth. The R package fields provides a convenient function to do this:
require(fields)
distMatrix <- rdist.earth(set1[,c('lon','lat')],
set2[,c('lon','lat')], miles = TRUE)
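For anyone who wants to see what a great-circle distance involves, here is a minimal base R sketch using the haversine formula. It is only an illustration and not necessarily how fields computes its result; the function name haversineMiles and the default Earth radius in miles are my own assumptions, so values may differ slightly from rdist.earth.

```r
# Hypothetical helper: great-circle distance in miles via the haversine formula.
# The radius is an assumed approximate Earth radius in miles.
haversineMiles <- function(lon1, lat1, lon2, lat2, radius = 3963.34) {
  toRad <- pi / 180
  dLat <- (lat2 - lat1) * toRad
  dLon <- (lon2 - lon1) * toRad
  a <- sin(dLat / 2)^2 +
    cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Distance in miles between two made-up points
haversineMiles(-69.9, 40.2, -69.5, 40.1)
```

The function is vectorized over its arguments, so it can be applied to whole columns of equal length.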
My question is, how can I adjust the distances in this matrix to Inf if the point in set2 (column of distance matrix) was constructed after the point in set1 (row of distance matrix)?
Here is what I would do:
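# TRUE where the set1 event falls on or before the set2 construction date,
# i.e. the set2 point did not yet exist strictly before the event
earlierMatrix <- outer(set1$date, set2$date, "<=")
# Adding Inf pushes those entries out of contention; the rest are unchanged
distMatrix2 <- distMatrix + ifelse(earlierMatrix, Inf, 0)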
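To see the date-masking idea on something small enough to check by hand, here is a toy sketch in base R. The dates and distances are made up, and the names eventDates, siteDates, toyDist and notYetBuilt are just for illustration.

```r
# Toy data: 2 ephemeral events and 3 permanent sites (all values made up)
eventDates <- as.Date(c('2011-03-01', '2011-01-15'))
siteDates  <- as.Date(c('2011-02-01', '2011-04-01', '2011-01-10'))
toyDist <- matrix(c(5, 2, 9,
                    4, 1, 7), nrow = 2, byrow = TRUE)

# TRUE where the site was built on or after the event date
notYetBuilt <- outer(eventDates, siteDates, "<=")

# Mask ineligible sites so they can never be the minimum
masked <- toyDist
masked[notYetBuilt] <- Inf

# Nearest eligible site (column index) and its distance, per event
nearestIdx  <- apply(masked, 1, which.min)
nearestDist <- apply(masked, 1, min)
```

Here the first event keeps sites 1 and 3 and picks site 1, while the second event can only use site 3. If an event had no eligible site, its minimum would be Inf, so that case would need to be handled separately.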
Here's my attempt at an answer. It's not particularly efficient, but I think it is correct. It also allows you to easily sub in different distance calculators:
#Calculate distances
require(fields)
distMatrix <- lapply(1:nrow(set1),function(x) {
  #Find distances to all points
  distances <- rdist.earth(set1[x,c('lon','lat')], set2[,c('lon','lat')], miles = TRUE)
  #Set distance to Inf if the set1 point occurred BEFORE the set2 dates
  distances <- ifelse(set1[x,'date']<set2[,'date'], Inf, distances)
  return(distances)
})
distMatrix <- do.call(rbind,distMatrix)
#Find distance to closest object
set1$dist <- apply(distMatrix,1,min)
#Find id of closest object
objectID <- lapply(1:nrow(set1),function(x) {
  if (set1[x,'dist']<Inf) {
    IDs <- which(set1[x,'dist']==distMatrix[x,])
  } else {
    IDs <- NA
  }
  #Randomly break ties (if there are any). Index by position, because
  #sample(IDs,1) draws from 1:IDs when IDs is a single number
  return(IDs[sample.int(length(IDs),1)])
})
set1$objectID <- do.call(rbind,objectID)
Here's the head of the resulting dataset:
> head(set1)
lat lon date dist objectID
1 40.26551 -69.93529 2011-03-18 3.215514 13
2 40.37212 -69.32339 2011-02-11 10.320910 46
3 40.57285 -69.26463 2011-02-23 3.954132 4
4 40.90821 -69.88870 2011-04-24 4.132536 49
5 40.20168 -69.95335 2011-02-24 4.284692 45
6 40.89839 -69.86909 2011-07-12 3.385769 57
Source: https://stackoverflow.com/questions/8509329/finding-nearest-neighbor-between-2-sets-of-dated-points