问题
I would like to pick the closest previous owner within a set distance only after the first search condition isn't met. The locations are called reflo
(reference location), and they have a corresponding x and y coordinates (called locx
and locy
, respectively).
The conditions:
- if
lifetime_census$reflo==owners$reflo.x[i]
then condition is met - if
lifetime_census$reflo!=owners$reflo.x[i]
, then find next closest record (within 30 meters) - if there is no record within 30 meters, then assign
NA
Previous owners (>20,000) are stored in a dataset called lifetime_census
. Here is a sample of the data:
id previous_id reflo locx locy lifespan
16161 5587 -310 -3 10 1810
16848 5101 Q1 17.3 0.8 55
21815 6077 M2 13 1.8 979
23938 6130 -49 -4 9 374
29615 7307 B.1 2.5 1 1130
I then have an owners
dataset (here is a sample):
squirrel_id spr_census reflo.x spring_locx spring_locy
6391 2005 M3 13 2.5
6130 2005 -310 -3 10
23586 2019 B9 2 9
To illustrate what I am trying to achieve:
squirrel_id spr_census reflo.x spring_locx spring_locy previous_owner
6391 2004 M3 13 2.5 6077
6130 2005 -310 -3 10 5587
23586 2019 B9 2 9 NA
What I have currently tried is this:
n <- length(owners$squirrel_id)
distance <- 30 #This can be easily changed to bigger or smaller values
for(i in 1:n) {
last_owner <- subset(lifetime_census,
lifetime_census$reflo==owners$reflo.x[i] & #using the exact location
((30*owners$spring_locx[i]-30* lifetime_census$locx)^2+(30* owners$spring_locy[i]-30* lifetime_census$locy)^2<=(distance)^2)) #this sets the search limit
owners[i,"previous_owner"] <- last_owner$previous_id[i]
}
I cannot figure out how to have the loop go through the conditions in order and then select the record within the search limit only after no exact match was found.
Any ideas?
回答1:
I would suggest something like this (asumming the units for locx
and alike are the same as for distance
:
distance = 30
distance_xy = function (x1, y1, x2, y2) {
sqrt((x2 - x1)^2 + (y2 -y1)^2)
}
for (i in 1:dim(owners)[1]) {
if (owners$reflo.x[i] %in% lifetime_census$reflo) {
owners$previous_owner[i] = lifetime_census[lifetime_census$reflo == owners$reflo.x[i], ]$previous_id
} else {
dt = distance_xy(owners$spring_locx[i], owners$spring_locy[i], lifetime_census$locx, lifetime_census$locy)
if (any(dt <= distance)) {
owners$previous_owner[i] = lifetime_census[order(dt), ]$previous_id[1L]
} else {
owners$previous_id[i] = NA
}
}
}
which gives:
squirrel_id spr_census reflo.x spring_locx spring_locy previous_owner
1 6391 2005 M3 13 2.5 6077
2 6130 2005 -310 -3 10.0 5587
3 23586 2019 B9 2 9.0 5587
Note that this will fail if there are more than one match for reflo
.
[EDIT] Adding an alternative based on comment below.
if
-else
statements can get pretty confusing when you start adding conditions. This is another way of achieving the same while avoiding the nested structure above:
for (i in 1:dim(owners)[1]) {
# if we find the reflo
if (owners$reflo.x[i] %in% lifetime_census$reflo) {
owners$previous_owner[i] = lifetime_census[lifetime_census$reflo == owners$reflo.x[i], ]$previous_id
next
}
# if we got here, then we didn't find the reflo, compute distances:
dt = distance_xy(owners$spring_locx[i], owners$spring_locy[i], lifetime_census$locx, lifetime_census$locy)
# if we find anyone within distance, get the closest one
if (any(dt <= distance)) {
owners$previous_owner[i] = lifetime_census[order(dt), ]$previous_id[1L]
next
}
# if we got here, there was nobody within range, set NA and move on:
owners$previous_id[i] = NA
}
The code does exactly the same, but by taking advantage of the for
loop and next
it is possible to remove every else
and the hole nested structure.
回答2:
Since you have 2 sets of criteria, I suggest splitting the task into two parts as well. Also, when combining two dataframes, I always suggest finding a suitable join.
For the exact matches, dplyr::inner_join
will give you the right rows.
For the next part, you can exclude the exact matches and use distance_left_join
from the fuzzyjoin
package to match the remaining rows. It comes with an option for maximum distance as well.
Then, you can simply bind the two results
library(data.table)
lifetime_census <- fread('id previous_id reflo locx locy lifespan
16161 5587 -310 -3 10 1810
16848 5101 Q1 17.3 0.8 55
21815 6077 M2 13 1.8 979
23938 6130 -49 -4 9 374
29615 7307 B.1 3 1 1130')
lifetime_census
#> id previous_id reflo locx locy lifespan
#> 1: 16161 5587 -310 -3.0 10.0 1810
#> 2: 16848 5101 Q1 17.3 0.8 55
#> 3: 21815 6077 M2 13.0 1.8 979
#> 4: 23938 6130 -49 -4.0 9.0 374
#> 5: 29615 7307 B.1 3.0 1.0 1130
owners <- fread('squirrel_id spr_census reflo.x spring_locx spring_locy
6391 2005 M3 13 2.5
6130 2005 -310 -3 10
23586 2019 B9 2 9')
owners
#> squirrel_id spr_census reflo.x spring_locx spring_locy
#> 1: 6391 2005 M3 13 2.5
#> 2: 6130 2005 -310 -3 10.0
#> 3: 23586 2019 B9 2 9.0
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(fuzzyjoin)
# Search for exact match
df1 <- inner_join(owners,lifetime_census ,by=c(reflo.x='reflo')) %>%
select(squirrel_id:spring_locy,previous_id)
df1
#> squirrel_id spr_census reflo.x spring_locx spring_locy previous_id
#> 1 6130 2005 -310 -3 10 5587
df2 <-
owners %>%
anti_join(df1,by=c('squirrel_id')) %>% # Remove rows with exact matches
distance_left_join(lifetime_census,
by=c(spring_locx='locx',spring_locy='locy'), # Match columns
max_dist=1, # Since you want a maximum distance of 30m = 1 unit
distance_col='dist') %>% # Optional, if you want to see the distance
select(squirrel_id:spring_locy,previous_id,dist)
bind_rows(df1,df2)
#> squirrel_id spr_census reflo.x spring_locx spring_locy previous_id dist
#> 1 6130 2005 -310 -3 10.0 5587 NA
#> 2 6391 2005 M3 13 2.5 6077 0.7
#> 3 23586 2019 B9 2 9.0 NA NA
Created on 2020-03-02 by the reprex package (v0.3.0)
回答3:
The following solves the problem.
Function to calculate distances:
distance_xy = function (x1, y1, x2, y2) {
sqrt((x2 - x1)^2 + (y2 -y1)^2)
}
Determine the previous id within a distance of 30 meters. Set id equal to NA in case all distances are greater than 30 meters.
library(tidyverse)
previous_id_fn <- function(v, w, years){
dists <- map2_dbl(lifetime_census$locx, lifetime_census$locy, ~distance_xy(.x, .y, v, w))
df <- data.frame(previous = lifetime_census$previous_id,
dist = dists,
life = lifetime_census$lifespan) %>%
filter(life < years)
id <- df$previous[[which.min(df$dist)]]
if (min(df$dist, na.rm = TRUE) > 30) { id <- NA }
return(id)
}
First join data.frame owners with data.frame lifetime_census to obtain a column with previous_id. Then apply the above defined function to each row of the data.frame.
owners %>%
left_join(., lifetime_census, by = c("reflo.x" = "reflo")) %>%
select(squirrel_id:spring_locy, previous_id) %>%
rowwise() %>%
mutate(previous_id = ifelse(is.na(previous_id),
previous_id_fn(spring_locx, spring_locy, 1000),
previous_id))
Edit:
I added an argument years to function previous_id_fn(). In case lifespan > years the function now returns NA.
来源:https://stackoverflow.com/questions/60327842/if-else-if-pick-first-matching-record-within-set-distance-only-after-first-cond