If/else if: pick first matching record within set distance only after first condition is not met in R

强颜欢笑 提交于 2020-03-23 23:17:14

问题


I would like to pick the closest previous owner within a set distance only after the first search condition isn't met. The locations are called reflo (reference location), and they have a corresponding x and y coordinates (called locx and locy, respectively).

The conditions:

  • if lifetime_census$reflo==owners$reflo.x[i] then condition is met
  • if lifetime_census$reflo!=owners$reflo.x[i], then find next closest record (within 30 meters)
  • if there is no record within 30 meters, then assign NA

Previous owners (>20,000) are stored in a dataset called lifetime_census. Here is a sample of the data:

id         previous_id  reflo  locx    locy   lifespan  
16161      5587         -310    -3     10     1810    
16848      5101         Q1      17.3   0.8    55    
21815      6077         M2      13     1.8    979
23938      6130         -49     -4     9      374
29615      7307         B.1     2.5    1      1130

I then have an owners dataset (here is a sample):

squirrel_id      spr_census reflo.x    spring_locx      spring_locy 
6391              2005       M3           13             2.5  
6130              2005       -310         -3             10    
23586             2019       B9           2              9

To illustrate what I am trying to achieve:

squirrel_id spr_census reflo.x spring_locx spring_locy previous_owner   
6391        2004       M3       13         2.5         6077            
6130        2005       -310     -3         10          5587   
23586       2019       B9       2          9           NA

What I have currently tried is this:

n <- length(owners$squirrel_id)
distance <- 30 #This can be easily changed to bigger or smaller values

for(i in 1:n) {
  last_owner <- subset(lifetime_census,
    lifetime_census$reflo==owners$reflo.x[i] & #using the exact location
((30*owners$spring_locx[i]-30* lifetime_census$locx)^2+(30* owners$spring_locy[i]-30* lifetime_census$locy)^2<=(distance)^2)) #this sets the search limit

owners[i,"previous_owner"] <- last_owner$previous_id[i]

}

I cannot figure out how to have the loop go through the conditions in order and then select the record within the search limit only after no exact match was found.

Any ideas?


回答1:


I would suggest something like this (asumming the units for locx and alike are the same as for distance:

distance = 30

distance_xy = function (x1, y1, x2, y2) {
  sqrt((x2 - x1)^2 + (y2 -y1)^2)
}

for (i in 1:dim(owners)[1]) {
  if (owners$reflo.x[i] %in% lifetime_census$reflo) {
    owners$previous_owner[i] = lifetime_census[lifetime_census$reflo == owners$reflo.x[i], ]$previous_id
  } else {
    dt = distance_xy(owners$spring_locx[i], owners$spring_locy[i], lifetime_census$locx, lifetime_census$locy)
      if (any(dt <= distance)) {
        owners$previous_owner[i] = lifetime_census[order(dt), ]$previous_id[1L]
      } else {
        owners$previous_id[i] = NA
      }
    }
  }

which gives:

   squirrel_id spr_census reflo.x spring_locx spring_locy previous_owner
1        6391       2005      M3          13         2.5           6077
2        6130       2005    -310          -3        10.0           5587
3       23586       2019      B9           2         9.0           5587

Note that this will fail if there are more than one match for reflo.

[EDIT] Adding an alternative based on comment below.

if-else statements can get pretty confusing when you start adding conditions. This is another way of achieving the same while avoiding the nested structure above:

for (i in 1:dim(owners)[1]) {

  # if we find the reflo
  if (owners$reflo.x[i] %in% lifetime_census$reflo) {
    owners$previous_owner[i] = lifetime_census[lifetime_census$reflo == owners$reflo.x[i], ]$previous_id
    next
  }

  # if we got here, then we didn't find the reflo, compute distances:
  dt = distance_xy(owners$spring_locx[i], owners$spring_locy[i], lifetime_census$locx, lifetime_census$locy)

  # if we find anyone within distance, get the closest one
  if (any(dt <= distance)) {
    owners$previous_owner[i] = lifetime_census[order(dt), ]$previous_id[1L]
    next
  }

  # if we got here, there was nobody within range, set NA and move on:
  owners$previous_id[i] = NA
}

The code does exactly the same, but by taking advantage of the for loop and next it is possible to remove every else and the hole nested structure.




回答2:


Since you have 2 sets of criteria, I suggest splitting the task into two parts as well. Also, when combining two dataframes, I always suggest finding a suitable join.

For the exact matches, dplyr::inner_join will give you the right rows.

For the next part, you can exclude the exact matches and use distance_left_join from the fuzzyjoin package to match the remaining rows. It comes with an option for maximum distance as well.

Then, you can simply bind the two results

library(data.table)
lifetime_census <- fread('id         previous_id  reflo  locx    locy   lifespan  
16161      5587         -310    -3     10     1810    
16848      5101         Q1      17.3   0.8    55    
21815      6077         M2      13     1.8    979
23938      6130         -49     -4     9      374
29615      7307         B.1     3      1      1130')
lifetime_census
#>       id previous_id reflo locx locy lifespan
#> 1: 16161        5587  -310 -3.0 10.0     1810
#> 2: 16848        5101    Q1 17.3  0.8       55
#> 3: 21815        6077    M2 13.0  1.8      979
#> 4: 23938        6130   -49 -4.0  9.0      374
#> 5: 29615        7307   B.1  3.0  1.0     1130
owners <- fread('squirrel_id      spr_census reflo.x    spring_locx      spring_locy 
6391              2005       M3           13             2.5  
6130              2005       -310         -3             10    
23586             2019       B9           2              9')
owners
#>    squirrel_id spr_census reflo.x spring_locx spring_locy
#> 1:        6391       2005      M3          13         2.5
#> 2:        6130       2005    -310          -3        10.0
#> 3:       23586       2019      B9           2         9.0

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(fuzzyjoin)

# Search for exact match
df1 <- inner_join(owners,lifetime_census ,by=c(reflo.x='reflo')) %>% 
  select(squirrel_id:spring_locy,previous_id)
df1
#>   squirrel_id spr_census reflo.x spring_locx spring_locy previous_id
#> 1        6130       2005    -310          -3          10        5587


df2 <- 
  owners %>% 
  anti_join(df1,by=c('squirrel_id')) %>% # Remove rows with exact matches
    distance_left_join(lifetime_census,
                       by=c(spring_locx='locx',spring_locy='locy'), # Match columns
                       max_dist=1, # Since you want a maximum distance of 30m = 1 unit
                       distance_col='dist') %>% # Optional, if you want to see the distance
    select(squirrel_id:spring_locy,previous_id,dist)

bind_rows(df1,df2)  
#>   squirrel_id spr_census reflo.x spring_locx spring_locy previous_id dist
#> 1        6130       2005    -310          -3        10.0        5587   NA
#> 2        6391       2005      M3          13         2.5        6077  0.7
#> 3       23586       2019      B9           2         9.0          NA   NA

Created on 2020-03-02 by the reprex package (v0.3.0)




回答3:


The following solves the problem.

Function to calculate distances:

distance_xy = function (x1, y1, x2, y2) {
  sqrt((x2 - x1)^2 + (y2 -y1)^2)
}

Determine the previous id within a distance of 30 meters. Set id equal to NA in case all distances are greater than 30 meters.

library(tidyverse)

previous_id_fn <- function(v, w, years){
   dists <- map2_dbl(lifetime_census$locx, lifetime_census$locy, ~distance_xy(.x, .y, v, w)) 
   df <- data.frame(previous = lifetime_census$previous_id, 
                    dist = dists, 
                    life = lifetime_census$lifespan) %>% 
               filter(life < years)
   id <- df$previous[[which.min(df$dist)]]
   if (min(df$dist, na.rm = TRUE) > 30) { id <- NA }
   return(id)
}

First join data.frame owners with data.frame lifetime_census to obtain a column with previous_id. Then apply the above defined function to each row of the data.frame.

owners %>%
  left_join(., lifetime_census, by = c("reflo.x" = "reflo")) %>%
  select(squirrel_id:spring_locy, previous_id) %>%
  rowwise() %>%
  mutate(previous_id = ifelse(is.na(previous_id), 
                            previous_id_fn(spring_locx, spring_locy, 1000), 
  previous_id))

Edit:

I added an argument years to function previous_id_fn(). In case lifespan > years the function now returns NA.



来源:https://stackoverflow.com/questions/60327842/if-else-if-pick-first-matching-record-within-set-distance-only-after-first-cond

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!