Fuzzy Join Using Time and Geo-coordinates in R

Posted by 依然范特西╮ on 2020-04-16 03:22:51

Question


There are two data frames with disparate information. The only columns they have in common are the datetime and lat/long fields. Can one create a third data frame using R or an R package (or possibly Python/Pandas) that takes a subset of rows from both data frames by similar date and lat/long fields? The joins should be fuzzy, not exact: plus or minus an hour and a tenth of a degree.

Input Example:

df_1
Datetime            Latitude    Longitude
2018-10-01 08:27:10 34.8014080  103.8499800
2018-09-30 04:55:51 43.3367432  44.158934
2018-02-28 17:03:27 37.0399910  115.6672080

df_2
Datetime            Latitude    Longitude
2018-10-01 08:57:10 34.8014080  103.8999800
2018-09-30 04:55:51 43.3367432  48.158934
2018-02-27 17:03:27 37.0399910  115.6672080

Output Example:

fuzzy_geo_temporal_join(df_1, df_2, time = 60, lat = 0.1, long = 0.1)
df_3
df_1 Datetime       df_1 Lat    df_1 Long    df_2 Datetime       df_2 Lat    df_2 Long
2018-10-01 08:27:10 34.8014080  103.8499800  2018-10-01 08:57:10 34.8014080  103.8999800

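Note: In this example, the first row of each data frame matches (30 minutes and 0.05 degrees of longitude apart) and is placed into the new data frame. Given the fuzzy parameters, the second rows do not match (their longitudes differ by 4 degrees), and neither do the third rows (their datetimes are a day apart).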


Answer 1:


This sounds like a job for a non-equi join, using data.table!

library( data.table )

sample data

dt1 <- fread( "Datetime,            Latitude,    Longitude
2018-10-01 08:27:10, 34.8014080,  103.8499800
2018-09-30 04:55:51, 43.3367432,  44.158934
2018-02-28 17:03:27, 37.0399910,  115.6672080", header = TRUE)

dt2  <- fread("Datetime,            Latitude,    Longitude
2018-10-01 08:57:10, 34.8014080,  103.8999800
2018-09-30 04:55:51, 43.3367432,  48.158934
2018-02-27 17:03:27, 37.0399910,  115.6672080", header = TRUE)

data-preparation

#set datetimes to POSIXct
dt1[, Datetime := as.POSIXct( Datetime, format = "%Y-%m-%d %H:%M:%S") ]
dt2[, Datetime := as.POSIXct( Datetime, format = "%Y-%m-%d %H:%M:%S") ]
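
The boundaries in the join step use 3600 because adding a number to a POSIXct value shifts it by that many seconds. The snippet below is a minimal base-R sketch of that arithmetic on the first sample rows. The names t1, t2, lower, upper and gap_secs are illustrative only and not part of the answer, and the time zone is pinned to UTC purely to make the example reproducible.

```r
# Illustrative values taken from the first rows of the sample data.
t1 <- as.POSIXct("2018-10-01 08:27:10", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
t2 <- as.POSIXct("2018-10-01 08:57:10", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Numeric arithmetic on POSIXct is in seconds, so 3600 means one hour.
upper <- t2 + 3600
lower <- t2 - 3600

# Ask difftime() for explicit units instead of relying on its automatic choice.
gap_secs <- as.numeric(difftime(t2, t1, units = "secs"))

# TRUE when t1 falls inside the one-hour window around t2.
within_window <- t1 >= lower & t1 <= upper
```

Requesting units = "secs" explicitly avoids surprises, since difftime() otherwise picks minutes, hours or days depending on the size of the gap.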

join

#create boundaries
dt2[, `:=`(Datetime_max = Datetime + 3600,
           Datetime_min = Datetime - 3600,
           Latitude_max = Latitude + 0.1,
           Latitude_min = Latitude - 0.1,
           Longitude_max = Longitude + 0.1,
           Longitude_min = Longitude - 0.1) ]

#perform non-equi join
dt1[ dt2, on = .( Datetime <= Datetime_max, 
                  Datetime >= Datetime_min, 
                  Latitude <= Latitude_max, 
                  Latitude >= Latitude_min, 
                  Longitude <= Longitude_max, 
                  Longitude >= Longitude_min ),
     nomatch = 0L]

result

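#               Datetime Latitude Longitude          Datetime.1 Latitude.1 Longitude.1          i.Datetime i.Latitude i.Longitude
# 1: 2018-10-01 09:57:10 34.90141       104 2018-10-01 07:57:10   34.70141       103.8 2018-10-01 08:57:10   34.80141       103.9
#
# Note: in a non-equi join, data.table fills the join columns with the boundary values from dt2.
# Datetime/Latitude/Longitude hold the *_max bounds, the .1 columns hold the *_min bounds,
# and the i.* columns hold dt2's original values. dt1's original values are not shown here.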
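
The printed result shows dt2's boundary values in the join columns rather than dt1's own datetime and coordinates. One way to get both sides' original values, sketched below, is to name the output columns in j using data.table's x. and i. prefixes. The output column names (dt1_Datetime and so on) and the result name matched are assumptions chosen to resemble the df_3 layout in the question; the sample tables are rebuilt here with data.table() and a fixed UTC time zone so the block runs on its own.

```r
library(data.table)

dt1 <- data.table(
  Datetime  = as.POSIXct(c("2018-10-01 08:27:10", "2018-09-30 04:55:51",
                           "2018-02-28 17:03:27"), tz = "UTC"),
  Latitude  = c(34.8014080, 43.3367432, 37.0399910),
  Longitude = c(103.8499800, 44.158934, 115.6672080)
)
dt2 <- data.table(
  Datetime  = as.POSIXct(c("2018-10-01 08:57:10", "2018-09-30 04:55:51",
                           "2018-02-27 17:03:27"), tz = "UTC"),
  Latitude  = c(34.8014080, 43.3367432, 37.0399910),
  Longitude = c(103.8999800, 48.158934, 115.6672080)
)

# Same +/- 1 hour and +/- 0.1 degree windows as in the answer, built around dt2.
dt2[, `:=`(Datetime_max  = Datetime + 3600, Datetime_min  = Datetime - 3600,
           Latitude_max  = Latitude + 0.1,  Latitude_min  = Latitude - 0.1,
           Longitude_max = Longitude + 0.1, Longitude_min = Longitude - 0.1)]

# x.* refers to dt1's own columns and i.* to dt2's; naming them in j avoids
# the boundary values that the default output places in the join columns.
matched <- dt1[dt2,
               .(dt1_Datetime = x.Datetime, dt1_Latitude = x.Latitude,
                 dt1_Longitude = x.Longitude,
                 dt2_Datetime = i.Datetime, dt2_Latitude = i.Latitude,
                 dt2_Longitude = i.Longitude),
               on = .(Datetime <= Datetime_max, Datetime >= Datetime_min,
                      Latitude <= Latitude_max, Latitude >= Latitude_min,
                      Longitude <= Longitude_max, Longitude >= Longitude_min),
               nomatch = 0L]
```

With the sample data, matched should contain only the pair from 2018-10-01, with each side's original values in its own columns.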



Answer 2:


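This might work with the fuzzyjoin package. Joining on all three columns in a single call keeps each tolerance tied to the same pair of rows; chaining separate joins against df_2 would not.

# install.packages("fuzzyjoin")  # run once if not installed
library(fuzzyjoin)

# Assumes df_1/df_2 from the question, with Datetime already converted to POSIXct.
# within_hour and within_tenth are helper names introduced here for readability.
within_hour  <- function(x, y) abs(as.numeric(difftime(x, y, units = "secs"))) <= 3600
within_tenth <- function(x, y) abs(x - y) <= 0.1
df_3 <- fuzzy_inner_join(df_1, df_2,
                         by = c("Datetime", "Latitude", "Longitude"),
                         match_fun = list(within_hour, within_tenth, within_tenth))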
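
For small inputs, the tolerances from the question can also be cross-checked without any extra package by pairing every row of one data frame with every row of the other and filtering. The base-R sketch below borrows the hypothetical function name fuzzy_geo_temporal_join from the question; its parameters time_secs, lat_tol and lon_tol, and the df_1_/df_2_ column prefixes, are assumptions made for illustration. The number of pairs grows as nrow(a) * nrow(b), so this is only a sanity check and not a replacement for the joins above.

```r
fuzzy_geo_temporal_join <- function(a, b, time_secs = 3600,
                                    lat_tol = 0.1, lon_tol = 0.1) {
  # Prefix column names so both sides stay distinguishable after pairing.
  names(a) <- paste0("df_1_", names(a))
  names(b) <- paste0("df_2_", names(b))
  # merge() with by = NULL returns the Cartesian product of the two inputs.
  pairs <- merge(a, b, by = NULL)
  gap <- abs(as.numeric(difftime(pairs$df_1_Datetime, pairs$df_2_Datetime,
                                 units = "secs")))
  keep <- gap <= time_secs &
    abs(pairs$df_1_Latitude  - pairs$df_2_Latitude)  <= lat_tol &
    abs(pairs$df_1_Longitude - pairs$df_2_Longitude) <= lon_tol
  pairs[keep, , drop = FALSE]
}

# Sample data from the question, with a fixed time zone for reproducibility.
df_1 <- data.frame(
  Datetime  = as.POSIXct(c("2018-10-01 08:27:10", "2018-09-30 04:55:51",
                           "2018-02-28 17:03:27"), tz = "UTC"),
  Latitude  = c(34.8014080, 43.3367432, 37.0399910),
  Longitude = c(103.8499800, 44.158934, 115.6672080)
)
df_2 <- data.frame(
  Datetime  = as.POSIXct(c("2018-10-01 08:57:10", "2018-09-30 04:55:51",
                           "2018-02-27 17:03:27"), tz = "UTC"),
  Latitude  = c(34.8014080, 43.3367432, 37.0399910),
  Longitude = c(103.8999800, 48.158934, 115.6672080)
)

df_3 <- fuzzy_geo_temporal_join(df_1, df_2)
```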


Source: https://stackoverflow.com/questions/52599550/fuzzy-join-using-time-and-geo-coordinates-in-r
