Say I have this data frame:
# Set random seed
set.seed(33550336)
# Number of IDs
n <- 5
# Create data frames
df <- data.frame(ID = rep(1:n, each = 10),
                 loc = seq(10, 100, by = 10))
# ID loc
# 1 1 10
# 2 1 20
# 3 1 30
# 4 1 40
# 5 1 50
# 6 1 60
# 7 1 70
# 8 1 80
# 9 1 90
# 10 1 100
# 11 2 10
# 12 2 20
# 13 2 30
# 14 2 40
# 15 2 50
# 16 2 60
# 17 2 70
# 18 2 80
# 19 2 90
# 20 2 100
# 21 3 10
# 22 3 20
# 23 3 30
# 24 3 40
# 25 3 50
# 26 3 60
# 27 3 70
# 28 3 80
# 29 3 90
# 30 3 100
# 31 4 10
# 32 4 20
# 33 4 30
# 34 4 40
# 35 4 50
# 36 4 60
# 37 4 70
# 38 4 80
# 39 4 90
# 40 4 100
# 41 5 10
# 42 5 20
# 43 5 30
# 44 5 40
# 45 5 50
# 46 5 60
# 47 5 70
# 48 5 80
# 49 5 90
# 50 5 100
Now, I have a second data frame that I'd like to join to it:
df_alt <- data.frame(ID = rep(1:n, each = 10),
                     loc = sample(1:100, 5 * n, replace = TRUE),
                     value = runif(n))
# ID loc value
# 1 1 87 0.3202490
# 2 1 36 0.4724253
# 3 1 53 0.4750352
# 4 1 7 0.8744985
# 5 1 38 0.2016645
# 6 1 92 0.3202490
# 7 1 74 0.4724253
# 8 1 72 0.4750352
# 9 1 73 0.8744985
# 10 1 95 0.2016645
# 11 2 61 0.3202490
# 12 2 5 0.4724253
# 13 2 87 0.4750352
# 14 2 11 0.8744985
# 15 2 10 0.2016645
# 16 2 25 0.3202490
# 17 2 60 0.4724253
# 18 2 62 0.4750352
# 19 2 52 0.8744985
# 20 2 31 0.2016645
# 21 3 3 0.3202490
# 22 3 43 0.4724253
# 23 3 45 0.4750352
# 24 3 91 0.8744985
# 25 3 51 0.2016645
# 26 3 87 0.3202490
# 27 3 36 0.4724253
# 28 3 53 0.4750352
# 29 3 7 0.8744985
# 30 3 38 0.2016645
# 31 4 92 0.3202490
# 32 4 74 0.4724253
# 33 4 72 0.4750352
# 34 4 73 0.8744985
# 35 4 95 0.2016645
# 36 4 61 0.3202490
# 37 4 5 0.4724253
# 38 4 87 0.4750352
# 39 4 11 0.8744985
# 40 4 10 0.2016645
# 41 5 25 0.3202490
# 42 5 60 0.4724253
# 43 5 62 0.4750352
# 44 5 52 0.8744985
# 45 5 31 0.2016645
# 46 5 3 0.3202490
# 47 5 43 0.4724253
# 48 5 45 0.4750352
# 49 5 91 0.8744985
# 50 5 51 0.2016645
I'd like a perfect match for ID and the closest match for loc. I looked at the fuzzyjoin package, but unfortunately you cannot have different levels of fuzziness for different columns. That is, I cannot specify a perfect match for ID and a fuzzy match for loc. So, as a workaround, I do a left join by ID, calculate the distance between loc.x and loc.y (i.e., the loc values from the df and df_alt data frames, respectively), group by ID and loc.x, sort by the distance between the loc values, and take the first row (i.e., the shortest distance):
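# Bind and find nearest
library(dplyr)
df_res <- df %>%
  left_join(df_alt, by = "ID") %>%          # every df row paired with every df_alt row of the same ID
  mutate(delta = abs(loc.x - loc.y)) %>%    # distance between the two loc values
  group_by(ID, loc.x) %>%
  arrange(delta) %>%
  filter(row_number() == 1) %>%             # keep the closest match per (ID, loc.x)
  ungroup() %>%
  arrange(ID, loc.x)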
# # A tibble: 50 x 5
# ID loc.x loc.y value delta
# <int> <dbl> <int> <dbl> <dbl>
# 1 1 10 7 0.874 3
# 2 1 20 7 0.874 13
# 3 1 30 36 0.472 6
# 4 1 40 38 0.202 2
# 5 1 50 53 0.475 3
# 6 1 60 53 0.475 7
# 7 1 70 72 0.475 2
# 8 1 80 74 0.472 6
# 9 1 90 92 0.320 2
# 10 1 100 95 0.202 5
# 11 2 10 10 0.202 0
# 12 2 20 25 0.320 5
# 13 2 30 31 0.202 1
# 14 2 40 31 0.202 9
# 15 2 50 52 0.874 2
# 16 2 60 60 0.472 0
# 17 2 70 62 0.475 8
# 18 2 80 87 0.475 7
# 19 2 90 87 0.475 3
# 20 2 100 87 0.475 13
# 21 3 10 7 0.874 3
# 22 3 20 7 0.874 13
# 23 3 30 36 0.472 6
# 24 3 40 38 0.202 2
# 25 3 50 51 0.202 1
# 26 3 60 53 0.475 7
# 27 3 70 87 0.320 17
# 28 3 80 87 0.320 7
# 29 3 90 91 0.874 1
# 30 3 100 91 0.874 9
# 31 4 10 10 0.202 0
# 32 4 20 11 0.874 9
# 33 4 30 11 0.874 19
# 34 4 40 61 0.320 21
# 35 4 50 61 0.320 11
# 36 4 60 61 0.320 1
# 37 4 70 72 0.475 2
# 38 4 80 74 0.472 6
# 39 4 90 92 0.320 2
# 40 4 100 95 0.202 5
# 41 5 10 3 0.320 7
# 42 5 20 25 0.320 5
# 43 5 30 31 0.202 1
# 44 5 40 43 0.472 3
# 45 5 50 51 0.202 1
# 46 5 60 60 0.472 0
# 47 5 70 62 0.475 8
# 48 5 80 91 0.874 11
# 49 5 90 91 0.874 1
# 50 5 100 91 0.874 9
This isn't particularly efficient, but it gives the desired result. The problem arises when the data frame gets large. Rerunning the above code with a sufficiently large n produces the following error:
Error: cannot allocate vector of size...
I think this is because the left join is producing an unnecessarily huge data frame. Clearly, join-then-filter isn't the best strategy. But what is the best way to do a fuzzy and non-fuzzy join simultaneously?
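A rough way to see how large the intermediate result gets is to count the rows that a join on ID alone would produce, without actually building it. This is only a sketch: the loc values in df_alt below are placeholders (only the per-ID row counts matter here), and rows_x, rows_y and join_rows are names introduced for illustration.

```r
# Same shape as the data frames above: n IDs with 10 rows each.
n <- 5
df <- data.frame(ID = rep(1:n, each = 10),
                 loc = seq(10, 100, by = 10))
# Placeholder loc values; only the number of rows per ID matters for the count.
df_alt <- data.frame(ID = rep(1:n, each = 10),
                     loc = rep(1:10, times = n))

# Number of rows per ID in each data frame.
rows_x <- table(df$ID)
rows_y <- table(df_alt$ID)

# Joining on ID alone pairs every df row with every df_alt row of the same ID,
# so the intermediate size is the sum over IDs of (rows in df) * (rows in df_alt).
# as.numeric() avoids integer overflow when the counts are large.
join_rows <- sum(as.numeric(rows_x) * as.numeric(rows_y[names(rows_x)]))
join_rows
# [1] 500
```

With 10 rows per ID on each side, the joined table is ten times the size of df before any filtering happens, and the factor grows with the number of rows per ID.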
In my opinion the data.table package is best suited for this job:
library(data.table)
setDT(df)
setDT(df_alt)
df_alt[df
, on = .(ID, loc)
, roll = "nearest"
, .(ID, loc.x = i.loc, loc.y = x.loc, value, delta = abs(i.loc - x.loc))]
which gives:
    ID loc.x loc.y     value delta
 1:  1    10     7 0.8744985     3
 2:  1    20     7 0.8744985    13
 3:  1    30    36 0.4724253     6
 4:  1    40    38 0.2016645     2
 5:  1    50    53 0.4750352     3
 6:  1    60    53 0.4750352     7
 7:  1    70    72 0.4750352     2
 8:  1    80    74 0.4724253     6
 9:  1    90    92 0.3202490     2
10:  1   100    95 0.2016645     5
11:  2    10    10 0.2016645     0
12:  2    20    25 0.3202490     5
13:  2    30    31 0.2016645     1
14:  2    40    31 0.2016645     9
15:  2    50    52 0.8744985     2
16:  2    60    60 0.4724253     0
17:  2    70    62 0.4750352     8
18:  2    80    87 0.4750352     7
19:  2    90    87 0.4750352     3
20:  2   100    87 0.4750352    13
21:  3    10     7 0.8744985     3
22:  3    20     7 0.8744985    13
23:  3    30    36 0.4724253     6
24:  3    40    38 0.2016645     2
25:  3    50    51 0.2016645     1
26:  3    60    53 0.4750352     7
27:  3    70    53 0.4750352    17
28:  3    80    87 0.3202490     7
29:  3    90    91 0.8744985     1
30:  3   100    91 0.8744985     9
31:  4    10    10 0.2016645     0
32:  4    20    11 0.8744985     9
33:  4    30    11 0.8744985    19
34:  4    40    61 0.3202490    21
35:  4    50    61 0.3202490    11
36:  4    60    61 0.3202490     1
37:  4    70    72 0.4750352     2
38:  4    80    74 0.4724253     6
39:  4    90    92 0.3202490     2
40:  4   100    95 0.2016645     5
41:  5    10     3 0.3202490     7
42:  5    20    25 0.3202490     5
43:  5    30    31 0.2016645     1
44:  5    40    43 0.4724253     3
45:  5    50    51 0.2016645     1
46:  5    60    60 0.4724253     0
47:  5    70    62 0.4750352     8
48:  5    80    91 0.8744985    11
49:  5    90    91 0.8744985     1
50:  5   100    91 0.8744985     9
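To make the mechanics easier to follow, here is a minimal sketch of the same rolling join on a three-row toy table. The tables lookup and query and their values are made up for illustration. In a join with roll = "nearest", the columns in on are matched exactly except the last one (loc here), which is rolled to the nearest available value within each exact match on ID.

```r
library(data.table)

# Hypothetical toy data: two candidate locs for ID 1, one for ID 2.
lookup <- data.table(ID = c(1, 1, 2),
                     loc = c(7, 36, 95),
                     value = c("a", "b", "c"))
query <- data.table(ID = c(1, 1, 2),
                    loc = c(10, 30, 10))

# ID must match exactly; loc (the last column in `on`) rolls to the nearest value.
# i.loc is the loc from `query`, x.loc is the matched loc from `lookup`.
res <- lookup[query,
              on = .(ID, loc),
              roll = "nearest",
              .(ID, loc.x = i.loc, loc.y = x.loc, value,
                delta = abs(i.loc - x.loc))]
res
# Row 1: ID 1, loc.x 10 -> loc.y  7 ("a"), delta  3
# Row 2: ID 1, loc.x 30 -> loc.y 36 ("b"), delta  6
# Row 3: ID 2, loc.x 10 -> loc.y 95 ("c"), delta 85
```

The third row shows the exact match on ID at work: the query loc of 10 for ID 2 is paired with 95, the only candidate for that ID, even though ID 1 has a much closer loc of 7. Ties are worth checking against your own requirements. In the outputs above, row 27 (ID 3, loc.x 70) resolves to loc.y 87 in the dplyr version and to loc.y 53 in the data.table version, both at a distance of 17.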
Source: https://stackoverflow.com/questions/52974300/simultaneous-fuzzy-and-non-fuzzy-join