Lets say I have this dataset:
df1 = data.frame(groupID = c(rep("a", 6), rep("b", 6), rep("c", 6)),
testid = c(111, 222, 333, 444, 555, 666, 777, 888, 999, 1010, 1111, 1212, 1313, 1414, 1515, 1616, 1717, 1818))
df1
groupID testid
1 a 111
2 a 222
3 a 333
4 a 444
5 a 555
6 a 666
7 b 777
8 b 888
9 b 999
10 b 1010
11 b 1111
12 b 1212
13 c 1313
14 c 1414
15 c 1515
16 c 1616
17 c 1717
18 c 1818
And I have this 2nd dataset:
df2 = data.frame(groupID = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c"),
testid = c(222, 333, 555, 666, 777, 999, 1010, 1313, 1616, 1818),
bd = c(1, 1, 2, 2, 0, 1, 1, 1, 1, 2))
df2
groupID testid bd
1 a 222 1
2 a 333 1
3 a 555 2
4 a 666 2
5 b 777 0
6 b 999 1
7 b 1010 1
8 c 1313 1
9 c 1616 1
10 c 1818 2
I want to use the intervals in the 2nd dataset to fill in a new variable in the 1st dataset and autofill in values that have two occurances of a bd
and NAs
everywhere else by group.
Desired output:
groupID testid new_bd
1 a 111 NA
2 a 222 1
3 a 333 1
4 a 444 NA
5 a 555 2
6 a 666 2
7 b 777 0
8 b 888 NA
9 b 999 1
10 b 1010 1
11 b 1111 NA
12 b 1212 NA
13 c 1313 1
14 c 1414 1
15 c 1515 1
16 c 1616 1
17 c 1717 NA
18 c 1818 2
Ideally would like dplyr
/tidyr
solution but open to any approaches.
similar but these fill all values: R: Filling timeseries values but only within last 12 months
I would start by modifying df2 to start and end of range. And you can loop or do anything else after.
grps <- df2 %>% group_by(groupID, bd) %>% summarize(start = min(testid), end = max(testid))
grps
groupID bd start end
<fct> <dbl> <dbl> <dbl>
1 a 1 222 333
2 a 2 555 666
3 b 0 777 777
4 b 1 999 1010
5 c 1 1313 1616
6 c 2 1818 1818
df1$bd <- NA
for(i in 1:nrow(grps)){
df1$bd[which(df1$test >= grps$start[i] & df1$test <= grps$end[i])] = grps$bd[i]
}
df1
groupID testid bd
1 a 111 NA
2 a 222 1
3 a 333 1
4 a 444 NA
5 a 555 2
6 a 666 2
7 b 777 0
8 b 888 NA
9 b 999 1
10 b 1010 1
11 b 1111 NA
12 b 1212 NA
13 c 1313 1
14 c 1414 1
15 c 1515 1
16 c 1616 1
17 c 1717 NA
18 c 1818 2
Maybe I have overlooked a simpler method but here is what I came up with using dplyr
, we first create a left_join
between df1
and df2
and fill
bd
column. We then group_by
group_ID
and bd
and get first and last index of non-NA value in each group and replace values to NA
which are less than minimum index and greater than maximum index.
library(dplyr)
left_join(df1, df2, by = c("groupID", "testid")) %>%
mutate(bd1 = bd) %>%
tidyr::fill(bd) %>%
group_by(groupID, bd) %>%
mutate(minRow = if (all(is.na(bd))) 1 else first(which(!is.na(bd1))),
maxRow = if (all(is.na(bd))) n() else last(which(!is.na(bd1))),
new_bd = replace(bd, is.na(bd1) & (row_number() < minRow |
row_number() > maxRow), NA)) %>%
ungroup() %>%
select(names(df1), new_bd)
# groupID testid new_bd
# <fct> <dbl> <dbl>
# 1 a 111 NA
# 2 a 222 1
# 3 a 333 1
# 4 a 444 NA
# 5 a 555 2
# 6 a 666 2
# 7 b 777 0
# 8 b 888 NA
# 9 b 999 1
#10 b 1010 1
#11 b 1111 NA
#12 b 1212 NA
#13 c 1313 1
#14 c 1414 1
#15 c 1515 1
#16 c 1616 1
#17 c 1717 NA
#18 c 1818 2
Here is a solution that works on my test data example above but wont run on my large dataset where I run into the problem of Error: cannot allocate vector of size 45.5 Gb
. I believe it is related to the problem outlined here:"The same size explosion can happen if you have lots of the same level in both with otherwise different rows". In my actual dataset I'm looking at date variables, I didn't think this would effect the problem but maybe it does. I'm not sure if there is a work using fuzzyjoin
as it works on a subset of the data.
library(tidyverse)
library(fuzzyjoin)
library(tidylog)
grps <- df2 %>% group_by(groupID, bd) %>% summarize(start = min(testid), end = max(testid))
grps
df1 %>%
fuzzy_left_join(grps,
by = c("groupID" = "groupID",
"testid" = "start",
"testid" = "end"),
match_fun = list(`==`, `>=`, `<=`)) %>%
select(groupID = groupID.x, testid, bd, start, end)
select: dropped 2 variables (groupID.x, groupID.y)
groupID testid bd start end
1 a 111 NA NA NA
2 a 222 1 222 333
3 a 333 1 222 333
4 a 444 NA NA NA
5 a 555 2 555 666
6 a 666 2 555 666
7 b 777 0 777 777
8 b 888 NA NA NA
9 b 999 1 999 1010
10 b 1010 1 999 1010
11 b 1111 NA NA NA
12 b 1212 NA NA NA
13 c 1313 1 1313 1616
14 c 1414 1 1313 1616
15 c 1515 1 1313 1616
16 c 1616 1 1313 1616
17 c 1717 NA NA NA
18 c 1818 2 1818 1818
data.table
solution:
library(data.table)
> new <- setDT(grps)[setDT(df1),
+ .(groupID, testid, x.start, x.end, x.bd),
+ on = .(groupID, start <= testid, end >= testid)]
> new
groupID testid x.start x.end x.bd
1: a 111 NA NA NA
2: a 222 222 333 1
3: a 333 222 333 1
4: a 444 NA NA NA
5: a 555 555 666 2
6: a 666 555 666 2
7: b 777 777 777 0
8: b 888 NA NA NA
9: b 999 999 1010 1
10: b 1010 999 1010 1
11: b 1111 NA NA NA
12: b 1212 NA NA NA
13: c 1313 1313 1616 1
14: c 1414 1313 1616 1
15: c 1515 1313 1616 1
16: c 1616 1313 1616 1
17: c 1717 NA NA NA
18: c 1818 1818 1818 2
I think it may be done in fuzzyjoin
using internal_join
but I'm not sure?: https://github.com/dgrtwo/fuzzyjoin/issues/50
来源:https://stackoverflow.com/questions/57904212/r-fill-new-column-based-on-interval-from-another-dataset-lookup