问题
I wish to use a recursive structure in my dplyr change that iterates on the number of lags used on certain operations. The thing is that I am not sure how to set its ending point since it resembles more a while
than a for
loop, which makes me a bit insecure.
Here is some sample data. Groups are not necessarily the same size and are indexed by id
df <- data.frame(id = c(1, 1, 1, 1, 2,
2, 3, 4, 5, 5, 5),
p201 = c(NA, NA, "001", NA, NA, NA, "001", "001", "001", NA, NA),
V2009 = c(25, 11, 63, 75, 49, 14, 32, 31, 3, 10, 3),
ager = c(2.3, 2, 8.1, 12.1, 5.1, 2, 2.9, 2.8, 2,
2, 2),
V2007 = c(1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1)
)
I wish to update p201
according to how similar to its lags is an observation in a given group.
This is how I would do in a first iteration:
new <- df %>%
group_by(id) %>%
mutate(
p201 = ifelse(!is.na(p201), p201,
ifelse(
V2007 == lag(V2007, 1) &
abs(V2009 - lag(V2009, 1)) <= ager,
first(na.omit(p201)), p201)))
My question is how could I write a recursive function that fits in a dplyr
chain that iterates on i
in lag(VAR, i)
- I want i
to grow until either thing happens: there are no more NA
s in p201
and all possible lags are tried in each group. Regarding the latter, it should be said that the number of rows in each group varies.
I thought about two possibilities: making the maximum value of i
be the number of rows of the largest group - 1 or it being the number of rows of each group - 1. I'm not sure what solution is the optimal one, nor do I know how to implement this.
Could somebody help?
Here is the desired output:
# A tibble: 11 x 5
# Groups: id [5]
id p201 V2009 ager V2007
<int> <chr> <dbl> <dbl> <dbl>
1 1 NA 25 2.3 1
2 1 NA 11 2 1
3 1 001 63 8.1 1
4 1 001 75 12.2 1
5 2 NA 49 5.1 2
6 2 NA 14 2 2
7 3 001 32 2.9 1
8 4 001 31 2.8 2
9 5 001 3 2 1
10 5 NA 10 2 1
11 5 001 3 2 1
回答1:
I don't think what you are describing is really recursive, in that the calculations don't depend on the results of previous iterations. It is, however, fairly complex, and perhaps the best way to fit it into a dplyr
pipeline is to declare a function that takes the necessary variables and returns your answer.
Here is a function that does the trick. It uses the split-lapply-merge paradigm to force the calculations to work properly row-wise. It then uses an sapply to check whether, for each row, the logical conditions are met in any previous row in the group. If so, it overwrites an NA
in that rows p201
value with a non-NA value:
multi_condition <- function(id, v1, v2, v3, v4)
{
unlist(lapply(split(data.frame(v1, v2, v3, v4), id), function(x)
{
if(all(is.na(x$v1))) return(x$v1)
ss <- unlist(c(FALSE, sapply(seq_along(x$v2)[-1], function(i)
{
x$v2[i] %in% x$v2[1:(i - 1)] & any(abs(x$v3[i] - x$v3[1:(i - 1)]) <= x$v4[i])
})))
replace(x$v1, ss, x$v1[!is.na(x$v1)][1])
}))
}
So the function itself is complex, but its use is straightforward:
library(dplyr)
df %>%
group_by(id) %>%
mutate(p201 = multi_condition(id, p201, V2007, V2009, ager))
#> # A tibble: 11 x 5
#> # Groups: id [5]
#> id p201 V2009 ager V2007
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 <NA> 25 2.3 1
#> 2 1 <NA> 11 2 1
#> 3 1 001 63 8.1 1
#> 4 1 001 75 12.1 1
#> 5 2 <NA> 49 5.1 2
#> 6 2 <NA> 14 2 2
#> 7 3 001 32 2.9 1
#> 8 4 001 31 2.8 2
#> 9 5 001 3 2 1
#> 10 5 <NA> 10 2 1
#> 11 5 001 3 2 1
If you prefer a more dplyr
- type solution using group_map
, with the logic perhaps a little clearer, you could try:
multi_select <- function(df, ...)
{
rowwise_logic <- function(i)
{
if(i == 1) return(FALSE)
j <- 1:(i - 1)
df$V2007[i] %in% df$V2007[j] &
any(abs(df$V2009[i] - df$V2009[j]) <= df$ager[i])
}
matching_rows <- sapply(seq(nrow(df)), rowwise_logic)
df$p201[matching_rows] <- first(na.exclude(df$p201))
return(df)
}
Which would work like this:
df %>%
group_by(id) %>%
group_map(multi_select, .keep = TRUE) %>%
bind_rows()
Created on 2020-07-15 by the reprex package (v0.3.0)
回答2:
You can accomplish what you want with 2 group_bys
, first on (id, V2007)
and then after creating a dummy variable, counter
, on (id, V2007, counter)
. The idea behind counter
is to sub-divide records in id, V2007
when p201 == 001
. See the dummy example below
id | p201 | V2007 | counter
1 | NA | 1 | 0
1 | 001 | 1 | 1 <= (+1 to counter)
1 | NA | 1 | 1
1 | 001 | 1 | 2 <= (+1 to counter)
After the second group_by
this is subdivided into
id | p201 | V2007 | counter
1 | NA | 1 | 0 (group 1-A OR 1)
----------------------------
1 | 001 | 1 | 1 (group 1-B OR 2)
1 | NA | 1 | 1
----------------------------
1 | 001 | 1 | 2 (group 1-C OR 3)
After the second group_by
, p201
will 'copy' a non-NA value if the row matches 3 following conditions
p201
ISNA
- IS NOT FIRST ROW OF SUB-GROUP
cond1 = row_number() > 1
- ABS(
V2009
- FIRST(V2009
)) <=AGER
cond2 = abs(V2009 - first(V2009)) <= ager
See solution
library(dplyr)
df %>%
mutate(p201 = as.character(p201)) %>%
group_by(id, V2007) %>%
mutate(counter = cumsum(ifelse(is.na(p201), 0, p201))) %>%
group_by(id, V2007, counter) %>%
mutate(cond1 = row_number() > 1) %>%
mutate(cond2 = abs(V2009 - first(V2009)) <= ager) %>%
mutate(p201 = ifelse(is.na(p201) & cond1 & cond2, first(p201), p201)) %>%
ungroup() %>%
select(-counter, -cond1, -cond2)
# A tibble: 11 x 5
id p201 V2009 ager V2007
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 NA 25 2.3 1
2 1 NA 11 2 1
3 1 001 63 8.1 1
4 1 001 75 12.1 1
5 2 NA 49 5.1 2
6 2 NA 14 2 2
7 3 001 32 2.9 1
8 4 001 31 2.8 2
9 5 001 3 2 1
10 5 NA 10 2 1
11 5 001 3 2 1
A more detailed look into the solution - if I exclude the last 2 lines, you can see the new columns that were created
# A tibble: 11 x 8
# Groups: id, V2007, counter [6]
id p201 V2009 ager V2007 counter cond1 cond2
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
1 1 NA 25 2.3 1 0 FALSE TRUE
2 1 NA 11 2 1 0 TRUE FALSE
3 1 001 63 8.1 1 1 FALSE TRUE
4 1 001 75 12.1 1 1 TRUE TRUE
5 2 NA 49 5.1 2 0 FALSE TRUE
6 2 NA 14 2 2 0 TRUE FALSE
7 3 001 32 2.9 1 1 FALSE TRUE
8 4 001 31 2.8 2 1 FALSE TRUE
9 5 001 3 2 1 1 FALSE TRUE
10 5 NA 10 2 1 1 TRUE FALSE
11 5 001 3 2 1 1 TRUE TRUE
Let's first look at counter
- created after first grouping on id, V2007
# A tibble: 11 x 8
# Groups: id, V2007, counter [6]
id p201 V2009 ager V2007 counter
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
------------- GROUP 1 -----------------
1 1 NA 25 2.3 1 0
2 1 NA 11 2 1 0
3 1 001 63 8.1 1 1 <= (+1 when p201 == '001')
4 1 NA 75 12.1 1 1
------------- GROUP 2 -----------------
5 2 NA 49 5.1 2 0
6 2 NA 14 2 2 0
------------- GROUP 3 -----------------
7 3 001 32 2.9 1 1 <= (+1 when p201 == '001')
-------------- GROUP 4 ----------------
8 4 001 31 2.8 2 1 <= (+1 when p201 == '001')
etc
Now let's look at cond1
created after 2nd grouping on id, V2007, counter
# A tibble: 11 x 8
# Groups: id, V2007, counter [6]
id p201 V2009 ager V2007 counter cond1
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
---------------- GROUP 1 --------------------
1 1 NA 25 2.3 1 0 FALSE <= ROW_NUMBER == 1 => FALSE
2 1 NA 11 2 1 0 TRUE <= ROW_NUMBER > 1 => TRUE
---------------- GROUP 2 --------------------
3 1 001 63 8.1 1 1 FALSE <= ROW_NUMBER == 1 => FALSE
4 1 NA 75 12.1 1 1 TRUE <= ROW_NUMBER > 1 => TRUE
---------------- GROUP 3 --------------------
5 2 NA 49 5.1 2 0 FALSE <= ROW_NUMBER == 1 => FALSE
6 2 NA 14 2 2 0 TRUE <= ROW_NUMBER > 1 => TRUE
<skip>
---------------- GROUP N --------------------
9 5 001 3 2 1 1 FALSE <= ROW_NUMBER == 1 => FALSE
10 5 NA 10 2 1 1 TRUE <= ROW_NUMBER > 1 => TRUE
11 5 NA 3 2 1 1 TRUE <= ROW_NUMBER > 1 => TRUE
Finally look at cond2
- abs(V2009 - first(V2009)) <= ager
# A tibble: 11 x 8
# Groups: id, V2007, counter [6]
id p201 V2009 ager V2007 counter cond1 cond2
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
---------------- GROUP 1 ------------------------ first(V2009) in this group is 25
1 1 NA 25 2.3 1 0 FALSE TRUE <= abs(25 - 25) <= 2.3 => TRUE
2 1 NA 11 2 1 0 TRUE FALSE <= abs(11 - 25) <= 2 => FALSE
---------------- GROUP 2 ------------------------ first(V2009) in this group is 63
3 1 001 63 8.1 1 1 FALSE TRUE <= abs(63 - 63) <= 8.1 => TRUE
4 1 NA 75 12.1 1 1 TRUE TRUE <= abs(75 - 63) <= 12.1 => TRUE
---------------- GROUP 3 ------------------------ <= first(V2009) in this group is 49
5 2 NA 49 5.1 2 0 FALSE TRUE <= abs(49 - 49) <= 5.1 => TRUE
6 2 NA 14 2 2 0 TRUE FALSE <= abs(14 - 49) <= 2 => FALSE
<skip>
---------------- GROUP N ------------------------ <= first(V2009) in this group is 3
9 5 001 3 2 1 1 FALSE TRUE <= abs(3 - 3) <= 2 => TRUE
10 5 NA 10 2 1 1 TRUE FALSE <= abs(10 - 3) <= 2 => FALSE
11 5 NA 3 2 1 1 TRUE TRUE <= abs(3 - 3) <= 2 => TRUE
Finally, ifelse(is.na(p201) & cond1 & cond2, first(p201), p201)
. This statement translates to, 'IF p201 IS NA AND COND1 == TRUE AND COND2 == TRUE, THEN ASSIGN P201 = FIRST(P201), ELSE P201 DOES NOT CHANGE'
# A tibble: 11 x 8
# Groups: id, V2007, counter [6]
id p201 V2009 ager V2007 counter cond1 cond2
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
---------------- GROUP 1 ------------------------
1 1 NA 25 2.3 1 0 FALSE TRUE <= P201 DOES NOT CHANGE BECAUSE COND1 IS FALSE
2 1 NA 11 2 1 0 TRUE FALSE <= P201 DOES NOT CHANGE BECAUSE COND2 IS FALSE
---------------- GROUP 2 ------------------------ <= FIRST(P201) == '001' FOR THIS GROUP
3 1 001 63 8.1 1 1 FALSE TRUE <= P201 DOES NOT CHANGE BECAUSE P201 == '001' AND COND1 IS FALSE
4 1 001 75 12.1 1 1 TRUE TRUE <= P201 = FIRST(P201) BECAUSE ALL 3 CONDITIONS ARE TRUE (P201 WAS ORIGINALLY NA HERE)
---------------- GROUP 3 ------------------------
5 2 NA 49 5.1 2 0 FALSE TRUE <= P201 DOES NOT CHANGE BECAUSE COND1 IS FALSE
6 2 NA 14 2 2 0 TRUE FALSE <= P201 DOES NOT CHANGE BECAUSE COND2 IS FALSE
<skip>
---------------- GROUP N ------------------------
9 5 001 3 2 1 1 FALSE TRUE <= P201 DOES NOT CHANGE BECAUSE COND1 IS FALSE
10 5 NA 10 2 1 1 TRUE FALSE <= P201 DOES NOT CHANGE BECAUSE COND2 IS FALSE
11 5 001 3 2 1 1 TRUE TRUE <= P201 = FIRST(P201) BECAUSE ALL 3 CONDITIONS ARE TRUE (P201 WAS ORIGINALLY NA HERE)
Hopefully that helps.
I added
mutate(p201 = as.character(p201))
because it converts p201
into an integer otherwise.
来源:https://stackoverflow.com/questions/62869730/recursive-function-in-r-with-ending-point-varying-by-group