Recursive function in R with ending point varying by group

Question

I want to use a recursive structure in my dplyr chain that iterates over the number of lags used in certain operations. The problem is that I am not sure how to set its stopping point, since it resembles a while loop more than a for loop, which makes me a bit unsure.

Here is some sample data. Groups are not necessarily the same size and are indexed by id:

df <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 3, 4, 5, 5, 5),
  p201  = c(NA, NA, "001", NA, NA, NA, "001", "001", "001", NA, NA),
  V2009 = c(25, 11, 63, 75, 49, 14, 32, 31, 3, 10, 3),
  ager  = c(2.3, 2, 8.1, 12.1, 5.1, 2, 2.9, 2.8, 2, 2, 2),
  V2007 = c(1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1)
)

I want to update p201 according to how similar an observation is to its lags within a given group.

This is how I would do it in a first iteration:

library(dplyr)

new <- df %>%
  group_by(id) %>%
  mutate(
    # keep existing values; otherwise compare the row with its first lag
    p201 = ifelse(!is.na(p201), p201,
                  ifelse(V2007 == lag(V2007, 1) &
                           abs(V2009 - lag(V2009, 1)) <= ager,
                         first(na.omit(p201)), p201)))

My question is how to write a recursive function that fits in a dplyr chain and iterates on i in lag(VAR, i). I want i to grow until either of two things happens: there are no more NAs in p201, or all possible lags have been tried in each group. Regarding the latter, note that the number of rows varies by group.

I thought about two possibilities: setting the maximum value of i to the number of rows of the largest group minus 1, or to the number of rows of each group minus 1. I'm not sure which solution is better, nor do I know how to implement either.

Could somebody help?

Here is the desired output:

# A tibble: 11 x 5
# Groups:   id [5]
      id p201  V2009  ager V2007
   <int> <chr> <dbl> <dbl> <dbl>
 1    1 NA       25  2.3      1
 2    1 NA       11  2        1
 3    1 001      63  8.1      1
 4    1 001      75 12.1      1
 5    2 NA       49  5.1      2
 6    2 NA       14  2        2
 7    3 001      32  2.9      1
 8    4 001      31  2.8      2
 9    5 001       3  2        1
10    5 NA       10  2        1
11    5 001       3  2        1

Answer 1:

I don't think what you are describing is really recursive, in that the calculations don't depend on the results of previous iterations. It is, however, fairly complex, and perhaps the best way to fit it into a dplyr pipeline is to declare a function that takes the necessary variables and returns your answer.

Here is a function that does the trick. It uses the split-lapply-merge paradigm to force the calculations to work properly row-wise. It then uses an sapply to check whether, for each row, the logical conditions are met in any previous row in the group. If so, it sets that row's p201 to the first non-NA p201 value in the group:

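multi_condition <- function(id, v1, v2, v3, v4)
{
  # v1 = p201, v2 = V2007, v3 = V2009, v4 = ager; split the rows by id
  unlist(lapply(split(data.frame(v1, v2, v3, v4), id), function(x) 
  {
    # nothing to copy if the group has no non-NA value
    if(all(is.na(x$v1))) return(x$v1)
    
    # ss[i] is TRUE when row i meets both conditions against earlier rows
    ss <- unlist(c(FALSE, sapply(seq_along(x$v2)[-1], function(i) 
    {
      x$v2[i] %in% x$v2[1:(i - 1)] & any(abs(x$v3[i] - x$v3[1:(i - 1)]) <= x$v4[i])
    })))   
    # fill the matching rows with the first non-NA value of v1
    replace(x$v1, ss, x$v1[!is.na(x$v1)][1])    
  }))
}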

So the function itself is complex, but its use is straightforward:

library(dplyr)

df %>%
  group_by(id) %>%
  mutate(p201 = multi_condition(id, p201, V2007, V2009, ager))
#> # A tibble: 11 x 5
#> # Groups:   id [5]
#>       id p201  V2009  ager V2007
#>    <dbl> <chr> <dbl> <dbl> <dbl>
#>  1     1 <NA>     25   2.3     1
#>  2     1 <NA>     11   2       1
#>  3     1 001      63   8.1     1
#>  4     1 001      75  12.1     1
#>  5     2 <NA>     49   5.1     2
#>  6     2 <NA>     14   2       2
#>  7     3 001      32   2.9     1
#>  8     4 001      31   2.8     2
#>  9     5 001       3   2       1
#> 10     5 <NA>     10   2       1
#> 11     5 001       3   2       1
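
To see why row 4 of group 1 gets filled, the per-row check inside multi_condition can be run by hand. This is only a sketch on the sample values for id == 1; the names v2007, v2009, ager_g, i and prev are made up for the illustration.

```r
# Values of group id == 1 from the sample data
v2007  <- c(1, 1, 1, 1)
v2009  <- c(25, 11, 63, 75)
ager_g <- c(2.3, 2, 8.1, 12.1)

i    <- 4               # row being tested
prev <- seq_len(i - 1)  # all earlier rows in the group

# Same two conditions as in multi_condition()
same_v2007 <- v2007[i] %in% v2007[prev]
close_age  <- any(abs(v2009[i] - v2009[prev]) <= ager_g[i])

abs(v2009[i] - v2009[prev])
#> [1] 50 64 12
same_v2007 & close_age
#> [1] TRUE
```

Note that the two conditions are evaluated separately, so they do not have to be satisfied by the same earlier row.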

If you prefer a more dplyr-style solution using group_map, with the logic perhaps a little clearer, you could try:

multi_select <- function(df, ...) 
{
  rowwise_logic <- function(i) 
  {
    if(i == 1) return(FALSE)
    j <- 1:(i - 1)
    df$V2007[i] %in% df$V2007[j] & 
    any(abs(df$V2009[i] - df$V2009[j]) <= df$ager[i])
  }
  
  matching_rows <- sapply(seq(nrow(df)), rowwise_logic)  
  df$p201[matching_rows] <- first(na.exclude(df$p201))

  return(df)
}

Which would work like this:

df %>% 
  group_by(id) %>%
  group_map(multi_select, .keep = TRUE) %>%
  bind_rows()

Created on 2020-07-15 by the reprex package (v0.3.0)
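
For comparison, the lag-by-lag loop described in the question can also be written directly. The following is a minimal sketch and not part of the answer above: fill_by_lags is a hypothetical helper name, it assumes V2007, V2009 and ager contain no missing values, and it uses each group's own size minus 1 as the largest lag, stopping early once the group has no NAs left. Unlike the functions above, it requires both conditions to hold for the same lagged row.

```r
# Hypothetical helper: try lags k = 1, 2, ... within ONE group.
fill_by_lags <- function(p201, V2007, V2009, ager) {
  n     <- length(p201)
  donor <- p201[!is.na(p201)][1]   # first non-NA value, NA if there is none
  if (n < 2 || is.na(donor)) return(p201)

  for (k in seq_len(n - 1)) {      # largest lag is group size - 1
    if (!anyNA(p201)) break        # stop early: nothing left to fill
    idx <- (k + 1):n               # rows that have a k-th lag
    hit <- is.na(p201[idx]) &
      V2007[idx] == V2007[idx - k] &
      abs(V2009[idx] - V2009[idx - k]) <= ager[idx]
    p201[idx[hit]] <- donor
  }
  p201
}

df <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 3, 4, 5, 5, 5),
  p201  = c(NA, NA, "001", NA, NA, NA, "001", "001", "001", NA, NA),
  V2009 = c(25, 11, 63, 75, 49, 14, 32, 31, 3, 10, 3),
  ager  = c(2.3, 2, 8.1, 12.1, 5.1, 2, 2.9, 2.8, 2, 2, 2),
  V2007 = c(1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Base R split/unsplit by id; inside a dplyr chain the same helper could be
# called as group_by(id) %>% mutate(p201 = fill_by_lags(p201, V2007, V2009, ager))
filled <- unsplit(
  lapply(split(df, df$id),
         function(g) fill_by_lags(g$p201, g$V2007, g$V2009, g$ager)),
  df$id
)
# filled: NA, NA, "001", "001", NA, NA, "001", "001", "001", NA, "001"
```

On the sample data this reproduces the p201 column of the desired output: row 4 is filled at lag 1 and row 11 at lag 2.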

Answer 2:

You can accomplish what you want with two group_by calls: first on (id, V2007) and then, after creating a dummy variable counter, on (id, V2007, counter). The idea behind counter is to subdivide the records within each (id, V2007) group at every row where p201 == "001". See the dummy example below:

id | p201 | V2007 | counter
 1 |   NA |     1 |       0
 1 |  001 |     1 |       1     <= (+1 to counter)
 1 |   NA |     1 |       1
 1 |  001 |     1 |       2     <= (+1 to counter)

After the second group_by this is subdivided into

id | p201 | V2007 | counter
 1 |   NA |     1 |       0   (group 1-A OR 1)
----------------------------
 1 |  001 |     1 |       1   (group 1-B OR 2)
 1 |   NA |     1 |       1
----------------------------
 1 |  001 |     1 |       2   (group 1-C OR 3)

After the second group_by, p201 will 'copy' a non-NA value if the row meets the following 3 conditions:

- p201 is NA
- the row is not the first row of its sub-group: cond1 = row_number() > 1
- V2009 is within ager of the sub-group's first V2009: cond2 = abs(V2009 - first(V2009)) <= ager

Here is the solution:

library(dplyr)
df %>%
    mutate(p201 = as.character(p201)) %>%
    group_by(id, V2007) %>% 
    # cumsum() coerces "001" to 1, so counter goes up by 1 at each non-NA p201
    mutate(counter = cumsum(ifelse(is.na(p201), 0, p201))) %>%
    group_by(id, V2007, counter) %>%
    mutate(cond1 = row_number() > 1) %>%
    mutate(cond2 = abs(V2009 - first(V2009)) <= ager) %>%
    mutate(p201 = ifelse(is.na(p201) & cond1 & cond2, first(p201), p201)) %>%
    ungroup() %>%
    select(-counter, -cond1, -cond2)

# A tibble: 11 x 5
      id p201  V2009  ager V2007
   <dbl> <chr> <dbl> <dbl> <dbl>
 1     1 NA       25   2.3     1
 2     1 NA       11   2       1
 3     1 001      63   8.1     1
 4     1 001      75  12.1     1
 5     2 NA       49   5.1     2
 6     2 NA       14   2       2
 7     3 001      32   2.9     1
 8     4 001      31   2.8     2
 9     5 001       3   2       1
10     5 NA       10   2       1
11     5 001       3   2       1

A more detailed look at the solution: if I drop the last two lines (ungroup() and select()), you can see the new columns that were created:

# A tibble: 11 x 8
# Groups:   id, V2007, counter [6]
      id p201  V2009  ager V2007 counter cond1 cond2
   <dbl> <chr> <dbl> <dbl> <dbl>   <dbl> <lgl> <lgl>
 1     1 NA       25   2.3     1       0 FALSE TRUE
 2     1 NA       11   2       1       0 TRUE  FALSE
 3     1 001      63   8.1     1       1 FALSE TRUE
 4     1 001      75  12.1     1       1 TRUE  TRUE
 5     2 NA       49   5.1     2       0 FALSE TRUE
 6     2 NA       14   2       2       0 TRUE  FALSE
 7     3 001      32   2.9     1       1 FALSE TRUE
 8     4 001      31   2.8     2       1 FALSE TRUE
 9     5 001       3   2       1       1 FALSE TRUE
10     5 NA       10   2       1       1 TRUE  FALSE
11     5 001       3   2       1       1 TRUE  TRUE

Let's first look at counter - created after first grouping on id, V2007

# A tibble: 11 x 8
# Groups:   id, V2007, counter [6]
      id p201  V2009  ager V2007 counter 
   <dbl> <chr> <dbl> <dbl> <dbl>   <dbl> 
 ------------- GROUP 1 -----------------
 1     1 NA       25   2.3     1       0 
 2     1 NA       11   2       1       0 
 3     1 001      63   8.1     1       1   <= (+1 when p201 == '001')
 4     1 NA       75  12.1     1       1  
 ------------- GROUP 2 -----------------
 5     2 NA       49   5.1     2       0  
 6     2 NA       14   2       2       0 
 ------------- GROUP 3 -----------------
 7     3 001      32   2.9     1       1   <= (+1 when p201 == '001')
 -------------- GROUP 4 ----------------
 8     4 001      31   2.8     2       1   <= (+1 when p201 == '001')
 etc

Now let's look at cond1 created after 2nd grouping on id, V2007, counter

# A tibble: 11 x 8
# Groups:   id, V2007, counter [6]
      id p201  V2009  ager V2007 counter cond1 
   <dbl> <chr> <dbl> <dbl> <dbl>   <dbl> <lgl> 
 ---------------- GROUP 1 -------------------- 
 1     1 NA       25   2.3     1       0 FALSE   <= ROW_NUMBER == 1 => FALSE
 2     1 NA       11   2       1       0 TRUE    <= ROW_NUMBER > 1 => TRUE
 ---------------- GROUP 2 --------------------
 3     1 001      63   8.1     1       1 FALSE   <= ROW_NUMBER == 1 => FALSE
 4     1 NA       75  12.1     1       1 TRUE    <= ROW_NUMBER > 1 => TRUE
 ---------------- GROUP 3 -------------------- 
 5     2 NA       49   5.1     2       0 FALSE   <= ROW_NUMBER == 1 => FALSE
 6     2 NA       14   2       2       0 TRUE    <= ROW_NUMBER > 1 => TRUE
 <skip>
 ---------------- GROUP N --------------------
 9     5 001       3   2       1       1 FALSE   <= ROW_NUMBER == 1 => FALSE
10     5 NA       10   2       1       1 TRUE    <= ROW_NUMBER > 1 => TRUE
11     5 NA        3   2       1       1 TRUE    <= ROW_NUMBER > 1 => TRUE

Next, look at cond2 = abs(V2009 - first(V2009)) <= ager

# A tibble: 11 x 8
# Groups:   id, V2007, counter [6]
      id p201  V2009  ager V2007 counter cond1 cond2
   <dbl> <chr> <dbl> <dbl> <dbl>   <dbl> <lgl> <lgl>
 ---------------- GROUP 1 ------------------------      first(V2009) in this group is 25
 1     1 NA       25   2.3     1       0 FALSE TRUE     <= abs(25 - 25) <= 2.3 => TRUE
 2     1 NA       11   2       1       0 TRUE  FALSE    <= abs(11 - 25) <= 2 => FALSE

 ---------------- GROUP 2 ------------------------      first(V2009) in this group is 63
 3     1 001      63   8.1     1       1 FALSE TRUE     <= abs(63 - 63) <= 8.1 => TRUE
 4     1 NA       75  12.1     1       1 TRUE  TRUE     <= abs(75 - 63) <= 12.1 => TRUE

 ---------------- GROUP 3 ------------------------      <= first(V2009) in this group is 49
 5     2 NA       49   5.1     2       0 FALSE TRUE     <= abs(49 - 49) <= 5.1 => TRUE
 6     2 NA       14   2       2       0 TRUE  FALSE    <= abs(14 - 49) <= 2 => FALSE
 <skip>

 ---------------- GROUP N ------------------------      <= first(V2009) in this group is 3
 9     5 001       3   2       1       1 FALSE TRUE     <= abs(3 - 3) <= 2 => TRUE
10     5 NA       10   2       1       1 TRUE  FALSE    <= abs(10 - 3) <= 2 => FALSE
11     5 NA        3   2       1       1 TRUE  TRUE     <= abs(3 - 3) <= 2 => TRUE

Finally, ifelse(is.na(p201) & cond1 & cond2, first(p201), p201). This statement translates to, 'IF p201 IS NA AND COND1 == TRUE AND COND2 == TRUE, THEN ASSIGN P201 = FIRST(P201), ELSE P201 DOES NOT CHANGE'

# A tibble: 11 x 8
# Groups:   id, V2007, counter [6]
      id p201  V2009  ager V2007 counter cond1 cond2
   <dbl> <chr> <dbl> <dbl> <dbl>   <dbl> <lgl> <lgl>
---------------- GROUP 1 ------------------------
 1     1 NA       25   2.3     1       0 FALSE TRUE     <= P201 DOES NOT CHANGE BECAUSE COND1 IS FALSE
 2     1 NA       11   2       1       0 TRUE  FALSE    <= P201 DOES NOT CHANGE BECAUSE COND2 IS FALSE

---------------- GROUP 2 ------------------------       <= FIRST(P201) == '001' FOR THIS GROUP
 3     1 001      63   8.1     1       1 FALSE TRUE     <= P201 DOES NOT CHANGE BECAUSE P201 == '001' AND COND1 IS FALSE
 4     1 001      75  12.1     1       1 TRUE  TRUE     <= P201 = FIRST(P201) BECAUSE ALL 3 CONDITIONS ARE TRUE (P201 WAS ORIGINALLY NA HERE)

---------------- GROUP 3 ------------------------
 5     2 NA       49   5.1     2       0 FALSE TRUE    <= P201 DOES NOT CHANGE BECAUSE COND1 IS FALSE
 6     2 NA       14   2       2       0 TRUE  FALSE   <= P201 DOES NOT CHANGE BECAUSE COND2 IS FALSE
 <skip>

---------------- GROUP N ------------------------
 9     5 001       3   2       1       1 FALSE TRUE    <= P201 DOES NOT CHANGE BECAUSE COND1 IS FALSE
10     5 NA       10   2       1       1 TRUE  FALSE   <= P201 DOES NOT CHANGE BECAUSE COND2 IS FALSE
11     5 001       3   2       1       1 TRUE  TRUE    <= P201 = FIRST(P201) BECAUSE ALL 3 CONDITIONS ARE TRUE (P201 WAS ORIGINALLY NA HERE)

Hopefully that helps.

I added

mutate(p201 = as.character(p201))

because otherwise p201 ends up converted into an integer.
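
One way this kind of conversion can arise, assuming p201 is stored as a factor (the default for strings in data.frame() before R 4.0), is that ifelse() drops the factor attributes and returns the underlying integer codes. A small sketch with a made-up vector f:

```r
# Made-up factor standing in for a p201 column stored as a factor
f <- factor(c("001", NA, "001"))

as_codes <- ifelse(is.na(f), NA, f)                # factor labels are lost
as_text  <- ifelse(is.na(f), NA, as.character(f))  # labels are kept

is.integer(as_codes)
#> [1] TRUE
as_text
#> [1] "001" NA    "001"
```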

Source: https://stackoverflow.com/questions/62869730/recursive-function-in-r-with-ending-point-varying-by-group

Tags

function

recursion