How to mutate for loop in dplyr

Question

I want to create multiple lag variables for a column in a data frame, one for each value in a range of lag sizes. The code below does what I want, but it does not scale to what I need (hundreds of iterations).

Lake_Lag <- Lake_Champlain_long.term_monitoring_1992_2016 %>%
   group_by(StationID, Test) %>%
   arrange(StationID, Test, VisitDate) %>%
   mutate(lag.Result1 = dplyr::lag(Result, n = 1, default = NA)) %>%
   mutate(lag.Result5 = dplyr::lag(Result, n = 5, default = NA)) %>%
   mutate(lag.Result10 = dplyr::lag(Result, n = 10, default = NA)) %>%
   mutate(lag.Result15 = dplyr::lag(Result, n = 15, default = NA)) %>%
   mutate(lag.Result20 = dplyr::lag(Result, n = 20, default = NA))

I would like to be able to use a vector such as c(1, 5, 10, 15, 20) or a range such as 1:150 to create lagging variables for my data frame.

Answer 1:

Here's an approach that makes use of some 'tidy eval helpers' included in dplyr that come from the rlang package.

The basic idea is to create a new column in mutate() whose name is based on a string supplied by a for-loop.

library(dplyr)

grouped_data <- Lake_Champlain_long.term_monitoring_1992_2016 %>% 
  group_by(StationID,Test) %>% 
  arrange(StationID,Test,VisitDate)

for (lag_size in c(1, 5, 10, 15, 20)) {

  new_col_name <- paste0("lag_result_", lag_size)

  grouped_data <- grouped_data %>% 
    mutate(!!sym(new_col_name) := lag(Result, n = lag_size, default = NA))
}

The !!sym(new_col_name) := is a dynamic way of writing lag_result_1 =, lag_result_5 =, etc. when using functions like mutate() or summarize() from the dplyr package.
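To see the loop in action without the original data, here is a minimal sketch on a small made-up table. The `toy` data frame, its values, and the lag sizes 1 and 2 are assumptions chosen only so the result is easy to check by eye; the column names mirror the ones used above.

```r
library(dplyr)

# Hypothetical toy data standing in for the lake monitoring table
toy <- tibble(
  StationID = c(1, 1, 1, 2, 2, 2),
  VisitDate = as.Date("2016-01-01") + c(0, 1, 2, 0, 1, 2),
  Result    = c(10, 20, 30, 40, 50, 60)
) %>%
  group_by(StationID) %>%
  arrange(StationID, VisitDate)

# Add one lagged column per lag size; the column name is built as a string
for (lag_size in c(1, 2)) {
  new_col_name <- paste0("lag_result_", lag_size)
  toy <- toy %>%
    mutate(!!sym(new_col_name) := lag(Result, n = lag_size))
}

toy$lag_result_1  # NA 10 20 NA 40 50
toy$lag_result_2  # NA NA 10 NA NA 40
```

Because the data are grouped by StationID, each lag restarts at NA at the first row of every station rather than carrying values over from the previous one.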

Answer 2:

We can use shift from data.table, which can take multiple values for n. According to ?shift:

n - Non-negative integer vector denoting the offset to lead or lag the input by. To create multiple lead/lag vectors, provide multiple values to n

Convert the 'data.frame' to a 'data.table' (setDT), order by 'StationID', 'Test', 'VisitDate' in i, group by 'StationID' and 'Test', get the lag (the default type of shift is "lag") of 'Result' with n as a vector of values, and assign (:=) the output to a vector of column names (created with paste0).

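library(data.table)
# Lag sizes to create, one new column per value
i1 <- c(1, 5, 10, 15, 20)
# Order the rows in i, then assign the lagged columns by reference within each group
setDT(Lake_Champlain_long.term_monitoring_1992_2016)[order(StationID,
    Test, VisitDate), paste0("lag.Result", i1) := shift(Result, n = i1),
        by = .(StationID, Test)][]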

NOTE: This shows a much more efficient solution.
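As a quick check of how shift behaves with a vector for n, here is a minimal sketch on a small made-up table. The `toy` table, its values, and the `lags` vector are assumptions for illustration only; the rows are already in order, so the ordering step in i is left out.

```r
library(data.table)

# Hypothetical toy data standing in for the lake monitoring table
toy <- data.table(
  StationID = c(1, 1, 1, 2, 2, 2),
  Result    = c(10, 20, 30, 40, 50, 60)
)

lags <- c(1, 2)

# shift() returns a list with one lagged vector per value of n,
# and := assigns each list element to the matching column name
toy[, paste0("lag.Result", lags) := shift(Result, n = lags), by = StationID]

toy$lag.Result1  # NA 10 20 NA 40 50
toy$lag.Result2  # NA NA 10 NA NA 40
```

All lagged columns are created in a single call and added by reference, so the table is not copied once per lag size.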

Source: https://stackoverflow.com/questions/55940655/how-to-mutate-for-loop-in-dplyr

Tags

dplyr