Retain and lag functions in R, as in SAS

Asked 2021-01-05 10:05

I am looking for a function in R, usable with data.tables, similar to the lag1, lag2, and retain functions in SAS.

I know th

4 Answers
  • 2021-01-05 10:26

    You have to be aware that R works very differently from the DATA step in SAS. The lag function in SAS is used in the data step, within that step's implicit loop. The same goes for the retain function, which simply keeps a value constant while looping over the data.

    R, on the other hand, works completely vectorized. This means that you have to rethink what you want to do and adapt accordingly.

    • retain is simply useless in R, as R recycles arguments by default. If you want to do this explicitly, you might look at e.g. rep() to construct a vector with constant values and a certain length.
    • lag is a matter of using indices, and just shifting position of all values in a vector. In order to keep a vector of the same length, you need to add some NA and remove some extra values.

    A simple example: This SAS code lags a variable x and adds a variable year that has a constant value:

    data one;
       retain year 2013;
       input x @@;
       y=lag1(x);
       z=lag2(x);
       datalines;
    1 2 3 4 5 6
    ;
    

    In R, you could write your own lag function like this:

    mylag <- function(x,k) c(rep(NA,k),head(x,-k))
    

    This single line adds k times NA at the beginning of the vector, and drops the last k values from the vector. The result is a lagged vector as given by lag1 etc. in SAS.

    This allows something like:

    nrs <- 1:6 # equivalent to datalines
    one <- data.frame(
       x = nrs,
       y = mylag(nrs,1),
       z = mylag(nrs,2),
       year = 2013  # R automatically loops, so no extra command needed
    )
    

    The result is :

    > one
      x  y  z year
    1 1 NA NA 2013
    2 2  1 NA 2013
    3 3  2  1 2013
    4 4  3  2 2013
    5 5  4  3 2013
    6 6  5  4 2013
    

    Exactly the same would work with a data.table object. The important note here is to rethink your strategy: Instead of thinking loopwise as you do with the DATA step in SAS, you have to start thinking in terms of vectors and indices when using R.
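
    As a side note: newer versions of data.table (1.9.6 and later) ship their own shift() function, which does the same NA padding as mylag, so a hand-rolled helper is not strictly needed when you are working with data.table anyway. A minimal sketch:

```r
library(data.table)

one <- data.table(x = 1:6)
one[, y := shift(x, n = 1)]   # lag by 1, padded with NA at the front
one[, z := shift(x, n = 2)]   # lag by 2
one[, year := 2013]           # constant is recycled, like retain
```

    This produces the same table as the data.frame example above.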

  • 2021-01-05 10:38

    I would say the closest equivalent to retain, lag1, and lag2 would be the Lag function in the quantmod package.

    It's very easy to use with data.tables. E.g.:

    library(data.table)
    library(quantmod)
    d <- data.table(v1=c(rep('a', 10), rep('b', 10)), v2=1:20)
    setkeyv(d, 'v1')
    d[,new_var := Lag(v2, 1), by='v1']
    d[,new_var2 := v2-Lag(v2, 3), by='v1']
    d[,new_var3 := Next(v2, 2), by='v1']
    

    This yields the following:

    print(d)
        v1 v2 new_var new_var2 new_var3
     1:  a  1      NA       NA        3
     2:  a  2       1       NA        4
     3:  a  3       2       NA        5
     4:  a  4       3        3        6
     5:  a  5       4        3        7
     6:  a  6       5        3        8
     7:  a  7       6        3        9
     8:  a  8       7        3       10
     9:  a  9       8        3       NA
    10:  a 10       9        3       NA
    11:  b 11      NA       NA       13
    12:  b 12      11       NA       14
    13:  b 13      12       NA       15
    14:  b 14      13        3       16
    15:  b 15      14        3       17
    16:  b 16      15        3       18
    17:  b 17      16        3       19
    18:  b 18      17        3       20
    19:  b 19      18        3       NA
    20:  b 20      19        3       NA
    

    As you can see, Lag lets you look back and Next lets you look forward. Both functions are nice because they pad the result with NAs such that it has the same length as the input.
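
    If you would rather not pull in the quantmod dependency, data.table's own shift() covers both directions via type = "lag" and type = "lead", and works the same way inside by=. A sketch of the equivalent grouped calls:

```r
library(data.table)

d <- data.table(v1 = c(rep('a', 10), rep('b', 10)), v2 = 1:20)
d[, new_var  := shift(v2, 1, type = "lag"),  by = v1]  # like Lag(v2, 1)
d[, new_var3 := shift(v2, 2, type = "lead"), by = v1]  # like Next(v2, 2)
```
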

    If you want to get even fancier, and higher performance, you can look into rolling joins with data.table objects. This is a little different from what you are asking for, but it is conceptually related, and so powerful and awesome that I have to share.

    Start with a data.table:

    library(data.table)
    library(quantmod)
    set.seed(42)
    d1 <- data.table(
        id=c(rep('a', 10), rep('b', 10)), 
        time=rep(1:10,2), 
        value=runif(20))
    setkeyv(d1, c('id', 'time'))
    print(d1)
    
        id time     value
     1:  a    1 0.9148060
     2:  a    2 0.9370754
     3:  a    3 0.2861395
     4:  a    4 0.8304476
     5:  a    5 0.6417455
     6:  a    6 0.5190959
     7:  a    7 0.7365883
     8:  a    8 0.1346666
     9:  a    9 0.6569923
    10:  a   10 0.7050648
    11:  b    1 0.4577418
    12:  b    2 0.7191123
    13:  b    3 0.9346722
    14:  b    4 0.2554288
    15:  b    5 0.4622928
    16:  b    6 0.9400145
    17:  b    7 0.9782264
    18:  b    8 0.1174874
    19:  b    9 0.4749971
    20:  b   10 0.5603327
    

    You have another data.table you want to join, but not all time indexes are present in the second table:

    d2 <- data.table(
            id=sample(c('a', 'b'), 5, replace=TRUE), 
            time=sample(1:10, 5), 
            value2=runif(5))
    setkeyv(d2, c('id', 'time'))
    print(d2)
       id time      value2
    1:  a    4 0.811055141
    2:  a   10 0.003948339
    3:  b    6 0.737595618
    4:  b    8 0.388108283
    5:  b    9 0.685169729
    

    A regular merge yields lots of missing values:

    d2[d1,,roll=FALSE]
        id time      value2     value
     1:  a    1          NA 0.9148060
     2:  a    2          NA 0.9370754
     3:  a    3          NA 0.2861395
     4:  a    4 0.811055141 0.8304476
     5:  a    5          NA 0.6417455
     6:  a    6          NA 0.5190959
     7:  a    7          NA 0.7365883
     8:  a    8          NA 0.1346666
     9:  a    9          NA 0.6569923
    10:  a   10 0.003948339 0.7050648
    11:  b    1          NA 0.4577418
    12:  b    2          NA 0.7191123
    13:  b    3          NA 0.9346722
    14:  b    4          NA 0.2554288
    15:  b    5          NA 0.4622928
    16:  b    6 0.737595618 0.9400145
    17:  b    7          NA 0.9782264
    18:  b    8 0.388108283 0.1174874
    19:  b    9 0.685169729 0.4749971
    20:  b   10          NA 0.5603327
    

    However, data.table allows you to roll the secondary index forward, WITHIN THE PRIMARY INDEX!

    d2[d1,,roll=TRUE]
        id time      value2     value
     1:  a    1          NA 0.9148060
     2:  a    2          NA 0.9370754
     3:  a    3          NA 0.2861395
     4:  a    4 0.811055141 0.8304476
     5:  a    5 0.811055141 0.6417455
     6:  a    6 0.811055141 0.5190959
     7:  a    7 0.811055141 0.7365883
     8:  a    8 0.811055141 0.1346666
     9:  a    9 0.811055141 0.6569923
    10:  a   10 0.003948339 0.7050648
    11:  b    1          NA 0.4577418
    12:  b    2          NA 0.7191123
    13:  b    3          NA 0.9346722
    14:  b    4          NA 0.2554288
    15:  b    5          NA 0.4622928
    16:  b    6 0.737595618 0.9400145
    17:  b    7 0.737595618 0.9782264
    18:  b    8 0.388108283 0.1174874
    19:  b    9 0.685169729 0.4749971
    20:  b   10 0.685169729 0.5603327
    

    This is pretty damn cool: old observations are rolled forward in time until they are replaced by new ones. If you want to replace the NA values at the beginning of the series, you can do so by rolling the first observation backwards:

    d2[d1,,roll=TRUE, rollends=c(TRUE, TRUE)]
        id time      value2     value
     1:  a    1 0.811055141 0.9148060
     2:  a    2 0.811055141 0.9370754
     3:  a    3 0.811055141 0.2861395
     4:  a    4 0.811055141 0.8304476
     5:  a    5 0.811055141 0.6417455
     6:  a    6 0.811055141 0.5190959
     7:  a    7 0.811055141 0.7365883
     8:  a    8 0.811055141 0.1346666
     9:  a    9 0.811055141 0.6569923
    10:  a   10 0.003948339 0.7050648
    11:  b    1 0.737595618 0.4577418
    12:  b    2 0.737595618 0.7191123
    13:  b    3 0.737595618 0.9346722
    14:  b    4 0.737595618 0.2554288
    15:  b    5 0.737595618 0.4622928
    16:  b    6 0.737595618 0.9400145
    17:  b    7 0.737595618 0.9782264
    18:  b    8 0.388108283 0.1174874
    19:  b    9 0.685169729 0.4749971
    20:  b   10 0.685169729 0.5603327
    

    These rolling joins are absolutely incredible, and I've never seen them implemented in any other open source package (see ?data.table for more info). It will take a little while to turn off your "SAS brain" and turn on your "R brain", but once you get over that initial hump you'll find that the language is much more expressive.
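
    Besides rolling forward and backward, data.table also accepts roll = "nearest", which fills each row with whichever observation is closest in the join column. A small illustrative sketch (the data here is made up for the example):

```r
library(data.table)

d1 <- data.table(id = "a", time = 1:6)
d2 <- data.table(id = "a", time = c(2L, 5L), value2 = c(10, 20))
setkeyv(d1, c("id", "time"))
setkeyv(d2, c("id", "time"))

# each row of d1 picks the d2 value with the nearest time
res <- d2[d1, roll = "nearest"]
```
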

  • 2021-01-05 10:40

    Here's an example: computing a cumulative (running) sum with sqldf:

    library(sqldf)

    # input data reconstructed from the output below
    w_cum <- data.frame(id = 1:10, SomeNumt = 11:20)

    w_cum <- sqldf("select t1.id, t1.SomeNumt, sum(t2.SomeNumt) as cum_sum
                    from w_cum t1
                    inner join w_cum t2 on t1.id >= t2.id
                    group by t1.id, t1.SomeNumt
                    order by t1.id")

    > w_cum
       id SomeNumt cum_sum
    1   1       11      11
    2   2       12      23
    3   3       13      36
    4   4       14      50
    5   5       15      65
    6   6       16      81
    7   7       17      98
    8   8       18     116
    9   9       19     135
    10 10       20     155
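
    For this particular task a self-join is overkill: base R has a vectorized cumsum(), and ave() applies it per group if needed. A minimal equivalent, using the same data:

```r
w_cum <- data.frame(id = 1:10, SomeNumt = 11:20)
w_cum$cum_sum <- cumsum(w_cum$SomeNumt)
```
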

  • 2021-01-05 10:43

    For retain, try this:

    retain <- function(x, event, outside = NA)
    {
      # event positions, bracketed by the start and one past the end of x
      indices <- c(1, which(event), length(x) + 1)
      values  <- c(outside, x[event])
      rep(values, diff(indices))
    }
    
    

    With this data, I want to retain the value downwards wherever w == "b":

    df <- data.frame(w = c("a","b","c","a","b","c"), x = 1:6, y = c(1,1,2,2,2,3), stringsAsFactors = FALSE)
    df$z<-retain(df$x-df$y,df$w=="b")
    df
    

    And here is the opposite, obtain, which does not exist in SAS:

    obtain <- function(x, event, outside = NA)
    {
      indices <- c(0, which(event), length(x))
      values  <- c(x[event], outside)
      rep(values, diff(indices))
    }
    
    

    Here's an example: I want to obtain the value in advance wherever w == "b":

    df$z2<-obtain(df$x-df$y,df$w=="b")
    df
    

    Thanks to Julien for helping.
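
    A closely related idiom is last-observation-carried-forward (LOCF), which is what retain amounts to when the events mark the non-missing values; zoo::na.locf() implements it, and a base-R-only sketch (the helper name locf is my own) looks like this:

```r
# carry the last non-NA value forward; leading NAs stay NA
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # index of last non-NA seen so far
  c(NA, x)[idx + 1]                        # idx 0 maps onto the leading NA
}
```
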
