R how can I calculate difference between rows in a data frame

前端 未结 6 2015
挽巷
挽巷 2020-12-03 01:13

Here is a simple example of my problem:

> df <- data.frame(ID=1:10,Score=4*10:1)
> df
       ID Score
    1   1    40
    2   2    36
    3   3    3         


        
相关标签:
6条回答
  • 2020-12-03 01:35

    Perhaps you are looking for something like this:

    > tail(df, -1) - head(df, -1)
       ID Score
    2   1    -4
    3   1    -4
    4   1    -4
    5   1    -4
    6   1    -4
    7   1    -4
    8   1    -4
    9   1    -4
    10  1    -4
    

    You can subtract or add two data.frames together if they are the same dimensions. So, what we are doing here is subtracting one data.frame that is missing the first row (tail(df, -1)) and one that is missing the last row (head(df, -1)) and subtracting them.

    0 讨论(0)
  • 2020-12-03 01:38

    Another option using dplyr would be using mutate_each to loop through all the columns, get the difference of the column (.) with the lag of the column (.) and remove the NA element at the top with na.omit()

    library(dplyr)
    df %>%
        mutate_each(funs(. - lag(.))) %>%
        na.omit() 
    

    EDIT:

    Instead of mutate_each (deprecated - as mentioned by @PatrickT) use mutate_all

    df %>%
        mutate_all(funs(. - lag(.))) %>%
        na.omit() 
    

    Or with shift from data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), loop through the columns (lapply(.SD, ..)) and get the difference between the column (x) and thelag(shiftby default gives thelagastype = "lag"`). Remove the first observation i.e. NA element.

    library(data.table)
    setDT(df)[, lapply(.SD, function(x) (x- shift(x))[-1])]
    
    0 讨论(0)
  • 2020-12-03 01:40

    Adding this a few years later for completeness- you can use a simple [.data.frame subseting in order to achieve this too

    df[-1, ] - df[-nrow(df), ]
    #    ID Score
    # 2   1    -4
    # 3   1    -4
    # 4   1    -4
    # 5   1    -4
    # 6   1    -4
    # 7   1    -4
    # 8   1    -4
    # 9   1    -4
    # 10  1    -4
    
    0 讨论(0)
  • 2020-12-03 01:42

    Because df works on vector or matrix. You can use apply to apply the function across columns like so:

     apply( df , 2 , diff )
       ID Score
    2   1    -4
    3   1    -4
    4   1    -4
    5   1    -4
    6   1    -4
    7   1    -4
    8   1    -4
    9   1    -4
    10  1    -4
    

    It seems unlikely that you want to calculate the difference in sequential IDs, so you could choose to apply it on all columns except the first like so:

    apply( df[-1] , 2 , diff )
    

    Or you could use data.table (not that it adds anything here I just really want to start using it!), and I am again assuming that you do not want to apply diff to the ID column:

    DT <- data.table(df)
    DT[ , list(ID,Score,Diff=diff(Score))  ]
        ID Score Diff
     1:  1    40   -4
     2:  2    36   -4
     3:  3    32   -4
     4:  4    28   -4
     5:  5    24   -4
     6:  6    20   -4
     7:  7    16   -4
     8:  8    12   -4
     9:  9     8   -4
    10: 10     4   -4
    

    And thanks to @AnandaMahto an alternative syntax that gives more flexibility to choose which columns to run it on could be:

    DT[, lapply(.SD, diff), .SDcols = 1:2]
    

    Here .SDcols = 1:2 means you want to apply the diff function to columns 1 and 2. If you have 20 columns and didn't want to apply it to ID you could use .SDcols=2:20 as an example.

    0 讨论(0)
  • 2020-12-03 01:44

    I would like to show an alternative way for doing such kind of things even often I have the feeling it is not appreciated doing this in that way: using sql.

    sqldf(paste("SELECT a.ID,a.Score"
                ,"      , a.Score - (SELECT b.Score"
                ,"                   FROM df b"
                ,"                   WHERE b.ID < a.ID"
                ,"                   ORDER BY b.ID DESC"
                ,"                   ) diff"
                ," FROM df a"
                )
          )
    

    The code seems complicated but it is not and it has some advantage, as you can see at the results:

        ID Score diff
     1   1    40 <NA>
     2   2    36 -4.0
     3   3    32 -4.0
     4   4    28 -4.0
     5   5    24 -4.0
     6   6    20 -4.0
     7   7    16 -4.0
     8   8    12 -4.0
     9   9     8 -4.0
     10 10     4 -4.0
    

    One advantage is that you use the original dataframe (without converting into other classes) and you get a data frame (put it in res <- ....). Another advantage is that you have still all rows. And the third advantage is that you can easily consider grouping factors. For example:

    df2 <- data.frame(ID=1:10,grp=rep(c("v","w"), each=5),Score=4*10:1)
    
    sqldf(paste("SELECT a.ID,a.grp,a.Score"
                ,"      , a.Score - (SELECT b.Score"
                ,"                   FROM df2 b"
                ,"                   WHERE b.ID < a.ID"
                ,"                         AND a.grp = b.grp"
                ,"                   ORDER BY b.ID DESC"
                ,"                   ) diff"
         ," FROM df2 a"
         )
    )
    
    
       ID grp Score diff
    1   1   v    40 <NA>
    2   2   v    36 -4.0
    3   3   v    32 -4.0
    4   4   v    28 -4.0
    5   5   v    24 -4.0
    6   6   w    20 <NA>
    7   7   w    16 -4.0
    8   8   w    12 -4.0
    9   9   w     8 -4.0
    10 10   w     4 -4.0
    
    0 讨论(0)
  • 2020-12-03 01:45

    diff wants a matrix or a vector rather than a data frame. Try

    data.frame(diff(as.matrix(df)))
    
    0 讨论(0)
提交回复
热议问题