Take difference between first and last observations in a row, where each row is different

Question

I have data that looks like the following:

  Region X2012 X2013 X2014 X2015 X2016 X2017
1      1    10    11    12    13    14    15
2      2    NA    17    14    NA    23    NA
3      3    12    18    18    NA    23    NA
4      4    NA    NA    15    28    NA    38
5      5    14  18.5    16    27    25    39
6      6    15    NA    17  27.5    NA    39

The actual values don't matter. What I am trying to do is take the difference between the earliest and latest observed values in each row and store it in a new column, like this:

Region              Diff
     1     (15 - 10) = 5
     2     (23 - 17) = 6

and so on, showing only the final result rather than the subtraction itself. Ideally I would just subtract the 2012 column from the 2017 column, but since a row's first observation can start at any column and its last can end at any column, I am unsure how to take the difference.

A dplyr solution would be ideal but any solution at all is appreciated.

Answer 1:

Define a function that omits the NAs from its vector argument and returns the last remaining element minus the first, then apply it to each row.

lastMinusFirst <- function(x, y = na.omit(x)) tail(y, 1) - y[1]
transform(DF, diff = apply(DF[-1], 1, lastMinusFirst))

giving:

  Region X2012 X2013 X2014 X2015 X2016 X2017 diff
1      1    10  11.0    12  13.0    14    15    5
2      2    NA  17.0    14    NA    23    NA    6
3      3    12  18.0    18    NA    23    NA   11
4      4    NA    NA    15  28.0    NA    38   23
5      5    14  18.5    16  27.0    25    39   25
6      6    15    NA    17  27.5    NA    39   24
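One edge case the function above does not cover is a row with no observations at all: `tail(y, 1)` is then zero-length, so `apply` no longer gets one number per row. The following is a minimal sketch of one way to guard against that. The helper name `lastMinusFirstSafe` and the extra all-NA row are assumptions added for illustration and are not part of the original answer.

```r
# Rebuild the example data from the question.
DF <- data.frame(
  Region = 1:6,
  X2012 = c(10, NA, 12, NA, 14, 15),
  X2013 = c(11, 17, 18, NA, 18.5, NA),
  X2014 = c(12, 14, 18, 15, 16, 17),
  X2015 = c(13, NA, NA, 28, 27, 27.5),
  X2016 = c(14, 23, 23, NA, 25, NA),
  X2017 = c(15, NA, NA, 38, 39, 39)
)

# Hypothetical helper: last non-NA value minus first non-NA value,
# returning NA when the row has no observed values.
lastMinusFirstSafe <- function(x) {
  y <- x[!is.na(x)]
  if (length(y) == 0) return(NA_real_)
  y[length(y)] - y[1]
}

# Append an all-NA row (hypothetical Region 7) to exercise the edge case.
DF2 <- DF
DF2[nrow(DF2) + 1, ] <- c(7, rep(NA, 6))

# Apply the helper to every row, skipping the Region column.
DF2$diff <- apply(DF2[-1], 1, lastMinusFirstSafe)
DF2
```

Rows 1 to 6 get the same differences as above, and the added row gets `NA` instead of causing a length mismatch.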

Note

The input in reproducible form:

Lines <- "Region X2012 X2013 X2014 X2015 X2016 X2017
1      1    10    11    12    13    14    15
2      2    NA    17    14    NA    23    NA
3      3    12    18    18    NA    23    NA
4      4    NA    NA    15    28    NA    38
5      5    14  18.5    16    27    25    39
6      6    15    NA    17  27.5    NA    39"
DF <- read.table(text = Lines)

Update

Fixed.

Answer 2:

A tidyverse answer.

This answer modifies G. Grothendieck's function and uses Jenny Bryan's pmap method for row-wise calculations from the purrr package.

library(tidyverse)

set.seed(7)

# make data
df <- data.frame(region=c(1:5),matrix(sample(c(rep(NA,7),1:10),30,T),ncol=6))

# name the columns
names(df)[2:7] <- paste0('X',c(2012:2017))

# G. Grothendieck's function, but unlist x into y and use dplyr's first() and last() functions on y
lastMinusFirst <- function(x, y = unlist(x)) last(na.omit(y)) - first(na.omit(y))

df %>%
  mutate(Diff = pmap_int(select(., starts_with("X")), # select columns, use pmap to list their contents
                         .f = lift_vd(lastMinusFirst))) # lift_vd around the function to allow ... argument

giving:

  region X2012 X2013 X2014 X2015 X2016 X2017 Diff
1      1     3    NA     1     4     4    NA    1
2      2    NA     1     8    NA     1     6    5
3      3    NA     8    NA    NA    10     2   -6
4      4     8     1     9    NA     7     1   -7
5      5     1     5    NA    NA    NA     6    5
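Since the question asks for a dplyr solution, the same row-wise idea can also be sketched with `rowwise()` and `c_across()`. This assumes dplyr 1.0.0 or later, where those two functions are available. The helper `last_minus_first` and the small three-row data frame are made up for illustration and are not from the original answer.

```r
library(dplyr)

# Small illustrative data set (hypothetical values), including an all-NA row.
df <- data.frame(
  region = 1:3,
  X2012 = c(3, NA, NA),
  X2013 = c(NA, 1, NA),
  X2014 = c(1, 8, NA)
)

# Hypothetical helper: last non-NA value minus first non-NA value,
# or NA when nothing was observed.
last_minus_first <- function(x) {
  y <- x[!is.na(x)]
  if (length(y) == 0) return(NA_real_)
  y[length(y)] - y[1]
}

out <- df %>%
  rowwise() %>%                                                   # treat each row as its own group
  mutate(Diff = last_minus_first(c_across(starts_with("X")))) %>% # collect the X columns of the row into one vector
  ungroup()

out
```

Here `Diff` comes out as -2, 7 and `NA` for the three rows. This keeps everything inside a `mutate()` call without needing purrr.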

Answer 3:

We can use max.col with its ties.method argument to find the column index of the first and last non-NA value in each row, then subtract the first non-NA value from the last.

new_df <- !is.na(df[-1])

df$diff <- df[-1][cbind(seq_len(nrow(new_df)), max.col(new_df, ties.method = "last"))] -
           df[-1][cbind(seq_len(nrow(new_df)), max.col(new_df, ties.method = "first"))]

df
#  Region X2012 X2013 X2014 X2015 X2016 X2017 diff
#1      1    10  11.0    12  13.0    14    15    5
#2      2    NA  17.0    14    NA    23    NA    6
#3      3    12  18.0    18    NA    23    NA   11
#4      4    NA    NA    15  28.0    NA    38   23
#5      5    14  18.5    16  27.0    25    39   25
#6      6    15    NA    17  27.5    NA    39   24
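To see why this works, it may help to look at the intermediate indices on a tiny matrix. In a logical matrix every `TRUE` in a row ties for the maximum, so `ties.method = "first"` and `"last"` pick the first and last observed column. The sketch below uses rows 2 and 3 of the question's data plus an all-NA row, which is an addition for illustration only.

```r
# Rows 2 and 3 from the question, plus a hypothetical all-NA row.
m <- rbind(
  c(NA, 17, 14, NA, 23, NA),
  c(12, 18, 18, NA, 23, NA),
  c(NA, NA, NA, NA, NA, NA)
)

present <- !is.na(m)                                   # TRUE where a value was observed
first_col <- max.col(present, ties.method = "first")   # column of first TRUE per row
last_col  <- max.col(present, ties.method = "last")    # column of last TRUE per row

rows <- seq_len(nrow(m))
# Matrix indexing with (row, column) pairs pulls one value per row.
diffs <- m[cbind(rows, last_col)] - m[cbind(rows, first_col)]

first_col
last_col
diffs
```

For the first two rows the indices are 2 and 1 (first) and 5 and 5 (last), giving differences of 6 and 11. In the all-NA row every column ties, so the two lookups both return `NA` and the difference is `NA`.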

A tidyverse alternative is to gather the data into long format, dropping NA values, and then for each Region subtract the first value from the last.

library(dplyr)
df %>%
  tidyr::gather(key, value, -Region, na.rm = TRUE) %>%
  group_by(Region) %>%
  summarise(diff = last(value) - first(value))

#  Region  diff
#   <int> <dbl>
#1      1     5
#2      2     6
#3      3    11
#4      4    23
#5      5    25
#6      6    24
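The same long-format approach can be written with `pivot_longer()`, which the tidyr documentation recommends over `gather()` for new code. This is a sketch assuming tidyr 1.0.0 or later. The column name `year` and the two-row data frame are choices made here for illustration.

```r
library(dplyr)
library(tidyr)

# First two rows of the question's data.
df <- data.frame(
  Region = 1:2,
  X2012 = c(10, NA),
  X2013 = c(11, 17),
  X2014 = c(12, 14),
  X2015 = c(13, NA),
  X2016 = c(14, 23),
  X2017 = c(15, NA)
)

out <- df %>%
  pivot_longer(-Region,
               names_to = "year",          # hypothetical name for the former column headers
               values_to = "value",
               values_drop_na = TRUE) %>%  # drop unobserved cells, like na.rm = TRUE in gather()
  group_by(Region) %>%
  summarise(diff = last(value) - first(value))

out
```

This gives differences of 5 and 6 for Regions 1 and 2. As with the `gather()` version, a Region whose values are all NA would lose every row in the long format and therefore not appear in the summary.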

Source: https://stackoverflow.com/questions/57437497/take-difference-between-first-and-last-observations-in-a-row-where-each-row-is

Tags

dplyr

tidyverse