Question
I have data that looks like the following:
Region X2012 X2013 X2014 X2015 X2016 X2017
1 1 10 11 12 13 14 15
2 2 NA 17 14 NA 23 NA
3 3 12 18 18 NA 23 NA
4 4 NA NA 15 28 NA 38
5 5 14 18.5 16 27 25 39
6 6 15 NA 17 27.5 NA 39
The numbers themselves are irrelevant. What I am trying to do is take the difference between the earliest and latest observed values in each row and store it in a new column, so that:
Region Diff
1 (15 - 10) = 5
2 (23 - 17) = 6
and so on, showing only the final result rather than the subtraction. Ideally I would just subtract the 2012 column from the 2017 column, but since any row's first observation could start at any column and its last could end at any column, I am unsure how to take the difference.
A dplyr solution would be ideal but any solution at all is appreciated.
Answer 1:
Define a function which takes the last minus the first element of its vector argument omitting NAs and apply it to each row.
lastMinusFirst <- function(x, y = na.omit(x)) tail(y, 1) - y[1]
transform(DF, diff = apply(DF[-1], 1, lastMinusFirst))
giving:
Region X2012 X2013 X2014 X2015 X2016 X2017 diff
1 1 10 11.0 12 13.0 14 15 5
2 2 NA 17.0 14 NA 23 NA 6
3 3 12 18.0 18 NA 23 NA 11
4 4 NA NA 15 28.0 NA 38 23
5 5 14 18.5 16 27.0 25 39 25
6 6 15 NA 17 27.5 NA 39 24
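To make the row-wise step concrete, here is a minimal sketch that calls the helper on a single row written out by hand. The name lastMinusFirstSafe is hypothetical and not part of the answer; it is one possible way to return NA when a row has no observations at all, a case in which the original helper yields a zero-length result.

```r
# Original helper from the answer: last non-NA value minus first non-NA value.
lastMinusFirst <- function(x, y = na.omit(x)) tail(y, 1) - y[1]

# Row 2 of the example data, written out as a plain named vector.
row2 <- c(X2012 = NA, X2013 = 17, X2014 = 14, X2015 = NA, X2016 = 23, X2017 = NA)
row2_diff <- unname(lastMinusFirst(row2))  # 23 - 17

# Hypothetical variant (an assumption, not the answer's code):
# returns NA instead of a zero-length vector when every value is NA.
lastMinusFirstSafe <- function(x) {
  y <- x[!is.na(x)]
  if (length(y) == 0) return(NA_real_)
  y[length(y)] - y[1]
}
empty_diff <- lastMinusFirstSafe(rep(NA_real_, 6))
```

The safe variant can be passed to apply in the same way as the original helper.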
Note
The input in reproducible form:
Lines <- "Region X2012 X2013 X2014 X2015 X2016 X2017
1 1 10 11 12 13 14 15
2 2 NA 17 14 NA 23 NA
3 3 12 18 18 NA 23 NA
4 4 NA NA 15 28 NA 38
5 5 14 18.5 16 27 25 39
6 6 15 NA 17 27.5 NA 39"
DF <- read.table(text = Lines)
Update
Fixed.
Answer 2:
A tidyverse answer.
This answer modifies G. Grothendieck's function and uses Jenny Bryan's pmap method for row-wise calculations from the purrr package.
library(tidyverse)
set.seed(7)
# make data
df <- data.frame(region=c(1:5),matrix(sample(c(rep(NA,7),1:10),30,T),ncol=6))
# name the columns
names(df)[2:7] <- paste0('X',c(2012:2017))
# G. Grothendieck's function but unlist x and use dplyr's first() and last() functions
lastMinusFirst <- function(x, y = unlist(x)) last(na.omit(y)) - first(na.omit(y))
df %>%
mutate(Diff = pmap_int(select(., starts_with("X")), # select columns, use pmap to list their contents
.f = lift_vd(lastMinusFirst))) # lift_vd around the function to allow ... argument
giving:
region X2012 X2013 X2014 X2015 X2016 X2017 Diff
1 1 3 NA 1 4 4 NA 1
2 2 NA 1 8 NA 1 6 5
3 3 NA 8 NA NA 10 2 -6
4 4 8 1 9 NA 7 1 -7
5 5 1 5 NA NA NA 6 5
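Newer purrr releases deprecate the lift_*() helpers, so the same row-wise idea can also be sketched with dplyr's rowwise() and c_across(), assuming dplyr 1.0.0 or later. The helper name span_diff is hypothetical, and the data frame below is the question's data typed in by hand rather than the random data used in this answer.

```r
library(dplyr)

# The question's data, entered by hand.
DF <- data.frame(
  Region = 1:6,
  X2012 = c(10, NA, 12, NA, 14, 15),
  X2013 = c(11, 17, 18, NA, 18.5, NA),
  X2014 = c(12, 14, 18, 15, 16, 17),
  X2015 = c(13, NA, NA, 28, 27, 27.5),
  X2016 = c(14, 23, 23, NA, 25, NA),
  X2017 = c(15, NA, NA, 38, 39, 39)
)

# Hypothetical helper: last non-NA minus first non-NA, NA if nothing is observed.
span_diff <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) == 0) return(NA_real_)
  v[length(v)] - v[1]
}

res <- DF %>%
  rowwise() %>%                                         # treat each row as its own group
  mutate(Diff = span_diff(c_across(starts_with("X")))) %>%
  ungroup()
```

c_across() collects the selected columns of the current row into one vector, which is what the helper expects.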
Answer 3:
We can use max.col with its ties.method argument. We subtract the first non-NA value in each row from the last non-NA value.
new_df <- !is.na(df[-1])
df$diff <- df[-1][cbind(seq_len(nrow(new_df)), max.col(new_df, ties.method = "last"))] -
df[-1][cbind(seq_len(nrow(new_df)), max.col(new_df, ties.method = "first"))]
df
# Region X2012 X2013 X2014 X2015 X2016 X2017 diff
#1 1 10 11.0 12 13.0 14 15 5
#2 2 NA 17.0 14 NA 23 NA 6
#3 3 12 18.0 18 NA 23 NA 11
#4 4 NA NA 15 28.0 NA 38 23
#5 5 14 18.5 16 27.0 25 39 25
#6 6 15 NA 17 27.5 NA 39 24
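The indexing above is compact, so here is a small sketch of what max.col returns on the logical matrix. The matrix m holds rows 2 and 4 of the example data, and the variable names are illustrative only. Every non-NA cell is TRUE, so the row maximum is tied across all observed columns, and ties.method picks the first or last of them.

```r
# Rows 2 and 4 of the example data as a plain numeric matrix.
m <- rbind(c(NA, 17, 14, NA, 23, NA),
           c(NA, NA, 15, 28, NA, 38))

observed <- !is.na(m)  # TRUE where a value was recorded

# Column index of the first and last TRUE in each row.
first_col <- max.col(observed, ties.method = "first")
last_col  <- max.col(observed, ties.method = "last")

# Matrix indexing with (row, column) pairs pulls out one value per row.
rows <- seq_len(nrow(m))
diffs <- m[cbind(rows, last_col)] - m[cbind(rows, first_col)]
```

For a row with no observations every cell ties at FALSE, so the two lookups land on NA cells and the difference comes out as NA.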
A tidyverse answer could be to gather the data to long format, removing NA values, and for each Region subtract the first value from the last one.
library(dplyr)
df %>%
tidyr::gather(key, value, -Region, na.rm = TRUE) %>%
group_by(Region) %>%
summarise(diff = last(value) - first(value))
# Region diff
# <int> <dbl>
#1 1 5
#2 2 6
#3 3 11
#4 4 23
#5 5 25
#6 6 24
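tidyr now documents gather() as superseded by pivot_longer(), so the same reshaping can be sketched as follows, assuming tidyr 1.0.0 or later. The data frame is the question's data typed in by hand.

```r
library(dplyr)
library(tidyr)

# The question's data, entered by hand.
DF <- data.frame(
  Region = 1:6,
  X2012 = c(10, NA, 12, NA, 14, 15),
  X2013 = c(11, 17, 18, NA, 18.5, NA),
  X2014 = c(12, 14, 18, 15, 16, 17),
  X2015 = c(13, NA, NA, 28, 27, 27.5),
  X2016 = c(14, 23, 23, NA, 25, NA),
  X2017 = c(15, NA, NA, 38, 39, 39)
)

res <- DF %>%
  # Long format, one row per (Region, year), dropping missing values.
  pivot_longer(-Region, names_to = "key", values_to = "value",
               values_drop_na = TRUE) %>%
  group_by(Region) %>%
  # Rows keep their year order within each Region.
  summarise(diff = last(value) - first(value))
```

With either gather() or pivot_longer(), a Region whose values are all NA has no rows left after the reshape and so does not appear in the summary.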
Source: https://stackoverflow.com/questions/57437497/take-difference-between-first-and-last-observations-in-a-row-where-each-row-is