How to get the difference in value between subsequent observations (country-years)?

问题

Let's say, I have scores for 5 countries over a period of 10 years such as:

mydata<-1:3
mydata<-expand.grid(
country=c('A', 'B', 'C', 'D', 'E'),
year=c('1980','1981','1982','1983','1984','1985','1986','1987','1988','1989'))
mydata$score=sapply(runif(50,0,2), function(x) {round(x,4)})

library(reshape)
mydata<-reshape(mydata, v.names="score", idvar="year", timevar="country", direction="wide")

> head(mydata)
   year score.A score.B score.C score.D score.E
1  1980  1.0538  1.6921  1.3165  1.7434  1.9687
6  1981  1.4773  1.6479  0.3135  0.6172  0.7704
11 1982  0.8748  1.3704  0.2788  1.6306  1.7237
16 1983  1.1224  1.1340  1.7684  1.3352  0.4317
21 1984  1.5496  1.8706  1.4641  0.5313  0.8590
26 1985  1.7715  1.8953  0.6230  0.3580  1.6313

Now, I would like to create a new variable "period" that is 1 if the score of the subsequent year is +/- 0.5 different from the score of the previous year and that is 0 if this is not true. I would like to do so for all 5 countries. And it would be great if it were possible to identify the country-years for which period = 1 and display this information in a table.

> head(mydata)
   year score.A score.B score.C score.D score.E  period.A  period.B ...
1  1980  1.0538  1.6921  1.3165  1.7434  1.9687   NA         NA
6  1981  1.4773  1.6479  0.3135  0.6172  0.7704   0          ....
11 1982  0.8748  1.3704  0.2788  1.6306  1.7237   1
16 1983  1.1224  1.1340  1.7684  1.3352  0.4317   0
21 1984  1.5496  1.8706  1.4641  0.5313  0.8590   0
26 1985  1.7715  1.8953  0.6230  0.3580  1.6313   0

I very much hope that this is not too much to ask. I tried it with dist in the library(proxy) but I do not know how to restrict the function to pairs of observation rather than the full row. Thanks a million!!

回答1:

This one uses diff and lapply:

score.cols  <- grep("score", colnames(mydata), value=TRUE)
period.cols <- gsub("score", "period", score.cols)
compute.period <- function(x)as.integer(c(NA, abs(diff(x)) > 0.5))
cbind(mydata, `names<-`(lapply(mydata[score.cols], compute.period), period.cols))

Edit: It becomes more apparent (with your other question posted this morning) that maybe you are not working with the right data structure. Instead, I would recommend you do your work on the raw (before it is reshaped) data:

period.fun <- function(x)as.integer(c(NA, abs(diff(x) > 0.5)))
mydata <- within(mydata, period <- ave(score, country, FUN = period.fun))

Only then you would reshape mydata to get it in its final form.

回答2:

First, create the data, using set.seed() to make it reproducible:

set.seed(1014)
mydata <- expand.grid(
  country = c('A', 'B', 'C', 'D', 'E'),
  year = 1980:1989
)
mydata$score <- round(runif(50, 0, 2), 4)
head(mydata)

#>   country year  score
#> 1       A 1980 0.1615
#> 2       B 1980 1.6687
#> 3       C 1980 1.2015
#> 4       D 1980 0.3144
#> 5       E 1980 0.0148
#> 6       A 1981 0.9328

Next, use dplyr to break into countries and compare each value to the previous:

library(dplyr)
out <- mydata %.% 
  group_by(country) %.%
  mutate(big_diff = abs(score - lag(score)) > 0.5)

out %.% 
  arrange(country, year) %.% 
  head(10)

#> Source: local data frame [10 x 4]
#> Groups: country
#> 
#>    country year  score big_diff
#> 1        A 1980 0.1615       NA
#> 2        A 1981 0.9328     TRUE
#> 3        A 1982 1.7492     TRUE
#> 4        A 1983 0.3913     TRUE
#> 5        A 1984 0.5798    FALSE
#> 6        A 1985 1.4830     TRUE
#> 7        A 1986 0.0625     TRUE
#> 8        A 1987 0.8643     TRUE
#> 9        A 1988 1.3603    FALSE
#> 10       A 1989 1.5312    FALSE

After this you could coerce big_diff() to a number, and use reshape to move country to the columns, but I probably wouldn't because it will be harder to work with in future steps. See tidy data for more details.

回答3:

library(stringr)
periods <- function(mydata) {
# pull out columns with score in the title
score_columns <- mydata[, str_detect(names(mydata), "score")]
# make a copy to store the periods
period_columns <- score_columns
# rename the columns in periods
names(period_columns) <- str_replace_all(names(period_columns), "score", "periods")

for ( i in 1:length(score_columns))
    { 
        offset <- c(NA,score_columns[2:length(score_columns[,i])-1,i])
            # if the diff is > 0.5, return 1 else return 0.
        period_columns[, i] <- ifelse(offset - score_columns[,i]>0.5, 1, 0)
    }

 return(cbind(data,period_columns))
}

# Then simply call the function on your data. It should work with variable number
# of score columns.

> periods(mydata)
   year score.A score.B score.C score.D score.E periods.A
1  1980  1.8251  1.3168  0.9264  1.4921  0.9870        NA
6  1981  0.7603  1.7270  0.0324  1.8332  0.7147         1
11 1982  1.5245  0.6904  1.1699  0.5918  0.3029         0
16 1983  0.5280  0.2333  1.4395  1.2145  0.7273         1
21 1984  1.8739  1.8420  0.9940  0.2886  1.5975         0
26 1985  1.8794  0.7352  1.1665  0.9859  1.1301         0
31 1986  1.8002  0.3546  0.3885  1.9985  1.7183         0
36 1987  1.7985  1.0536  1.8445  0.8573  1.9307         0
41 1988  1.8444  0.6644  1.4765  0.2586  0.5531         0
46 1989  0.7342  0.4921  0.5816  0.8954  0.9359         1
   periods.B periods.C periods.D periods.E
1         NA        NA        NA        NA
6          0         1         0         0
11         1         0         1         0
16         0         0         0         0
21         0         0         1         0
26         1         0         0         0
31         0         1         0         0
36         0         0         1         0
41         0         0         1         1
46         0         1         0         0

回答4:

You can do it with just one line with dplyr:

library(dplyr)
df2<-mydata%.%group_by(country)%.%mutate(period = c(NA, as.numeric(abs(diff(score))>0.5)))

Then you can reshape it with dcast

library(reshape2)
dcast(df2,year~country)

Results:

   year  A  B  C  D  E
1  1980 NA NA NA NA NA
2  1981  0  1  0  0  1
3  1982  1  0  1  1  0
4  1983  1  0  1  0  1
5  1984  1  1  0  1  1
6  1985  1  0  1  0  0
7  1986  1  1  1  0  1
8  1987  1  1  0  0  0
9  1988  0  0  1  1  1
10 1989  0  0  0  0  0

来源：https://stackoverflow.com/questions/12443202/how-to-get-the-difference-in-value-between-subsequent-observations-country-year

标签

dataframe

data.table

dplyr

panel-data