Question
Let's say I have scores for 5 countries over a period of 10 years, such as:
mydata <- expand.grid(
  country = c('A', 'B', 'C', 'D', 'E'),
  year = c('1980','1981','1982','1983','1984','1985','1986','1987','1988','1989'))
mydata$score <- round(runif(50, 0, 2), 4)
# reshape() is the base R function from the stats package, so no extra library is needed
mydata <- reshape(mydata, v.names = "score", idvar = "year", timevar = "country", direction = "wide")
> head(mydata)
year score.A score.B score.C score.D score.E
1 1980 1.0538 1.6921 1.3165 1.7434 1.9687
6 1981 1.4773 1.6479 0.3135 0.6172 0.7704
11 1982 0.8748 1.3704 0.2788 1.6306 1.7237
16 1983 1.1224 1.1340 1.7684 1.3352 0.4317
21 1984 1.5496 1.8706 1.4641 0.5313 0.8590
26 1985 1.7715 1.8953 0.6230 0.3580 1.6313
Now I would like to create a new variable "period" that is 1 if a year's score differs from the previous year's score by more than 0.5 in either direction, and 0 otherwise. I would like to do this for all 5 countries. It would also be great to identify the country-years for which period = 1 and display them in a table. The desired result looks like this:
> head(mydata)
year score.A score.B score.C score.D score.E period.A period.B ...
1 1980 1.0538 1.6921 1.3165 1.7434 1.9687 NA NA
6 1981 1.4773 1.6479 0.3135 0.6172 0.7704 0 ....
11 1982 0.8748 1.3704 0.2788 1.6306 1.7237 1
16 1983 1.1224 1.1340 1.7684 1.3352 0.4317 0
21 1984 1.5496 1.8706 1.4641 0.5313 0.8590 0
26 1985 1.7715 1.8953 0.6230 0.3580 1.6313 0
I very much hope that this is not too much to ask. I tried it with dist from library(proxy), but I do not know how to restrict the function to pairs of observations rather than the full row. Thanks a million!!
Answer 1:
This one uses diff and lapply:
score.cols <- grep("score", colnames(mydata), value=TRUE)
period.cols <- gsub("score", "period", score.cols)
compute.period <- function(x)as.integer(c(NA, abs(diff(x)) > 0.5))
cbind(mydata, `names<-`(lapply(mydata[score.cols], compute.period), period.cols))
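To see what compute.period does to a single column, here is a minimal sketch applied to the first four score.A values from the question. diff() returns the year-over-year changes, abs() drops the sign, and the leading NA stands for the first year, which has no previous year to compare against.

```r
# Same helper as above: 1 if the score moved by more than 0.5, else 0
compute.period <- function(x) as.integer(c(NA, abs(diff(x)) > 0.5))

# First four score.A values from the question's sample output
score_a <- c(1.0538, 1.4773, 0.8748, 1.1224)

diff(score_a)            # year-over-year changes: 0.4235 -0.6025 0.2476
compute.period(score_a)  # NA 0 1 0
```

This matches the NA, 0, 1, 0 pattern the question expects for period.A.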
Edit: It becomes more apparent (with your other question posted this morning) that maybe you are not working with the right data structure. Instead, I would recommend you do your work on the raw (before it is reshaped) data:
period.fun <- function(x) as.integer(c(NA, abs(diff(x)) > 0.5))
mydata <- within(mydata, period <- ave(score, country, FUN = period.fun))
Only then would you reshape mydata to get it in its final form.
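As a rough sketch of that order of operations (compute on the long data first, reshape afterwards), the following uses a small made-up data frame called toy. The name and its values are assumptions for illustration only, not the data from the question.

```r
# Hypothetical long-format data: 2 countries, 3 years (made-up scores)
toy <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(1980:1982, times = 2),
  score   = c(0.2, 0.9, 1.0,   # country A
              1.5, 1.4, 0.6)   # country B
)

# 1 if the score moved by more than 0.5 versus the previous year, else 0
period.fun <- function(x) as.integer(c(NA, abs(diff(x)) > 0.5))

# ave() applies period.fun within each country and keeps the row order
toy$period <- ave(toy$score, toy$country, FUN = period.fun)

# Only now reshape to wide, carrying both score and period along
wide <- reshape(toy, v.names = c("score", "period"),
                idvar = "year", timevar = "country", direction = "wide")
wide
```

Passing both columns in v.names gives one score.* and one period.* column per country, which is the layout asked for in the question.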
Answer 2:
First, create the data, using set.seed() to make it reproducible:
set.seed(1014)
mydata <- expand.grid(
country = c('A', 'B', 'C', 'D', 'E'),
year = 1980:1989
)
mydata$score <- round(runif(50, 0, 2), 4)
head(mydata)
#> country year score
#> 1 A 1980 0.1615
#> 2 B 1980 1.6687
#> 3 C 1980 1.2015
#> 4 D 1980 0.3144
#> 5 E 1980 0.0148
#> 6 A 1981 0.9328
Next, use dplyr to break into countries and compare each value to the previous:
library(dplyr)
out <- mydata %>%
  group_by(country) %>%
  mutate(big_diff = abs(score - lag(score)) > 0.5)
out %>%
  arrange(country, year) %>%
  head(10)
#> Source: local data frame [10 x 4]
#> Groups: country
#>
#> country year score big_diff
#> 1 A 1980 0.1615 NA
#> 2 A 1981 0.9328 TRUE
#> 3 A 1982 1.7492 TRUE
#> 4 A 1983 0.3913 TRUE
#> 5 A 1984 0.5798 FALSE
#> 6 A 1985 1.4830 TRUE
#> 7 A 1986 0.0625 TRUE
#> 8 A 1987 0.8643 TRUE
#> 9 A 1988 1.3603 FALSE
#> 10 A 1989 1.5312 FALSE
After this you could coerce big_diff to a number, and use reshape to move country to the columns, but I probably wouldn't because it will be harder to work with in future steps. See tidy data for more details.
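Staying in the long format also makes it easy to list the country-years where the flag is set, which the question asks for. The following is a minimal sketch, assuming a current dplyr with the %>% pipe; toy and flagged are made-up names and the scores are invented for illustration.

```r
library(dplyr)

# Hypothetical long-format data (made-up scores)
toy <- data.frame(
  country = rep(c("A", "B"), each = 4),
  year    = rep(1980:1983, times = 2),
  score   = c(1.0, 1.2, 1.9, 1.8,   # country A
              0.4, 1.1, 1.0, 0.3)   # country B
)

flagged <- toy %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%   # make sure lag() sees years in order
  mutate(period = as.integer(abs(score - lag(score)) > 0.5)) %>%
  ungroup() %>%
  filter(period == 1) %>%               # filter() also drops the NA first years
  select(country, year, score)

flagged
```

With these toy values the table holds three rows: A 1982, B 1981 and B 1983.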
Answer 3:
library(stringr)
periods <- function(mydata) {
  # pull out the columns with "score" in the name
  score_columns <- mydata[, str_detect(names(mydata), "score"), drop = FALSE]
  # make a copy to store the periods
  period_columns <- score_columns
  # rename the columns in periods
  names(period_columns) <- str_replace_all(names(period_columns), "score", "periods")
  n <- nrow(score_columns)
  for (i in seq_along(score_columns)) {
    # previous year's score, with NA for the first year
    offset <- c(NA, score_columns[seq_len(n - 1), i])
    # if the absolute difference is > 0.5, return 1, else return 0
    period_columns[, i] <- ifelse(abs(offset - score_columns[, i]) > 0.5, 1, 0)
  }
  return(cbind(mydata, period_columns))
}
# Then simply call the function on your data. It should work with a variable number
# of score columns.
> periods(mydata)
year score.A score.B score.C score.D score.E periods.A
1 1980 1.8251 1.3168 0.9264 1.4921 0.9870 NA
6 1981 0.7603 1.7270 0.0324 1.8332 0.7147 1
11 1982 1.5245 0.6904 1.1699 0.5918 0.3029 1
16 1983 0.5280 0.2333 1.4395 1.2145 0.7273 1
21 1984 1.8739 1.8420 0.9940 0.2886 1.5975 1
26 1985 1.8794 0.7352 1.1665 0.9859 1.1301 0
31 1986 1.8002 0.3546 0.3885 1.9985 1.7183 0
36 1987 1.7985 1.0536 1.8445 0.8573 1.9307 0
41 1988 1.8444 0.6644 1.4765 0.2586 0.5531 0
46 1989 0.7342 0.4921 0.5816 0.8954 0.9359 1
periods.B periods.C periods.D periods.E
1 NA NA NA NA
6 0 1 0 0
11 1 1 1 0
16 0 0 1 0
21 1 0 1 1
26 1 0 1 0
31 0 1 1 1
36 1 1 1 0
41 0 0 1 1
46 0 1 1 0
Answer 4:
You can do it in just one line with dplyr, working on the long (not yet reshaped) data:
library(dplyr)
df2 <- mydata %>% group_by(country) %>% mutate(period = c(NA, as.numeric(abs(diff(score)) > 0.5)))
Then you can reshape it with dcast:
library(reshape2)
dcast(df2, year ~ country, value.var = "period")
Results:
year A B C D E
1 1980 NA NA NA NA NA
2 1981 0 1 0 0 1
3 1982 1 0 1 1 0
4 1983 1 0 1 0 1
5 1984 1 1 0 1 1
6 1985 1 0 1 0 0
7 1986 1 1 1 0 1
8 1987 1 1 0 0 0
9 1988 0 0 1 1 1
10 1989 0 0 0 0 0
Source: https://stackoverflow.com/questions/12443202/how-to-get-the-difference-in-value-between-subsequent-observations-country-year