Question
I was asked to break down my earlier question (R: Applying cumulative sum function and filling data gaps with NA for plotting) and post a smaller sample. Here it is; the sample data is available at: https://dl.dropboxusercontent.com/u/16277659/inputdata.csv
NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE
SAMPLE1; 253; 1883; 1883; 0
SAMPLE1; 253; 1884; 1883; NA
SAMPLE1; 253; 1885; 1884; 12
SAMPLE1; 253; 1890; 1889; 17
SAMPLE2; 261; 1991; 1991; 0
SAMPLE2; 261; 1992; 1991; -19
SAMPLE2; 261; 1994; 1992; -58
SAMPLE2; 261; 1995; 1994; -40
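For anyone who wants to reproduce this without downloading the file, the sample above could be read in roughly like this. This is only a sketch: it assumes the semicolon-separated text is pasted inline, and it uses the name dat for the resulting data frame.

```r
# Assumption: the sample shown above is pasted inline instead of read from the CSV link.
dat <- read.table(text = "NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE
SAMPLE1; 253; 1883; 1883; 0
SAMPLE1; 253; 1884; 1883; NA
SAMPLE1; 253; 1885; 1884; 12
SAMPLE1; 253; 1890; 1889; 17
SAMPLE2; 261; 1991; 1991; 0
SAMPLE2; 261; 1992; 1991; -19
SAMPLE2; 261; 1994; 1992; -58
SAMPLE2; 261; 1995; 1994; -40",
  sep = ";", header = TRUE,
  strip.white = TRUE,          # drop the blanks that follow each semicolon
  stringsAsFactors = FALSE)
str(dat)
```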
I would like to calculate the cumulative sum of the VALUE column and fill the gaps for the years in between with NA values. The structure of the data should stay the same, as I need the other columns for further processing.
When filling the gaps, NAs should be inserted as in SAMPLE1. Please note the position of the values after NA when several NAs are filled in: in the CUMSUM column, the last CUMSUM value should be repeated beside the last NA in VALUE (this is needed for plotting).
An exception is the case where the period between REFERENCE_YEAR and SURVEY_YEAR is greater than one year: then the value should be written into the filled-in row, as in SAMPLE2 for the period 1992 to 1994.
This is only a sample; my actual dataset has several more columns and about 40000 rows. A solution in base R would be best. REFERENCE_YEAR and SURVEY_YEAR are equal in the first row of each SAMPLE because of the code I use to write a zero column for each group.
The desired output is:
NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE; CUMSUM
SAMPLE1; 253; 1883; 1883; 0; 0
SAMPLE1; 253; 1884; 1883; NA; NA
SAMPLE1; 253; 1885; 1884; 12; 12
SAMPLE1; 253; 1886; 1885; NA; NA
SAMPLE1; 253; 1887; 1886; NA; NA
SAMPLE1; 253; 1888; 1887; NA; NA
SAMPLE1; 253; 1889; 1888; NA; 12
SAMPLE1; 253; 1890; 1889; 17; 29
SAMPLE2; 261; 1991; 1991; 0; 0
SAMPLE2; 261; 1992; 1991; -19; -19
SAMPLE2; 261; 1993; 1992; -58; -77
SAMPLE2; 261; 1994; 1992; -58; -77
SAMPLE2; 261; 1995; 1994; -40; -117
--------------------------------------------------------------------------------------------
Answer 1:
If dat is the dataset, one way would be:
Create a new dataset by expanding between the minimum and maximum SURVEY_YEAR for each NAME:
dat1 <- setNames(stack(
    with(dat, tapply(SURVEY_YEAR, NAME,
        FUN = function(x) seq(min(x), max(x)))))[2:1], c("NAME", "SURVEY_YEAR"))
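To see what this expansion step produces, here is a small sketch on made-up data (the toy frame and its values are assumptions, not taken from the question). It uses split/lapply in place of tapply, which should yield the same per-group year sequences as a plain named list:

```r
# Hypothetical toy data: two groups with gaps in SURVEY_YEAR.
toy <- data.frame(NAME = c("A", "A", "B", "B"),
                  SURVEY_YEAR = c(2000, 2003, 2010, 2011),
                  stringsAsFactors = FALSE)

# One full year sequence per group, from the group's minimum to its maximum.
yrs <- lapply(split(toy$SURVEY_YEAR, toy$NAME),
              function(x) seq(min(x), max(x)))

# stack() returns columns (values, ind); [2:1] swaps them so the group comes first.
toy1 <- setNames(stack(yrs)[2:1], c("NAME", "SURVEY_YEAR"))
toy1
```

Group A is expanded to the four years 2000 to 2003, while group B keeps its two years.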
Merge the new dataset dat1 with the old dat:
datN <- merge(dat1, dat, all=TRUE)
Replace the missing values in REFERENCE_YEAR by the SURVEY_YEAR from the previous row:
datN$REFERENCE_YEAR[is.na(datN$REFERENCE_YEAR)] <- datN$SURVEY_YEAR[which(is.na(datN$REFERENCE_YEAR))-1]
Use na.locf from zoo to fill the NAs for ID:
library(zoo)
datN$ID <- na.locf(datN$ID)
datN$CUMSUM <- NA
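Since the question asks for base R, the na.locf call could probably be replaced by a small helper. The following is only a sketch; locf is a hypothetical name, not a function from base R or zoo. Unlike na.locf with its default settings, it keeps leading NAs instead of dropping them.

```r
# Hypothetical helper: carry the last non-NA value forward.
locf <- function(x) {
  # Index of the most recent non-NA element at each position (0 if none yet).
  idx <- cummax(seq_along(x) * (!is.na(x)))
  idx[idx == 0] <- NA   # leading NAs have nothing to carry forward
  x[idx]
}

locf(c(253, NA, NA, 261, NA))
# 253 253 253 261 261
```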
Compute cumsum of the non-NA VALUE rows within each NAME and assign the result to those rows of CUMSUM:
datN$CUMSUM[!is.na(datN$VALUE)] <- unlist(with(datN, tapply(VALUE, NAME, FUN=function(x) cumsum(x[!is.na(x)]))))
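The tapply/unlist line relies on the rows being sorted by NAME, which is the case here because merge sorts by the key columns. A variant with ave does not depend on row order. The sketch below uses made-up data (toy and its values are assumptions):

```r
# Hypothetical toy data with an NA inside group A.
toy <- data.frame(NAME  = c("A", "A", "A", "A", "B", "B"),
                  VALUE = c(0, NA, 12, 17, 0, -19),
                  stringsAsFactors = FALSE)

toy$CUMSUM <- NA
ok <- !is.na(toy$VALUE)
# ave() applies cumsum within each NAME and returns values in the original row order.
toy$CUMSUM[ok] <- ave(toy$VALUE[ok], toy$NAME[ok], FUN = cumsum)
toy$CUMSUM
# 0 NA 12 29 0 -19
```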
Look for rows where the difference between SURVEY_YEAR and REFERENCE_YEAR is greater than 1:
indx <- with(datN, SURVEY_YEAR-REFERENCE_YEAR)>1
Copy the VALUE and CUMSUM of those rows into the row immediately before each of them (the filled-in gap row):
datN[,c("VALUE", "CUMSUM")] <- lapply(datN[,c("VALUE", "CUMSUM")], function(x) {x[which(indx)-1] <- x[indx]; x})
Change the last NA of a longer NA run in CUMSUM to the previous non-NA value, as requested for plotting:
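datN$CUMSUM <- with(datN, ave(CUMSUM, NAME, FUN = function(x) {
    x1 <- is.na(x)
    rl <- rle(x1)
    # positions immediately before each non-NA value that comes after the first NA
    indx <- which(!(!(abs(x1 - 1) * (cumsum(x1) != 0) * sequence(rl$lengths)))) - 1
    # keep only positions more than one step after the previous candidate
    # (the first candidate is compared against position 1)
    indx1 <- indx[indx - c(1, indx[-length(indx)]) > 1]
    # for each kept position, find the last non-NA position before it
    indxn <- unlist(lapply(indx1, function(y) {
        indx2 <- which(!is.na(x))
        tail(indx2[which(indx2 < y)], 1)
    }))
    x[indx1] <- x[indxn]
    x
}))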
datN
# NAME SURVEY_YEAR ID REFERENCE_YEAR VALUE CUMSUM
#1 SAMPLE1 1883 253 1883 0 0
#2 SAMPLE1 1884 253 1883 NA NA
#3 SAMPLE1 1885 253 1884 12 12
#4 SAMPLE1 1886 253 1885 NA NA
#5 SAMPLE1 1887 253 1886 NA NA
#6 SAMPLE1 1888 253 1887 NA NA
#7 SAMPLE1 1889 253 1888 NA 12
#8 SAMPLE1 1890 253 1889 17 29
#9 SAMPLE2 1991 261 1991 0 0
#10 SAMPLE2 1992 261 1991 -19 -19
#11 SAMPLE2 1993 261 1992 -58 -77
#12 SAMPLE2 1994 261 1992 -58 -77
#13 SAMPLE2 1995 261 1994 -40 -117
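The index arithmetic in the last step is fairly dense. If the rule is read as "in every run of two or more NAs that lies between two non-NA values, copy the preceding non-NA value into the last NA of the run", it could be sketched more directly with rle. That reading is an assumption, fill_last_na is a hypothetical name, and the function is not guaranteed to match the code above in every edge case.

```r
# Hypothetical helper, written under the assumption stated above.
fill_last_na <- function(x) {
  r <- rle(is.na(x))
  ends   <- cumsum(r$lengths)        # last index of each run
  starts <- ends - r$lengths + 1     # first index of each run
  # NA runs of length >= 2 with a non-NA neighbour on both sides
  pick <- r$values & r$lengths >= 2 & starts > 1 & ends < length(x)
  x[ends[pick]] <- x[starts[pick] - 1]
  x
}

fill_last_na(c(0, NA, 12, NA, NA, NA, NA, 29))
# 0 NA 12 NA NA NA 12 29
```

Applied per group, for example with ave(CUMSUM, NAME, FUN = fill_last_na), this gives the SAMPLE1 pattern shown in the output above and leaves a vector without NAs unchanged.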
Source: https://stackoverflow.com/questions/25276758/r-filling-up-data-gaps-with-nas-and-applying-cumsum-function