R: filling up data gaps with NAs and applying cumsum function

Question

I was asked to break down my earlier question (R: Applying cumulative sum function and filling data gaps with NA for plotting) and post a smaller sample. Here it is; the sample data can be found at: https://dl.dropboxusercontent.com/u/16277659/inputdata.csv

NAME;       ID;     SURVEY_YEAR;    REFERENCE_YEAR; VALUE
SAMPLE1;    253;    1883;           1883;           0
SAMPLE1;    253;    1884;           1883;           NA
SAMPLE1;    253;    1885;           1884;           12
SAMPLE1;    253;    1890;           1889;           17
SAMPLE2;    261;    1991;           1991;           0
SAMPLE2;    261;    1992;           1991;           -19
SAMPLE2;    261;    1994;           1992;           -58
SAMPLE2;    261;    1995;           1994;           -40
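
For anyone who wants to experiment without downloading the file, the sample above can be typed inline. This is only a sketch: the object name `dat` is chosen to match the answer further down, and the `read.table` arguments are one reasonable choice for this semicolon-separated layout, not necessarily how the original file was read.

```r
# Inline copy of the sample table. The name `dat` is an assumption
# chosen to match the naming used in the answer.
txt <- "NAME;ID;SURVEY_YEAR;REFERENCE_YEAR;VALUE
SAMPLE1;253;1883;1883;0
SAMPLE1;253;1884;1883;NA
SAMPLE1;253;1885;1884;12
SAMPLE1;253;1890;1889;17
SAMPLE2;261;1991;1991;0
SAMPLE2;261;1992;1991;-19
SAMPLE2;261;1994;1992;-58
SAMPLE2;261;1995;1994;-40"

# strip.white removes padding around fields; "NA" is read as a missing value
dat <- read.table(text = txt, sep = ";", header = TRUE,
                  strip.white = TRUE, stringsAsFactors = FALSE)
str(dat)
```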

I would like to calculate the cumulative sum of the column VALUE and fill the gaps for the years in between with NA values. The structure of the data should stay the same, as I need the other columns for further processing.

When filling the gaps, NAs should be inserted as in SAMPLE1. Note where the values sit after a run of NAs in the CUMSUM column: the last CUMSUM value should be repeated next to the last NA in VALUE (this is needed for plotting).

The exception is when the period between REFERENCE_YEAR and SURVEY_YEAR is longer than one year: in that case the value should also be written into the filled-in rows, as in SAMPLE2 for the period 1992 to 1994.

This is only a sample; my actual dataset has several more columns and about 40,000 rows. A base R solution would be best. REFERENCE_YEAR and SURVEY_YEAR are equal in the first row of each SAMPLE because of the code I use to write a zero entry for each group.

NAME;       ID;     SURVEY_YEAR;    REFERENCE_YEAR; VALUE;  CUMSUM
SAMPLE1;    253;    1883;           1883;           0;      0
SAMPLE1;    253;    1884;           1883;           NA;     NA
SAMPLE1;    253;    1885;           1884;           12;     12
SAMPLE1;    253;    1886;           1885;           NA;     NA
SAMPLE1;    253;    1887;           1886;           NA;     NA
SAMPLE1;    253;    1888;           1887;           NA;     NA
SAMPLE1;    253;    1889;           1888;           NA;     12
SAMPLE1;    253;    1890;           1889;           17;     29
SAMPLE2;    261;    1991;           1991;           0;      0
SAMPLE2;    261;    1992;           1991;           -19;    -19
SAMPLE2;    261;    1993;           1992;           -58;    -77
SAMPLE2;    261;    1994;           1992;           -58;    -77
SAMPLE2;    261;    1995;           1994;           -40;    -117

--------------------------------------------------------------------------------------------

Answer 1:

If dat is the dataset, one way would be:

Create a new dataset by expanding between minimum and maximum SURVEY_YEAR for each NAME

 dat1 <- setNames(stack(
             with(dat, tapply(SURVEY_YEAR, NAME, 
                FUN=function(x) seq(min(x), max(x)))))[2:1], c("NAME", "SURVEY_YEAR"))
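
To see what this expansion produces, here is a small sketch on made-up data. The names `toy`, `yrs`, and `toy1` are hypothetical and only used for illustration; the calls are the same `tapply`, `stack`, and `setNames` used above.

```r
# Toy data (hypothetical): group "A" has a gap between 1883 and 1886
toy <- data.frame(NAME = c("A", "A", "B", "B"),
                  SURVEY_YEAR = c(1883, 1886, 1991, 1992),
                  stringsAsFactors = FALSE)

# One full year sequence per group, held in a named list-like array
yrs <- with(toy, tapply(SURVEY_YEAR, NAME, function(x) seq(min(x), max(x))))

# stack() flattens it into two columns (values, ind);
# [2:1] swaps them so the group name comes first
toy1 <- setNames(stack(yrs)[2:1], c("NAME", "SURVEY_YEAR"))
toy1
```

Group "A" now has one row for every year from 1883 to 1886, and group "B" keeps its two years.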

Merge the new dataset dat1 with old dat

 datN <- merge(dat1, dat, all=TRUE)

Replace the missing values in REFERENCE_YEAR by SURVEY_YEAR from the previous row

 datN$REFERENCE_YEAR[is.na(datN$REFERENCE_YEAR)] <- datN$SURVEY_YEAR[which(is.na(datN$REFERENCE_YEAR))-1]
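
The previous-row lookup relies on the merged rows being sorted by NAME and SURVEY_YEAR, which `merge` does by default. Because the filled-in rows are consecutive years, the same values can probably be obtained by subtracting one from SURVEY_YEAR. The sketch below compares both on a hypothetical frame `toyN`; it is an illustration under that assumption, not part of the original answer.

```r
# Hypothetical merged frame: 1886 and 1887 were added by the expansion step
toyN <- data.frame(NAME = "A",
                   SURVEY_YEAR = 1885:1888,
                   REFERENCE_YEAR = c(1884, NA, NA, 1887))

gap <- is.na(toyN$REFERENCE_YEAR)

# Lookup used in the answer: SURVEY_YEAR of the row just above each gap row
prev <- toyN$SURVEY_YEAR[which(gap) - 1]

# Alternative sketch: a gap row's reference year is simply its own year minus one
toyN$REFERENCE_YEAR[gap] <- toyN$SURVEY_YEAR[gap] - 1
toyN
```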

Use na.locf from zoo to fill the NAs in ID, and add an empty CUMSUM column

 library(zoo)
 datN$ID <- na.locf(datN$ID)
 datN$CUMSUM <- NA
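
Since the question asks for base R, here is a possible way to fill ID without zoo. It assumes ID is constant within each NAME, which holds for the sample but is an assumption about the full dataset; `toyN` is a hypothetical stand-in for the merged frame.

```r
# Hypothetical merged frame with ID missing on the filled-in rows
toyN <- data.frame(NAME = c("A", "A", "A", "B", "B"),
                   ID = c(253, NA, NA, 261, NA),
                   stringsAsFactors = FALSE)

# ave() applies the function within each NAME group and recycles the
# single returned value (the first non-NA ID) over the whole group
toyN$ID <- ave(toyN$ID, toyN$NAME, FUN = function(x) x[!is.na(x)][1])
toyN
```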

Compute cumsum per NAME on the non-NA VALUE rows and assign the result to those rows

 datN$CUMSUM[!is.na(datN$VALUE)] <-  unlist(with(datN, tapply(VALUE, NAME, FUN=function(x) cumsum(x[!is.na(x)]))))
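
A small sketch of this step on hypothetical data (`toyN`), alongside a variant using `ave`. The `tapply` version returns the groups in factor-level order, so it relies on the rows already being sorted by NAME; the `ave` variant keeps each value at its own row position. The column name `CUMSUM2` is made up for the comparison.

```r
# Hypothetical merged frame, already sorted by NAME
toyN <- data.frame(NAME = c("A", "A", "A", "B", "B", "B"),
                   VALUE = c(0, NA, 12, 0, -19, -58),
                   stringsAsFactors = FALSE)
ok <- !is.na(toyN$VALUE)

# Approach from the answer: cumsum of the non-NA values per group
toyN$CUMSUM <- NA
toyN$CUMSUM[ok] <- unlist(with(toyN,
    tapply(VALUE, NAME, FUN = function(x) cumsum(x[!is.na(x)]))))

# Variant sketch: ave() on the non-NA rows only
toyN$CUMSUM2 <- NA
toyN$CUMSUM2[ok] <- ave(toyN$VALUE[ok], toyN$NAME[ok], FUN = cumsum)
toyN
```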

Look for rows having a difference between SURVEY_YEAR and REFERENCE_YEAR >1

 indx <- with(datN, SURVEY_YEAR-REFERENCE_YEAR)>1

For each such row, copy its VALUE and CUMSUM into the row directly above it (the filled-in gap row)

 datN[,c("VALUE", "CUMSUM")] <- lapply(datN[,c("VALUE", "CUMSUM")], function(x) {x[which(indx)-1] <- x[indx]; x})
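
The effect of this copy-back can be checked on a hypothetical four-row frame (`toyN`) that mimics SAMPLE2. Note that, as written, the step fills only the single row above each flagged row, so a period spanning three or more years would presumably leave the earlier gap rows as NA.

```r
# Hypothetical frame mimicking SAMPLE2: 1993 was added as a gap row
toyN <- data.frame(SURVEY_YEAR = 1992:1995,
                   REFERENCE_YEAR = c(1991, 1992, 1992, 1994),
                   VALUE = c(-19, NA, -58, -40),
                   CUMSUM = c(-19, NA, -77, -117))

# TRUE only for 1994, whose reference year lies two years back
indx <- with(toyN, SURVEY_YEAR - REFERENCE_YEAR) > 1

# Copy the flagged row's values into the row just above it
toyN[, c("VALUE", "CUMSUM")] <- lapply(toyN[, c("VALUE", "CUMSUM")],
    function(x) { x[which(indx) - 1] <- x[indx]; x })
toyN
```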

Change some of the NA values in CUMSUM to the previous non-NA value (for example, the last NA of the 1886 to 1889 gap in SAMPLE1)

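datN$CUMSUM <- with(datN, ave(CUMSUM, NAME, FUN = function(x) {
    x1 <- is.na(x)                       # TRUE where CUMSUM is missing
    rl <- rle(x1)                        # runs of NA / non-NA
    # positions just before each non-NA value that follows the first NA
    indx <- which(!(!(abs(x1 - 1) * (cumsum(x1) != 0) * sequence(rl$lengths)))) - 1
    # keep positions more than one step after the previous candidate
    # (the first candidate is compared against position 1)
    indx1 <- indx[indx - c(1, indx[-length(indx)]) > 1]
    # for each kept position, find the last non-NA index before it
    indxn <- unlist(lapply(indx1, function(y) {
        indx2 <- which(!is.na(x))
        tail(indx2[which(indx2 < y)], 1)
    }))
    x[indx1] <- x[indxn]                 # copy that earlier value into the gap row
    x
}))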

datN
#      NAME SURVEY_YEAR  ID REFERENCE_YEAR VALUE CUMSUM
#1  SAMPLE1        1883 253           1883     0      0
#2  SAMPLE1        1884 253           1883    NA     NA
#3  SAMPLE1        1885 253           1884    12     12
#4  SAMPLE1        1886 253           1885    NA     NA
#5  SAMPLE1        1887 253           1886    NA     NA
#6  SAMPLE1        1888 253           1887    NA     NA
#7  SAMPLE1        1889 253           1888    NA     12
#8  SAMPLE1        1890 253           1889    17     29
#9  SAMPLE2        1991 261           1991     0      0
#10 SAMPLE2        1992 261           1991   -19    -19
#11 SAMPLE2        1993 261           1992   -58    -77
#12 SAMPLE2        1994 261           1992   -58    -77
#13 SAMPLE2        1995 261           1994   -40   -117
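
The last step is the hardest to read, so here is a simplified sketch of the same idea: for every run of more than one NA that sits between two values, repeat the preceding value in the last NA of the run. The helper `carry_last` is hypothetical and is not the answer's code; it reproduces the SAMPLE1 result but may behave differently on edge cases.

```r
# Hypothetical helper: fill the last NA of each interior NA run (length > 1)
# with the value that precedes the run
carry_last <- function(x) {
  r <- rle(is.na(x))
  ends <- cumsum(r$lengths)          # last index of each run
  starts <- ends - r$lengths + 1     # first index of each run
  for (k in which(r$values & r$lengths > 1)) {
    # skip NA runs at the very start or very end of the vector
    if (starts[k] > 1 && ends[k] < length(x)) {
      x[ends[k]] <- x[starts[k] - 1]
    }
  }
  x
}

# CUMSUM of SAMPLE1 before the final step
res <- carry_last(c(0, NA, 12, NA, NA, NA, NA, 29))
res
```

Within a grouped frame it could be applied as `ave(datN$CUMSUM, datN$NAME, FUN = carry_last)`.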

Source: https://stackoverflow.com/questions/25276758/r-filling-up-data-gaps-with-nas-and-applying-cumsum-function

Tags

dataframe

cumsum