问题
I have a dataframe which looks like this and I am trying to calculate the cumulative sum for the row VALUE. The input file can also be found here: https://dl.dropboxusercontent.com/u/16277659/input.csv
df <-read.csv("input.csv", sep=";", header=TRUE)
NAME; ID; SURVEY_YEAR REFERENCE_YEAR; VALUE
SAMPLE1; 253; 1880; 1879; 14
SAMPLE1; 253; 1881; 1880; -10
SAMPLE1; 253; 1882; 1881; 4
SAMPLE1; 253; 1883; 1882; 10
SAMPLE1; 253; 1884; 1883; 10
SAMPLE1; 253; 1885; 1884; 12
SAMPLE1; 253; 1889; 1888; 11
SAMPLE1; 253; 1890; 1889; 12
SAMPLE1; 253; 1911; 1910; -16
SAMPLE1; 253; 1913; 1911; -11
SAMPLE1; 253; 1914; 1913; -8
SAMPLE2; 261; 1992; 1991; -19
SAMPLE2; 261; 1994; 1992; -58
SAMPLE2; 261; 1995; 1994; -40
SAMPLE2; 261; 1996; 1995; -21
SAMPLE2; 261; 1997; 1996; -50
SAMPLE2; 261; 1998; 1997; -60
SAMPLE2; 261; 2005; 2004; -34
SAMPLE2; 261; 2006; 2005; -23
SAMPLE2; 261; 2007; 2006; -19
SAMPLE2; 261; 2008; 2007; -29
SAMPLE2; 261; 2009; 2008; -89
SAMPLE2; 261; 2013; 2009; -14
SAMPLE2; 261; 2014; 2013; -16
The end product I am aiming for are plots for each SAMPLE where on the x axis the SURVEY_YEAR is plotted and on the y axis the later calculated cumulative sum CUMSUM of the VALUE. My code so far to sort out the data:
# Filter out all values with less than 3 measurements by group (in this case does nothing, but is important with the rest of my data)
df <-read.csv("input.csv", sep=";", header=TRUE)
rowsn <- with(df,by(VALUE,ID,function(xx)sum(!is.na(xx))))
names(which(rowsn>=3))
dat <- df[df$ID %in% names(which(rowsn>=3)),]
# write new column which defines the beginning of the group (split by ID) and for the cumsum function(=0)
dat <- do.call(rbind, lapply(split(dat, dat$ID), function(x){
x <- rbind(x[1,],x); x[1, "VALUE"] <- 0; x[1, "SURVEY_YEAR"] <- x[1, "SURVEY_YEAR"] -1; return(x)}))
rownames(dat) <- seq_len(nrow(dat))
# write dat to csv file for inspection
write.table(dat, "dat.csv", sep=";", row.names=FALSE)
This results in the following dataframe which is the starting point for the calculation of the cumulative sum of the row VALUE.
NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE
SAMPLE1; 253; 1879; 1879; 0
SAMPLE1; 253; 1880; 1879; 14
SAMPLE1; 253; 1881; 1880; -10
SAMPLE1; 253; 1882; 1881; 4
SAMPLE1; 253; 1883; 1882; 10
SAMPLE1; 253; 1884; 1883; 10
SAMPLE1; 253; 1885; 1884; 12
SAMPLE1; 253; 1889; 1888; 11
SAMPLE1; 253; 1890; 1889; 12
SAMPLE1; 253; 1911; 1910; -16
SAMPLE1; 253; 1913; 1911; -11
SAMPLE1; 253; 1914; 1913; -8
SAMPLE2; 261; 1991; 1991; 0
SAMPLE2; 261; 1992; 1991; -19
SAMPLE2; 261; 1994; 1992; -58
SAMPLE2; 261; 1995; 1994; -40
SAMPLE2; 261; 1996; 1995; -21
SAMPLE2; 261; 1997; 1996; -50
SAMPLE2; 261; 1998; 1997; -60
SAMPLE2; 261; 2005; 2004; -34
SAMPLE2; 261; 2006; 2005; -23
SAMPLE2; 261; 2007; 2006; -19
SAMPLE2; 261; 2008; 2007; -29
SAMPLE2; 261; 2009; 2008; -89
SAMPLE2; 261; 2013; 2009; -14
SAMPLE2; 261; 2014; 2013; -16
The problem now is that I would like to calculate the cumulative sum of the row VALUE for each year. As you can see I have gaps between certain years (for example in SAMPLE1 between 1890 and 1911 and in SAMPLE2 between 1998 and 2005) and I would like to fill the gaps for each year inbetween with NA values so that I can plot with plot type='b' (points and lines) and so that the different gaps are not connected. What is important that if there are multiple NA values after each other, in the CUMSUM row the last NA value should be replaced with the last numerical value before..
The normal case is that the difference between the REFERENCE_YEAR and the SURVEY_YEAR equals 1 (e.g for the first example of SAMPLE1 from 1880 to 1881), but in some cases there are varying periods between the REFERENCE_YEAR and the SURVEY_YEAR (e.g. in SAMPLE1 from 1911 to 1913 and in SAMPLE2 from 2009 to 2013). If this is the case the function of cumulative sum should only be applied once and the value should stay the same for the period indicated (in the plot this results in a straight line that is connected).
Its difficult to explain everything in detail and maybe its easier if I provide an example of what the result should look like:
NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE; CUMSUM
SAMPLE1; 253; 1879; 1879; 0; 0
SAMPLE1; 253; 1880; 1879; 14; 14
SAMPLE1; 253; 1881; 1880; -10; 4
SAMPLE1; 253; 1882; 1881; 4; 8
SAMPLE1; 253; 1883; 1882; 10; 18
SAMPLE1; 253; 1884; 1883; 10; 28
SAMPLE1; 253; 1885; 1884; 12; 40
SAMPLE1; 253; 1886; 1885; NA; NA
SAMPLE1; 253; 1887; 1886; NA; NA
SAMPLE1; 253; 1888; 1887; NA; 40
SAMPLE1; 253; 1889; 1888; 11; 51
SAMPLE1; 253; 1890; 1889; 12; 63
SAMPLE1; 253; 1891; 1890; NA; NA
SAMPLE1; 253; 1892; 1891; NA; NA
SAMPLE1; 253; 1893; 1892; NA; NA
SAMPLE1; 253; 1894; 1893; NA; NA
SAMPLE1; 253; 1895; 1894; NA; NA
SAMPLE1; 253; 1896; 1895; NA; NA
SAMPLE1; 253; 1897; 1896; NA; NA
SAMPLE1; 253; 1898; 1897; NA; NA
SAMPLE1; 253; 1899; 1898; NA; NA
SAMPLE1; 253; 1900; 1899; NA; NA
SAMPLE1; 253; 1901; 1900; NA; NA
SAMPLE1; 253; 1902; 1901; NA; NA
SAMPLE1; 253; 1903; 1902; NA; NA
SAMPLE1; 253; 1904; 1903; NA; NA
SAMPLE1; 253; 1905; 1904; NA; NA
SAMPLE1; 253; 1906; 1905; NA; NA
SAMPLE1; 253; 1907; 1906; NA; NA
SAMPLE1; 253; 1908; 1907; NA; NA
SAMPLE1; 253; 1909; 1908; NA; NA
SAMPLE1; 253; 1910; 1909; NA; 63
SAMPLE1; 253; 1911; 1910; -16; 47
SAMPLE1; 253; 1912; 1911; -11; 36
SAMPLE1; 253; 1913; 1912; -11; 36
SAMPLE1; 253; 1914; 1913; -8; 28
SAMPLE2; 253; 1991; 1991; 0; 0
SAMPLE2; 253; 1992; 1991; -19; -19
SAMPLE2; 253; 1993; 1992; -58; -77
SAMPLE2; 253; 1994; 1993; -58; -135
SAMPLE2; 253; 1995; 1994; -40; -175
SAMPLE2; 253; 1996; 1995; -21; -196
SAMPLE2; 253; 1997; 1996; -50; -246
SAMPLE2; 253; 1998; 1997; -60; -306
SAMPLE2; 253; 1999; 1998; NA; NA
SAMPLE2; 253; 2000; 1999; NA; NA
SAMPLE2; 253; 2001; 2000; NA; NA
SAMPLE2; 253; 2002; 2001; NA; NA
SAMPLE2; 253; 2003; 2002; NA; NA
SAMPLE2; 253; 2004; 2003; NA; -306
SAMPLE2; 253; 2005; 2004; -34; -340
SAMPLE2; 253; 2006; 2005; -23; -363
SAMPLE2; 253; 2007; 2006; -19; -382
SAMPLE2; 253; 2008; 2007; -29; -411
SAMPLE2; 253; 2009; 2008; -89; -500
SAMPLE2; 253; 2010; 2009; -14; -514
SAMPLE2; 253; 2011; 2010; -14; -514
SAMPLE2; 253; 2012; 2011; -14; -514
SAMPLE2; 253; 2013; 2012; -14; -514
SAMPLE2; 253; 2014; 2013; -16; -530
Help with this rather complicated case would be very much appreciated! Thank you!
回答1:
BIG EDIT: Posted code, added correct library calls
library(dplyr)
df = read.csv("input.csv", sep=";", stringsAsFactors=FALSE)
#find min/max year for each SAMPLE
df_minmax = df %>%
group_by(NAME) %>%
summarise(min_year = min(SURVEY_YEAR),
max_year = max(SURVEY_YEAR))
#create an empty dataframe with what we want
df2 = data.frame(NAME = "",
ID = 0,
SURVEY_YEAR = min(df$SURVEY_YEAR):max(df$SURVEY_YEAR),
REFERENCE_YEAR = min(df$SURVEY_YEAR):max(df$SURVEY_YEAR) - 1,
VALUE = NA, stringsAsFactors=FALSE)
#fill in the NAMES dataframe - there's probably a better way to do this
for(i in 1:nrow(df_minmax)) {
min_year = df_minmax[i, ]$min_year
max_year = df_minmax[i, ]$max_year
df2[df2$SURVEY_YEAR %in% min_year:max_year, ]$NAME = df_minmax[i, ]$NAME
}
#fill in the values
#this line is a bit dangerous -- it relies on the fact that df1 and df2 have the same relative ordering
#don't change the ordering of df and df2 before this line.
df2[df2$SURVEY_YEAR %in% df$SURVEY_YEAR, ]$VALUE = df$VALUE
#in this example there is a long period between sample1 and sample2 we can filter those out
df2 = df2 %>% filter(NAME != "")
#Now we can do all the cumulative stuff
#for purposes of cumulative sums, set NA to 0
temp = df2$VALUE
df2[is.na(df2)] = 0
df2 = df2 %>% group_by(NAME) %>% mutate(csum = cumsum(VALUE))
#get back the NA values -- in case the NA values are useful to you
df2$VALUE = temp
Here's `head(df2):
NAME ID SURVEY_YEAR REFERENCE_YEAR VALUE csum
1 SAMPLE1 0 1880 1879 14 14
2 SAMPLE1 0 1881 1880 -10 4
3 SAMPLE1 0 1882 1881 4 8
4 SAMPLE1 0 1883 1882 10 18
5 SAMPLE1 0 1884 1883 10 28
6 SAMPLE1 0 1885 1884 12 40
7 SAMPLE1 0 1886 1885 NA 40
8 SAMPLE1 0 1887 1886 NA 40
9 SAMPLE1 0 1888 1887 NA 40
10 SAMPLE1 0 1889 1888 11 51
11 SAMPLE1 0 1890 1889 12 63
12 SAMPLE1 0 1891 1890 NA 63
13 SAMPLE1 0 1892 1891 NA 63
14 SAMPLE1 0 1893 1892 NA 63
15 SAMPLE1 0 1894 1893 NA 63
16 SAMPLE1 0 1895 1894 NA 63
17 SAMPLE1 0 1896 1895 NA 63
18 SAMPLE1 0 1897 1896 NA 63
19 SAMPLE1 0 1898 1897 NA 63
20 SAMPLE1 0 1899 1898 NA 63
Here's the outline of the steps I did above as a quick summary:
- Find the min/max year for each group in NAME.
- Create an empty dataframe that has the total range of all the years we want.
- Fill in the NAMES in the correct place in new empty dataframe.
- Fill in the VALUES in the correct place in new empty dataframe.
- Set NA's to 0 for purposes of cumulative sums
- Find cumulative sums by group.
- Replace the 0 back into NAs.
It's a bit hackish with the for
loop. I'm hoping no one strings me up for it.
来源:https://stackoverflow.com/questions/25245076/r-applying-cumulative-sum-function-and-filling-data-gaps-with-na-for-plotting