I have a data frame df (which can be downloaded here), taken from a register of companies, that looks something like this:
Provider.ID Local.Authority month year entry exit total
1 1-102642676 Warwickshire 10 2010 2 0 2
2 1-102642676 Bury 10 2010 1 0 1
3 1-102642676 Kent 10 2010 1 0 1
4 1-102642676 Essex 10 2010 1 0 1
5 1-102642676 Lambeth 10 2010 2 0 2
6 1-102642676 East Sussex 10 2010 5 0 5
7 1-102642676 Bristol, City of 10 2010 1 0 1
8 1-102642676 Liverpool 10 2010 1 0 1
9 1-102642676 Merton 10 2010 1 0 1
10 1-102642676 Cheshire East 10 2010 2 0 2
11 1-102642676 Knowsley 10 2010 1 0 1
12 1-102642676 North Yorkshire 10 2010 1 0 1
13 1-102642676 Kingston upon Thames 10 2010 1 0 1
14 1-102642676 Lewisham 10 2010 1 0 1
15 1-102642676 Wiltshire 10 2010 1 0 1
16 1-102642676 Hampshire 10 2010 1 0 1
17 1-102642676 Wandsworth 10 2010 1 0 1
18 1-102642676 Brent 10 2010 1 0 1
19 1-102642676 West Sussex 10 2010 1 0 1
20 1-102642676 Windsor and Maidenhead 10 2010 1 0 1
21 1-102642676 Luton 10 2010 1 0 1
22 1-102642676 Enfield 10 2010 1 0 1
23 1-102642676 Somerset 10 2010 1 0 1
24 1-102642676 Cambridgeshire 10 2010 1 0 1
25 1-102642676 Hillingdon 10 2010 1 0 1
26 1-102642676 Havering 10 2010 1 0 1
27 1-102642676 Solihull 10 2010 1 0 1
28 1-102642676 Bexley 10 2010 1 0 1
29 1-102642676 Sandwell 10 2010 1 0 1
30 1-102642676 Southampton 10 2010 1 0 1
31 1-102642676 Trafford 10 2010 1 0 1
32 1-102642676 Newham 10 2010 1 0 1
33 1-102642676 West Berkshire 10 2010 1 0 1
34 1-102642676 Reading 10 2010 1 0 1
35 1-102642676 Hartlepool 10 2010 1 0 1
36 1-102642676 Hampshire 3 2011 1 0 1
37 1-102642676 Kent 9 2011 0 1 -1
38 1-102642676 North Yorkshire 12 2011 0 1 -1
39 1-102642676 North Somerset 12 2012 2 0 2
40 1-102642676 Kent 10 2014 1 0 1
41 1-102642676 Somerset 1 2016 0 1 -1
My goal is to create a variable that reflects the cumulative sum of the last variable (total) for each Local.Authority and each year. total is just the difference between entry and exit. I have tried to carry out this operation with dplyr as follows:
library(dplyr)
df.1 = df %>%
  group_by(Local.Authority, year) %>%
  mutate(cum.total = cumsum(total)) %>%
  arrange(year, month, Local.Authority)
Yielding the (wrong) result below:
> df.1
Source: local data frame [41 x 8]
Groups: Local.Authority, year [41]
Provider.ID Local.Authority month year entry exit total cum.total
<fctr> <fctr> <int> <int> <int> <int> <int> <int>
1 1-102642676 Bexley 10 2010 1 0 1 35
2 1-102642676 Brent 10 2010 1 0 1 25
3 1-102642676 Bristol, City of 10 2010 1 0 1 13
4 1-102642676 Bury 10 2010 1 0 1 3
5 1-102642676 Cambridgeshire 10 2010 1 0 1 31
6 1-102642676 Cheshire East 10 2010 2 0 2 17
7 1-102642676 East Sussex 10 2010 5 0 5 12
8 1-102642676 Enfield 10 2010 1 0 1 29
9 1-102642676 Essex 10 2010 1 0 1 5
10 1-102642676 Hampshire 10 2010 1 0 1 23
.. ... ... ... ... ... ... ... ...
I have confirmed these results by checking levels of the variable Local.Authority that appear in several years (for example Kent):
> check = df.1 %>% filter(Local.Authority == "Kent")
> check
Source: local data frame [3 x 8]
Groups: Local.Authority, year [3]
Provider.ID Local.Authority month year entry exit total cum.total
<fctr> <fctr> <int> <int> <int> <int> <int> <int>
1 1-102642676 Kent 10 2010 1 0 1 4
2 1-102642676 Kent 9 2011 0 1 -1 42
3 1-102642676 Kent 10 2014 1 0 1 44
Where it should be:
Provider.ID Local.Authority month year entry exit total cum.total
<fctr> <fctr> <int> <int> <int> <int> <int> <int>
1 1-102642676 Kent 10 2010 1 0 1 1
2 1-102642676 Kent 9 2011 0 1 -1 0
3 1-102642676 Kent 10 2014 1 0 1 1
Does anyone know why cumsum is producing these results? Many thanks in advance.
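When you group by Local.Authority and year, each combination of authority and year forms its own group. For Kent every such group contains a single row, so cumsum simply returns total unchanged: 1, -1, 1. It is better to group by Local.Authority only, so that cumsum runs over all of an authority's rows and gives 1, 0, 1: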
df <- df %>%
  group_by(Local.Authority) %>%
  mutate(cum.to = cumsum(total))
> df
Source: local data frame [3 x 8]
Groups: Local.Authority [1]
Provider.ID Local.Authority month year entry exit total cum.to
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1-102642676 Kent 10 2010 1 0 1 1
2 1-102642676 Kent 9 2011 0 1 -1 0
3 1-102642676 Kent 10 2014 1 0 1 1
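Because cumsum follows row order, it may be safer to sort by year and month before computing the running total. The following is only a minimal sketch of the two groupings side by side; the data frame kent and the names by_year and by_authority are hypothetical, built here to mimic the three Kent rows rather than the full data set:

```r
library(dplyr)

# Hypothetical toy data mimicking the three Kent rows (not the full data set)
kent <- data.frame(
  Local.Authority = c("Kent", "Kent", "Kent"),
  month = c(10, 9, 10),
  year  = c(2010, 2011, 2014),
  total = c(1, -1, 1)
)

# Grouping by authority AND year: every group has one row, so cumsum == total
by_year <- kent %>%
  group_by(Local.Authority, year) %>%
  mutate(cum.total = cumsum(total)) %>%
  ungroup()

# Sorting chronologically, then grouping by authority only:
# cumsum now accumulates across years within each authority
by_authority <- kent %>%
  arrange(year, month) %>%
  group_by(Local.Authority) %>%
  mutate(cum.total = cumsum(total)) %>%
  ungroup()

by_year$cum.total       # 1 -1 1
by_authority$cum.total  # 1 0 1
```

Calling ungroup() at the end is optional here; it just avoids carrying the grouping into later steps.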
I found the solution to my problem. After restarting my R session, I got the expected result by grouping only by Local.Authority and then arranging:
> df.1 = df %>% group_by(Local.Authority) %>%
+ mutate(cum.total = cumsum(total)) %>%
+ arrange(year, month, Local.Authority)
> df.1
Source: local data frame [41 x 8]
Groups: Local.Authority [36]
Provider.ID Local.Authority month year entry exit total cum.total
<fctr> <fctr> <int> <int> <int> <int> <int> <int>
1 1-102642676 Bexley 10 2010 1 0 1 1
2 1-102642676 Brent 10 2010 1 0 1 1
3 1-102642676 Bristol, City of 10 2010 1 0 1 1
4 1-102642676 Bury 10 2010 1 0 1 1
5 1-102642676 Cambridgeshire 10 2010 1 0 1 1
6 1-102642676 Cheshire East 10 2010 2 0 2 2
7 1-102642676 East Sussex 10 2010 5 0 5 5
8 1-102642676 Enfield 10 2010 1 0 1 1
9 1-102642676 Essex 10 2010 1 0 1 1
10 1-102642676 Hampshire 10 2010 1 0 1 1
Checking "Kent" now yields the expected result:
> check = df.1 %>% filter(Local.Authority == "Kent")
> check
Source: local data frame [3 x 8]
Groups: Local.Authority [1]
Provider.ID Local.Authority month year entry exit total cum.total
<fctr> <fctr> <int> <int> <int> <int> <int> <int>
1 1-102642676 Kent 10 2010 1 0 1 1
2 1-102642676 Kent 9 2011 0 1 -1 0
3 1-102642676 Kent 10 2014 1 0 1 1
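The earlier wrong values for Kent (4, 42, 44) equal the running total over the whole data frame in its original row order, which suggests the grouping was being ignored in that session. One possible cause, which is an assumption and not something the question confirms, is that another package loaded after dplyr (plyr is a common one) masked mutate. A rough sketch of how to sidestep that by calling the dplyr verbs with an explicit namespace; the data frame toy and the names ungrouped and grouped are hypothetical:

```r
library(dplyr)

# Hypothetical toy data: two authorities with interleaved rows
toy <- data.frame(
  Local.Authority = c("Kent", "Bury", "Kent", "Bury"),
  total = c(1, 2, -1, 3)
)

# Ungrouped running total: what appears when the grouping is ignored
ungrouped <- cumsum(toy$total)

# Explicit dplyr:: prefixes, so a masking package cannot swap in its own verbs
grouped <- toy %>%
  dplyr::group_by(Local.Authority) %>%
  dplyr::mutate(cum.total = cumsum(total)) %>%
  dplyr::ungroup()

ungrouped          # 1 3 2 5
grouped$cum.total  # 1 2 0 5
```

Running conflicts() in the affected session lists masked names, which can help confirm or rule out this explanation.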
Source: https://stackoverflow.com/questions/39080104/cumsum-in-grouped-data-with-dplyr