Question
I am having some trouble using the ddply function from the plyr package. I am trying to summarise the following data with counts and proportions within each group. Here's my data:
structure(list(X5employf = structure(c(1L, 3L, 1L, 1L, 1L, 3L,
1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 1L,
3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L,
3L, 3L, 1L), .Label = c("increase", "decrease", "same"), class = "factor"),
X5employff = structure(c(2L, 6L, NA, 2L, 4L, 6L, 5L, 2L,
2L, 8L, 2L, 2L, 2L, 7L, 7L, 8L, 11L, 7L, 2L, 8L, 8L, 11L,
7L, 6L, 2L, 5L, 2L, 8L, 7L, 7L, 7L, 8L, 6L, 7L, 5L, 5L, 7L,
2L, 6L, 7L, 2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 2L, 5L, 2L, 2L,
2L, 5L, 12L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 2L, 5L, 2L,
13L, 9L, 9L, 9L, 7L, 8L, 5L), .Label = c("", "1", "1 and 8",
"2", "3", "4", "5", "6", "6 and 7", "6 and 7 ", "7", "8",
"1 and 8"), class = "factor")), .Names = c("X5employf", "X5employff"
), row.names = c(NA, 73L), class = "data.frame")
And here's my call using ddply:
ddply(kano_final, .(X5employf, X5employff), summarise, n=length(X5employff), prop=(n/sum(n))*100)
This gives me the counts of each instance of X5employff correctly, but it seems as though the proportion is being calculated across each row and not within each level of the factor X5employf, as follows:
   X5employf X5employff  n prop
1   increase          1 26  100
2   increase          2  1  100
3   increase          3 15  100
4   increase    1 and 8  1  100
5   increase       <NA>  1  100
6   decrease          4  1  100
7   decrease          5  5  100
8   decrease          6  2  100
9   decrease          7  1  100
10  decrease          8  1  100
11      same          4  4  100
12      same          5  6  100
13      same          6  5  100
14      same    6 and 7  3  100
15      same          7  1  100
When manually calculating the proportions within each group I get this:
   X5employf X5employff  n  prop
1   increase          1 26 59.09
2   increase          2  1  2.27
3   increase          3 15 34.09
4   increase    1 and 8  1  2.27
5   increase       <NA>  1  2.27
6   decrease          4  1 10.00
7   decrease          5  5 50.00
8   decrease          6  2 20.00
9   decrease          7  1 10.00
10  decrease          8  1 10.00
11      same          4  4 21.05
12      same          5  6 31.57
13      same          6  5 26.31
14      same    6 and 7  3 15.78
15      same          7  1  5.26
As you can see the sum of proportions in each level of factor X5employf equals 100.
I know this is probably ridiculously simple, but I can't seem to get my head around it despite reading all sorts of similar posts. Can anyone help with this and my understanding of how the summarise function works?!
Many, many thanks
Marty
Answer 1:
You cannot do it in one ddply call, because each summarise call receives only the subset of your data for one specific combination of the grouping variables. At that lowest level, sum(n) is simply n for that one subset, so you have no access to the intermediate, per-X5employf total. Instead, do it in two steps:
kano_final <- ddply(kano_final, .(X5employf), transform,
sum.n = length(X5employf))
ddply(kano_final, .(X5employf, X5employff), summarise,
n = length(X5employff), prop = n / sum.n[1] * 100)
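The same two-step idea (count per combination, then divide by the parent group's total) can be sketched in base R. This is only an illustration on a small made-up data frame: the names toy, g and sg are hypothetical stand-ins for kano_final, X5employf and X5employff, not part of the original data.

```r
# Hypothetical toy data: g plays the role of X5employf, sg of X5employff.
toy <- data.frame(
  g  = c("a", "a", "a", "b", "b"),
  sg = c("x", "x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# Step 1: count the rows in each (g, sg) combination.
counts <- aggregate(n ~ g + sg, data = transform(toy, n = 1L), FUN = sum)
counts <- counts[order(counts$g, counts$sg), ]

# Step 2: divide each count by the total of its parent group g.
# ave() returns the per-g sum repeated for every row of that group.
counts$prop <- counts$n / ave(counts$n, counts$g, FUN = sum) * 100

print(counts)  # prop adds up to 100 within each level of g
```

The key point is that the denominator is computed over g alone, while the numerator is computed over the (g, sg) combination.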
Edit: using a single ddply call and table, as you hinted at:
ddply(kano_final, .(X5employf), summarise,
n = Filter(function(x) x > 0, table(X5employff, useNA = "ifany")),
prop = 100* prop.table(n),
X5employff = names(n))
Answer 2:
I'd add an example with dplyr, which does this easily in one step with short, easy-to-read syntax.
Here d is your data frame:
library(dplyr)
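# %>% replaces the older %.% chaining operator, which is no longer part of dplyr
d %>%
  group_by(X5employf, X5employff) %>%
  summarise(n = length(X5employff)) %>%
  # after summarise, the result is still grouped by X5employf,
  # so sum(n) is the total within each X5employf level
  mutate(ngr = sum(n), prop = n / ngr * 100)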
will result in
Source: local data frame [15 x 5]
Groups: X5employf
   X5employf X5employff  n ngr      prop
1   increase          1 26  44 59.090909
2   increase          2  1  44  2.272727
3   increase          3 15  44 34.090909
4   increase    1 and 8  1  44  2.272727
5   increase         NA  1  44  2.272727
6   decrease          4  1  10 10.000000
7   decrease          5  5  10 50.000000
8   decrease          6  2  10 20.000000
9   decrease          7  1  10 10.000000
10  decrease          8  1  10 10.000000
11      same          4  4  19 21.052632
12      same          5  6  19 31.578947
13      same          6  5  19 26.315789
14      same    6 and 7  3  19 15.789474
15      same          7  1  19  5.263158
Answer 3:
What you apparently want is the proportion of each X5employff value within every value of X5employf. However, you don't tell ddply that X5employf and X5employff play different roles; to ddply, they are simply two variables used to split up the data. Also, since there is one observation per line (i.e. count = 1 for every line of the data), the length of each (X5employf, X5employff) subset equals the sum of the counts in that subset, so n / sum(n) is always 1 and prop is always 100.
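The explanation above can be checked with a small sketch. It mimics the splitting step with base R's split() rather than plyr's internals, and uses a made-up data frame (toy, g and sg are hypothetical names, not the asker's data), so treat it as an illustration only.

```r
# Hypothetical toy data: g plays the role of X5employf, sg of X5employff.
toy <- data.frame(
  g  = c("a", "a", "a", "b", "b"),
  sg = c("x", "x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# Split on BOTH variables, as the original ddply call does.
pieces <- split(toy, list(toy$g, toy$sg), drop = TRUE)

# Inside each piece, n is a single number, so sum(n) == n.
props <- sapply(pieces, function(piece) {
  n <- nrow(piece)
  n / sum(n) * 100
})

print(props)  # every entry is 100, whatever the counts are
```

Because each piece only ever sees its own count, the per-g total has to be computed in a separate step or at a coarser grouping level.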
The simplest "plyr way" to solve your problem that I can think of is the following:
result <- ddply(kano_final, .(X5employf, X5employff), summarise, n=length(X5employff), .drop=FALSE)
n <- result$n
n2 <- ddply(kano_final, .(X5employf), summarise, n=length(X5employff))$n
result <- data.frame(result, prop=n/rep(n2, each=13)*100)
You can also use good old xtabs:
a <- xtabs(~X5employf + X5employff, kano_final)
b <- xtabs(~X5employf, kano_final)
a/matrix(b, nrow=3, ncol=ncol(a))
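As a possible shortcut for the manual division, base R's prop.table() with margin = 1 computes row-wise proportions of a contingency table directly. The sketch below uses a made-up data frame (toy, g and sg are hypothetical names standing in for kano_final, X5employf and X5employff) and compares both approaches.

```r
# Hypothetical toy data: g plays the role of X5employf, sg of X5employff.
toy <- data.frame(
  g  = c("a", "a", "a", "b", "b"),
  sg = c("x", "x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

a <- xtabs(~ g + sg, toy)  # counts per (g, sg) combination
b <- xtabs(~ g, toy)       # totals per g

# Manual version, as in the answer: divide each row of a by its g total.
manual <- a / matrix(b, nrow = nrow(a), ncol = ncol(a))

# prop.table() with margin = 1 divides each row by its own row sum.
p <- prop.table(a, margin = 1)

print(p * 100)  # percentages; each row adds up to 100
```

Note that xtabs() drops rows with missing values by default, so the two versions only agree when neither variable contains NA, as in this toy data.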
Source: https://stackoverflow.com/questions/18057081/ddply-summarise-proportional-count