I would like to calculate the most frequent factor level by category with plyr using the code below. The data frame b
shows the requested result. Why does
You have used existing function names almost exclusively in your example: levels, cat, and mode. Generally, that doesn't create much of a problem; for example, calling a data.frame "df" doesn't break R's df() function. But it almost always leads to more ambiguous or confusing code, and in this case, it made things "break". Arun's answer does a great job of showing why.
You can easily fix your problem by renaming your "mode" function. In the example below, I've simplified it a little bit in addition to renaming it, and it works as you expected.
Mode <- function(x) names(which.max(table(x)))
ddply(a, .(cat), summarise,
mlevels=Mode(levels))
# cat mlevels
# 1 1 6
# 2 2 5
# 3 3 9
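For anyone without plyr at hand, the same per-group calculation can be sketched in base R with tapply. The data frame here is made up for illustration (the question's a isn't reproduced in this answer), so treat the values as an assumption rather than the original data.

```r
# Hypothetical data standing in for the question's data frame `a`
a <- data.frame(
  cat    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  levels = factor(c(6, 6, 4, 5, 5, 7, 9, 8, 9))
)

# Same helper as above; the capitalised name avoids masking base::mode
Mode <- function(x) names(which.max(table(x)))

# Apply Mode() to `levels` within each value of `cat`
res <- tapply(a$levels, a$cat, Mode)
res
# For this made-up data: "6", "5", "9" for categories 1, 2, 3
```

Note that which.max returns the first maximum, so if two levels tie within a group, the one that comes first in the factor's level order wins.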
Of course, there's a really cumbersome workaround: use get() and specify where to search for the function.
> mode <- function(x) names(table(x))[which.max(table(x))]
> ddply(a, .(cat), summarise, mlevels = get("mode", ".GlobalEnv")(levels))
cat mlevels
1 1 6
2 2 5
3 3 9
When you use summarise, plyr seems to "not see" the function declared in the global environment, because it finds a function of the same name in base first:
We can check this using Hadley's handy pryr package. You can install it with these commands:
library(devtools)
install_github("hadley/pryr")  # current devtools expects "user/repo"
require(pryr)
require(plyr)
c <- ddply(a, .(cat), summarise, print(where("mode")))
# <environment: namespace:base>
# <environment: namespace:base>
# <environment: namespace:base>
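A likely explanation, stated here as an assumption about plyr's internals rather than something verified against its source: the summarise expressions are evaluated from a frame that lives inside a package namespace, and a namespace's chain of enclosing environments reaches namespace:base before it reaches the global environment. The sketch below shows that lookup order in plain R, using the stats namespace as a stand-in; it doesn't call plyr at all.

```r
# A global function that masks base::mode, as in the question
mode <- function(x) names(table(x))[which.max(table(x))]

# Lookup starting from the global environment finds our version
from_global <- get("mode", envir = globalenv(), mode = "function")
identical(from_global, mode)

# Lookup starting inside a package namespace (stats is only a stand-in)
# walks namespace -> imports -> namespace:base before reaching globalenv()
from_ns <- get("mode", envir = asNamespace("stats"), mode = "function")
identical(from_ns, base::mode)
environmentName(environment(from_ns))
```

Both identical() calls should come back TRUE, and the last line should name base: the same name resolves to two different functions depending on where the search starts.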
Basically, it doesn't read/know/see your mode function. There are two alternatives. The first is what @AnandaMahto suggested; I'd do the same and would advise you to stick with it. The other alternative is to not use summarise and to pass an anonymous function, function(x), instead, so that the mode function in your global environment is "seen".
c <- ddply(a, .(cat), function(x) mode(x$levels))
# cat V1
# 1 1 6
# 2 2 5
# 3 3 9
Why does this work?
c <- ddply(a, .(cat), function(x) print(where("mode")))
# <environment: R_GlobalEnv>
# <environment: R_GlobalEnv>
# <environment: R_GlobalEnv>
Because, as you can see above, it finds your function, which sits in the global environment.
> mode # your function
# function(x)
# names(table(x))[which.max(table(x))]
> environment(mode) # where it sits
# <environment: R_GlobalEnv>
as opposed to:
> base::mode # base's mode function
# function (x)
# {
# some lines of code that return the storage mode (type) of x
# }
# <bytecode: 0x7fa2f2bff878>
# <environment: namespace:base>
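To make the contrast concrete: base's mode() has nothing to do with the statistical mode; it reports how an object is stored. A quick check with a made-up vector (Mode is just the illustrative helper name used in the other answer):

```r
x <- factor(c("a", "b", "b"))

storage <- base::mode(x)   # storage mode of a factor: "numeric"
storage
base::mode("text")         # "character"

# The statistical mode needs a separate helper
Mode <- function(x) names(which.max(table(x)))
most_common <- Mode(x)     # "b"
most_common
```

So when summarise picks up base's version, each group gets a description of the column's type rather than its most frequent level.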
Here's an awesome wiki on environments from Hadley, if you're interested in reading it or exploring further.