Calculating most frequent level by category with plyr

失恋的感觉 2021-01-19 23:44

I would like to calculate the most frequent factor level by category with plyr, using the code below. The data frame b shows the requested result. Why does

2 answers
  • 2021-01-20 00:17

    You have pretty much exclusively used existing function names in your example: levels, cat, and mode. Generally, that doesn't create much of a problem--for example, calling a data.frame "df" doesn't break R's df() function. But it almost always leads to more ambiguous or confusing code, and in this case, it made things "break". Arun's answer does a great job of showing why.

    You can easily fix your problem by renaming your "mode" function. In the example below, I've simplified it a little bit in addition to renaming it, and it works as you expected.

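    library(plyr)  # provides ddply() and summarise()

    # Capitalised so it no longer clashes with base::mode()
    Mode <- function(x) names(which.max(table(x)))
    ddply(a, .(cat), summarise,
          mlevels = Mode(levels))
    #   cat mlevels
    # 1   1       6
    # 2   2       5
    # 3   3       9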
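
    If you want to try the renamed helper without plyr, here is a minimal base R sketch. The data frame is a made-up stand-in, since the question's `a` isn't shown, and tapply is swapped in for ddply purely for illustration.

    ```r
    # Hypothetical stand-in for the question's data frame `a` (the original is not shown)
    a <- data.frame(
      cat    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
      levels = factor(c(6, 6, 4, 5, 5, 7, 9, 9, 8))
    )

    # Same helper as above: the name of the largest cell in the frequency table
    Mode <- function(x) names(which.max(table(x)))

    # Base R stand-in for the ddply call: apply Mode to `levels` within each `cat`
    result <- tapply(a$levels, a$cat, Mode)
    print(result)
    ```

    Note that which.max returns the first maximum, so if two levels tie within a category, the one that comes first in the table wins.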
    

    Of course, there's a really cumbersome workaround: Use get and specify where to search for the function.

    mode <- function(x) names(table(x))[which.max(table(x))]
    ddply(a, .(cat), summarise, mlevels = get("mode", ".GlobalEnv")(levels))
    #   cat mlevels
    # 1   1       6
    # 2   2       5
    # 3   3       9
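
    To see what get is doing here with plyr out of the picture, here's a minimal sketch. The vector x is made up for illustration, and it assumes the code runs at top level so that mode ends up in the global environment.

    ```r
    # User-defined helper that shares its name with base::mode()
    # (assumes this runs at top level, so it is stored in the global environment)
    mode <- function(x) names(table(x))[which.max(table(x))]

    x <- c("b", "a", "b")  # made-up sample values

    # Ask for the function by name, starting the search in the global environment
    user_mode <- get("mode", envir = globalenv())

    print(user_mode(x))   # most frequent value
    print(base::mode(x))  # storage mode of x, which is what base's function reports
    ```

    The two calls give different answers because they are different functions that merely share a name, which is exactly why the rename is the cleaner fix.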
    
  • 2021-01-20 00:29

    When you use summarise, plyr seems to find the function in base before it ever looks at the one declared in the global environment, so your version is "not seen":

    We can check this using Hadley's handy pryr package. You can install it with these commands:

    library(devtools)
    install_github("hadley/pryr")  # current devtools needs the "user/repo" form
    
    
    require(pryr)
    require(plyr)
    c <- ddply(a, .(cat), summarise, print(where("mode")))
    # <environment: namespace:base>
    # <environment: namespace:base>
    # <environment: namespace:base>
    

    Basically, it doesn't see your mode function. There are two alternatives. The first is what @AnandaMahto suggested; I'd do the same and would advise you to stick with it. The other alternative is to not use summarise and instead pass an anonymous function, function(x), so that the mode function in your global environment is "seen".

    c <- ddply(a, .(cat), function(x) mode(x$levels))
    #   cat V1
    # 1   1  6
    # 2   2  5
    # 3   3  9
    

    Why does this work?

    c <- ddply(a, .(cat), function(x) print(where("mode")))
    # <environment: R_GlobalEnv>
    # <environment: R_GlobalEnv>
    # <environment: R_GlobalEnv>
    

    Because, as the output above shows, the lookup now finds your function, which sits in the global environment.

    > mode # your function
    # function(x)
    #     names(table(x))[which.max(table(x))]
    > environment(mode) # where it sits
    # <environment: R_GlobalEnv>
    

    as opposed to:

    > base::mode # base's mode function
    # function (x) 
    # {
    #     some lines of code to compute mode
    # }
    # <bytecode: 0x7fa2f2bff878>
    # <environment: namespace:base>
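
    One way to picture the difference is the sketch below. It is only a model under assumptions, not plyr's actual code: it assumes summarise evaluates its arguments in an environment enclosed by package internals, so the search reaches the base namespace before the global environment. The names pkg_like and find_env are made up for illustration, and find_env is just a simplified take on the idea behind pryr's where.

    ```r
    # Hypothetical user function that masks base::mode at top level
    mode <- function(x) names(which.max(table(x)))

    # Stand-in for a package's internals: an environment enclosed by the base
    # namespace, whose own enclosure is the global environment
    pkg_like <- new.env(parent = asNamespace("base"))

    # Walk outward from `env` until an environment defines `name`
    find_env <- function(name, env) {
      while (!identical(env, emptyenv())) {
        if (exists(name, envir = env, inherits = FALSE)) return(env)
        env <- parent.env(env)
      }
      NULL
    }

    # Lookup that starts "inside a package" hits base first
    print(environmentName(find_env("mode", pkg_like)))

    # Lookup that starts in user code hits the global environment first
    print(environmentName(find_env("mode", globalenv())))
    ```

    An anonymous function written at the prompt is enclosed by the global environment, so its lookups start from the second position rather than the first.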
    

    If you're interested in exploring further, Hadley has an awesome wiki page on environments.
