How to find the statistical mode?

时光取名叫无心 2020-11-21 07:00

In R, mean() and median() are standard functions which do what you'd expect. mode() tells you the internal storage mode of the object

30 answers
  • 2020-11-21 08:01

    Here is a function to find the mode:

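    # Note: defining a function called mode() masks base::mode().
    mode <- function(x) {
      unique_val <- unique(x)
      counts <- integer(length(unique_val))
      # seq_along() avoids the 1:0 trap when x is empty
      for (i in seq_along(unique_val)) {
        counts[i] <- length(which(x == unique_val[i]))
      }
      position <- which(counts == max(counts))
      # If several distinct values all occur equally often, there is no mode
      if (length(counts) > 1 && mean(counts) == max(counts)) {
        mode_x <- 'Mode does not exist'
      } else {
        mode_x <- unique_val[position]
      }
      mode_x
    }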
    
  • 2020-11-21 08:02

    Below is code that can be used to find the mode of a vector x in R.

    a <- table(x)
    
    names(a[a==max(a)])
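
    One thing to watch with the table() approach is that names() always returns character strings, even for numeric input. A minimal sketch, assuming a small made-up numeric vector x, of converting the result back:

    ```r
    # Illustrative data (not from the original answer)
    x <- c(2, 7, 7, 3, 2, 7, 9)

    a <- table(x)

    # names() yields the mode(s) as character strings
    modes_chr <- names(a[a == max(a)])

    # Convert back to numeric when the input was numeric
    modes_num <- as.numeric(modes_chr)
    ```

    Here modes_chr is "7" and modes_num is 7. For character or factor input the conversion step is not needed.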
    
  • 2020-11-21 08:02

    I was looking through all these options and started to wonder about their relative features and performance, so I did some tests. In case anyone else is curious about the same, I'm sharing my results here.

    Not wanting to bother about all the functions posted here, I chose to focus on a sample based on a few criteria: the function should work on character, factor, logical and numeric vectors; it should deal with NAs and other problematic values appropriately; and its output should be 'sensible', i.e. no numerics as character or other such silliness.

    I also added a function of my own, which is based on the same rle idea as chrispy's, except adapted for more general use:

    library(magrittr)
    
    Aksel <- function(x, freq=FALSE) {
        z <- 2
        if (freq) z <- 1:2
        run <- x %>% as.vector %>% sort %>% rle %>% unclass %>% data.frame
        colnames(run) <- c("freq", "value")
        run[which(run$freq==max(run$freq)), z] %>% as.vector   
    }
    
    set.seed(2)
    
    F <- sample(c("yes", "no", "maybe", NA), 10, replace=TRUE) %>% factor
    Aksel(F)
    
    # [1] maybe yes  
    
    C <- sample(c("Steve", "Jane", "Jonas", "Petra"), 20, replace=TRUE)
    Aksel(C, freq=TRUE)
    
    # freq value
    #    7 Steve
    

    I ended up running five functions, on two sets of test data, through microbenchmark. The function names refer to their respective authors.

    Chris' function was set to method="modes" and na.rm=TRUE by default to make it more comparable, but other than that the functions were used as presented here by their authors.

    In terms of speed alone, Ken's version wins handily, but it is also the only one of these that will report only one mode, no matter how many there really are. As is often the case, there's a trade-off between speed and versatility. With method="mode", Chris' version will return a value iff there is exactly one mode, else NA. I think that's a nice touch. I also think it's interesting how some of the functions are affected by an increased number of unique values, while others aren't nearly as much. I haven't studied the code in detail to figure out why that is, apart from eliminating logical/numeric as the cause.
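
    For anyone who wants to try a comparison like this without extra packages, here is a rough sketch using base R's system.time(). It is not the benchmark described above: the two functions are stand-ins taken from other answers in this thread (the tabulate-based Mode and the table-based one-liner, wrapped here under the hypothetical name table_mode), and the data size and repetition count are arbitrary assumptions.

    ```r
    # Stand-in 1: tabulate-based, returns a single mode
    Mode <- function(x) {
      ux <- unique(x)
      ux[which.max(tabulate(match(x, ux)))]
    }

    # Stand-in 2 (hypothetical wrapper name): table-based, returns all modes as character
    table_mode <- function(x) {
      a <- table(x)
      names(a[a == max(a)])
    }

    set.seed(1)
    x <- sample(1:100, 1e5, replace = TRUE)

    # Time 20 repetitions of each; absolute numbers depend on the machine
    t_mode  <- system.time(for (i in 1:20) Mode(x))
    t_table <- system.time(for (i in 1:20) table_mode(x))

    print(rbind(Mode = t_mode, table_mode = t_table)[, "elapsed"])
    ```

    The single value returned by Mode should always be among the values returned by table_mode, which is a quick sanity check before comparing timings.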

  • 2020-11-21 08:03

    One more solution, which works for both numeric & character/factor data:

    Mode <- function(x) {
      ux <- unique(x)
      ux[which.max(tabulate(match(x, ux)))]
    }
    

    On my dinky little machine, that can generate & find the mode of a 10M-integer vector in about half a second.

    If your data set might have multiple modes, the above solution takes the same approach as which.max, and returns the first-appearing value of the set of modes. To return all modes, use this variant (from @digEmAll in the comments):

    Modes <- function(x) {
      ux <- unique(x)
      tab <- tabulate(match(x, ux))
      ux[tab == max(tab)]
    }
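
    To make the difference between the two variants concrete, here is a small usage sketch on made-up data with a tie (the vectors x and y are illustrative, not from the original answer):

    ```r
    Mode <- function(x) {
      ux <- unique(x)
      ux[which.max(tabulate(match(x, ux)))]
    }

    Modes <- function(x) {
      ux <- unique(x)
      tab <- tabulate(match(x, ux))
      ux[tab == max(tab)]
    }

    # 3 and 1 both occur twice; 3 appears first in the data
    x <- c(3, 1, 1, 3, 2)
    one_mode  <- Mode(x)    # 3
    all_modes <- Modes(x)   # 3 1

    # The same functions work on character data and keep the type
    y <- c("b", "a", "b")
    chr_mode <- Mode(y)     # "b"
    ```

    Because the result is taken from unique(x), the return value keeps the type of the input rather than being coerced to character.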
    
  • 2020-11-21 08:03

    I found this on the R mailing list; I hope it's helpful. It is also what I was thinking anyway. You'll want to table() the data, sort the counts, and then pick the first name. It's hackish but should work.

    names(sort(-table(x)))[1]
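
    A short worked example of the one-liner on made-up data, together with an equivalent form using which.max() that skips the sort. Both return the mode as a character string, and both report only one value if there are ties:

    ```r
    # Illustrative data: "b" occurs three times, "a" twice, "c" once
    x <- c("b", "a", "c", "b", "b", "a")

    # Negate the counts so that an ascending sort puts the largest count first
    res <- names(sort(-table(x)))[1]

    # Equivalent without sorting: which.max() returns the named position of the largest count
    alt <- names(which.max(table(x)))
    ```

    Both res and alt are "b" here.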
    
  • 2020-11-21 08:03

    I can't vote yet, but Rasmus Bååth's answer is what I was looking for. However, I would modify it a bit to allow constraining the distribution, for example to values only between 0 and 1.

    estimate_mode <- function(x,from=min(x), to=max(x)) {
      d <- density(x, from=from, to=to)
      d$x[which.max(d$y)]
    }
    

    Be aware that you may not want to constrain your distribution at all; in that case, set from=-"BIG NUMBER", to="BIG NUMBER".
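
    As a usage sketch, assuming simulated data on the unit interval (the Beta(2, 5) sample and the seed are my own choices for illustration), the constrained version can be called like this:

    ```r
    estimate_mode <- function(x, from = min(x), to = max(x)) {
      # Kernel density estimate evaluated on a grid between from and to
      d <- density(x, from = from, to = to)
      # Grid point where the estimated density is highest
      d$x[which.max(d$y)]
    }

    set.seed(42)
    # Beta(2, 5) has its theoretical mode at (2 - 1) / (2 + 5 - 2) = 0.2
    x <- rbeta(1000, 2, 5)

    m <- estimate_mode(x, from = 0, to = 1)
    print(m)
    ```

    The estimate is always one of the grid points between from and to, so it cannot fall outside the chosen range. How close it lands to 0.2 depends on the sample and on the bandwidth that density() picks.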
