Compute sample statistics for a data vector with ties which is stored as a frequency table

Question

I am trying to get some summary statistics (mean, variance and quantiles) from a data vector with tied values. The data are stored as a frequency table with two columns: var, the unique data values, and frequency, the number of times each value occurs.

I know I could use the rep function to first expand the vector to its full form:

xx <- rep(mydata$var, mydata$frequency)

and then call the standard functions:

mean(xx)
var(xx)
quantile(xx)

But the frequencies are very large and there are many unique values, so expanding the vector makes the program very slow. Is there a way to compute these statistics directly from var and frequency?

Answer 1:

set.seed(0)
x <- runif(10)                ## unique data values
k <- sample.int(5, 10, TRUE)  ## frequency

n <- sum(k)
xx <- rep.int(x, k)           ## "expanded" data

#################
## sample mean ##
#################

mean(xx)  ## using `xx`
#[1] 0.6339458

mu <- c(crossprod(x, k)) / n  ## using `x` and `k`
mu
#[1] 0.6339458

#####################
## sample variance ##
#####################

var(xx) * (n - 1) / n  ## using `xx`; rescaled to the divide-by-n variance
#[1] 0.06862544

v <- c(crossprod(x ^ 2, k)) / n - mu * mu  ## using `x` and `k`
v
#[1] 0.06862544
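
Note that the value above is the divide-by-n variance; multiplying it by n / (n - 1) gives what var(xx) returns. The "mean of squares minus squared mean" formula can also lose precision when the mean is large relative to the spread, so a centered sum of squares may be the safer choice. The following is only a sketch: the helper name freq_mean_var and the small data set are hypothetical and not part of the original answer.

```r
## Hypothetical helper (not from the original answer): mean and variance
## of the expanded data, computed from unique values `x` and counts `k`.
freq_mean_var <- function (x, k) {
  n  <- sum(k)                  ## total number of observations
  mu <- sum(x * k) / n          ## frequency-weighted mean
  ss <- sum(k * (x - mu) ^ 2)   ## centered sum of squares
  list(mean   = mu,
       var_n  = ss / n,         ## divide-by-n variance
       var_n1 = ss / (n - 1))   ## divide-by-(n - 1) variance, as in var()
}

## small made-up example
x <- c(0.2, 0.5, 0.9, 1.4)
k <- c(3L, 1L, 4L, 2L)
s <- freq_mean_var(x, k)

xx <- rep.int(x, k)             ## expanded data, used for checking only
all.equal(s$mean, mean(xx))
all.equal(s$var_n1, var(xx))
```

Both comparisons should agree up to floating-point rounding.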

Computing quantiles is more involved, but doable. We first need to understand how quantiles are computed in the standard way.

xx <- sort(xx)
pp <- seq(0, 1, length = n)
plot(pp, xx); abline(v = pp, col = 8, lty = 2)

R's default quantile computation (type 7) is a linear interpolation of the sorted data xx against the evenly spaced probabilities pp. When the data have ties, the plot clearly shows "runs" (of the same value) and "jumps" (between two adjacent values). Linear interpolation is only needed on "jumps"; on "runs" the quantile is simply the run value.
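
As a quick check of this interpolation view, the sketch below compares approx() on the points (pp, xx) with quantile() at its default settings. The data and the prob values are made up for illustration.

```r
## made-up data with ties
xx <- sort(c(1, 1, 1, 2, 4, 4, 7))
n  <- length(xx)
pp <- seq(0, 1, length.out = n)   ## probability attached to each sorted value

prob <- c(0.1, 0.25, 0.5, 0.8)
q_interp <- approx(pp, xx, xout = prob)$y       ## plain linear interpolation
q_ref    <- quantile(xx, prob, names = FALSE)   ## R's default (type 7)

all.equal(q_interp, q_ref)
```

The two results should agree up to floating-point rounding, which is why the problem reduces to finding the run or jump that each probability falls on.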

The following function finds quantiles using only x and k. For demonstration purposes it has an argument verbose; if TRUE, the function produces a plot and prints a data frame describing the "runs" (and, implicitly, the "jumps").

find_quantile <- function (x, k, prob = seq(0, 1, length = 5), verbose = FALSE) {

  if (is.unsorted(x)) {
    ind <- order(x); x <- x[ind]; k <- k[ind]
    }

  m <- length(x)     ## number of unique values
  n <- sum(k)        ## number of data
  d <- 1 / (n - 1)   ## break [0, 1] into (n - 1) intervals

  ## the right and left end of each run
  r <- (cumsum(k) - 1) * d
  l <- r - (k - 1) * d

  if (verbose) {

    breaks <- seq(0, 1, d)
    plot(r, x, "n", xlab = "prob (p)", ylab = "quantile (xq)", xlim = c(0, 1))
    abline(v = breaks, col = 8, lty = 2)

    ## sketch each run
    segments(l, x, r, x, lwd = 3)

    ## sketch each jump
    segments(r[-m], x[-m], l[-1], x[-1], lwd = 3, col = 2)

    ## sketch `prob`
    abline(v = prob, col = 3)

    print( data.frame(x, k, l, r) )
    }

  ## initialize the vector of quantiles 
  xq <- numeric(length(prob))

  run <- rbind(l, r)
  i <- findInterval(prob, run, rightmost.closed = TRUE)

  ## odd integers in `i` mean that `prob` lies on runs
  ## quantiles on runs are just run values
  on_run <- (i %% 2) != 0
  run_id <- (i[on_run] + 1) / 2
  xq[on_run] <- x[run_id]

  ## even integers in `i` mean that `prob` lies on jumps
  ## quantiles on jumps are linear interpolations
  on_jump <- !on_run
  jump_id <- i[on_jump] / 2
  xl <- x[jump_id]      ## x-value to the left of the jump
  xr <- x[jump_id + 1]  ## x-value to the right of the jump
  pl <- r[jump_id]      ## percentile to the left of the jump
  pr <- l[jump_id + 1]  ## percentile to the right of the jump
  p  <- prob[on_jump]   ## probability on the jump
  ## evaluate the line `(pl, xl) -- (pr, xr)` at `p`
  xq[on_jump] <- (xr - xl) / (pr - pl) * (p - pl) + xl

  xq
  }

Applying the function to the example data above with verbose = TRUE gives:

result <- find_quantile(x, k, prob = seq(0, 1, length = 5), TRUE)

#           x k         l         r
#1  0.2016819 4 0.0000000 0.1111111
#2  0.2655087 2 0.1481481 0.1851852
#3  0.3721239 1 0.2222222 0.2222222
#4  0.5728534 4 0.2592593 0.3703704
#5  0.6291140 2 0.4074074 0.4444444
#6  0.6607978 5 0.4814815 0.6296296
#7  0.8966972 1 0.6666667 0.6666667
#8  0.8983897 3 0.7037037 0.7777778
#9  0.9082078 2 0.8148148 0.8518519
#10 0.9446753 4 0.8888889 1.0000000

Each row of the data frame is a "run". x gives the run value, k is the run length, and l and r are the probabilities at the left and right ends of the run. In the figure, "runs" are drawn as black horizontal lines.

Information about "jumps" is implied by the r and x values of one row and the l and x values of the next row. In the figure, "jumps" are drawn as red lines.

The vertical green lines mark the prob values we supplied.
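
To see where the l and r columns come from, the run end points can be reproduced from the frequencies alone. The vector k below is copied from the k column of the printed data frame; everything else follows the formulas used inside find_quantile.

```r
k <- c(4, 2, 1, 4, 2, 5, 1, 3, 2, 4)   ## run lengths from the table above
n <- sum(k)                            ## 28 data values in total
d <- 1 / (n - 1)                       ## spacing of the probability grid

r <- (cumsum(k) - 1) * d   ## probability of the last value in each run
l <- r - (k - 1) * d       ## probability of the first value in each run

round(cbind(k, l, r), 7)
```

A run of length 1 has l equal to r, and each jump spans exactly one grid step d, from r of one row to l of the next.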

The computed quantiles are

result
#[1] 0.2016819 0.5226710 0.6607978 0.8983897 0.9446753

which are identical to

quantile(xx, names = FALSE)
#[1] 0.2016819 0.5226710 0.6607978 0.8983897 0.9446753
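
As an independent cross-check, the same quantiles can also be obtained by locating the two neighbouring order statistics directly from the cumulative frequencies. This is an alternative sketch, not the method used above: the name freq_quantile and the small data set are hypothetical, and it assumes the default type-7 definition and non-negative integer counts.

```r
## Hypothetical alternative (not the answer's method): type-7 quantiles
## from unique values `x` and counts `k`, via cumulative frequencies.
freq_quantile <- function (x, k, prob = seq(0, 1, length.out = 5)) {
  ind <- order(x); x <- x[ind]; k <- k[ind]
  cum <- cumsum(k)
  n   <- cum[length(cum)]           ## total number of data values
  h   <- (n - 1) * prob + 1         ## fractional position in the sorted data
  lo  <- floor(h)                   ## order statistic just below `h`
  hi  <- pmin(lo + 1, n)            ## order statistic just above `h`
  ## the j-th order statistic is the first unique value
  ## whose cumulative count reaches j
  x_lo <- x[findInterval(lo - 1, cum) + 1]
  x_hi <- x[findInterval(hi - 1, cum) + 1]
  x_lo + (h - lo) * (x_hi - x_lo)   ## linear interpolation
}

x <- c(0.9, 0.2, 1.4, 0.5)   ## made-up unique values (unsorted on purpose)
k <- c(4L, 3L, 2L, 1L)       ## made-up frequencies
q_freq <- freq_quantile(x, k)
q_ref  <- quantile(rep.int(x, k), names = FALSE)
all.equal(q_freq, q_ref)
```

The comparison should hold up to floating-point rounding. This version trades the run and jump bookkeeping for two findInterval lookups per probability.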

Source: https://stackoverflow.com/questions/52225843/compute-sample-statistics-for-a-data-vector-with-ties-which-is-stored-as-a-frequ

Tags: statistics, mean, variance, quantile