Question
I am trying to compute some summary statistics (mean, variance and quantiles) for a data vector with tied values. Specifically, the data are stored as a frequency distribution table: the unique data values `var` and the number of ties for each value, `frequency`.
I know I could use the `rep` function to first expand the vector to its full form:
xx <- rep(mydata$var, mydata$frequency)
and then apply the standard functions:
mean(xx)
var(xx)
quantile(xx)
But the frequencies are really large and I have many unique values, which makes the program really slow. Is there a way to compute these statistics directly from `var` and `frequency`?
Answer 1:
set.seed(0)
x <- runif(10) ## unique data values
k <- sample.int(5, 10, TRUE) ## frequency
n <- sum(k)
xx <- rep.int(x, k) ## "expanded" data
#################
## sample mean ##
#################
mean(xx) ## using `xx`
#[1] 0.6339458
mu <- c(crossprod(x, k)) / n ## using `x` and `k`
#[1] 0.6339458
#####################
## sample variance ##
#####################
var(xx) * (n - 1) / n ## using `xx`
#[1] 0.06862544
v <- c(crossprod(x ^ 2, k)) / n - mu * mu ## using `x` and `k`
#[1] 0.06862544
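Note that the variance above uses the divisor n, which is why `var(xx)` is rescaled by `(n - 1) / n`. If the usual estimate with divisor n - 1 is wanted, it can be recovered from the same quantities. The following is a minimal sketch on toy data made up for illustration; the names `mu_w`, `v_pop` and `v_unb` are not from the answer, and the centered form is used here on the assumption that it is less prone to cancellation than subtracting `mu * mu` from the mean of squares.

```r
x <- c(0.2, 0.5, 0.9, 1.4)   ## toy unique values (made up for illustration)
k <- c(3L, 1L, 4L, 2L)       ## toy frequencies
n <- sum(k)
xx <- rep.int(x, k)          ## expanded data, used only for checking

mu_w  <- weighted.mean(x, k)           ## same as sum(x * k) / n
v_pop <- sum(k * (x - mu_w)^2) / n     ## divisor n, same definition as `v`
v_unb <- v_pop * n / (n - 1)           ## divisor n - 1, as used by var()

all.equal(mu_w, mean(xx))
#[1] TRUE
all.equal(v_unb, var(xx))
#[1] TRUE
```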
Computing quantiles is more involved, but doable. We first need to understand how quantiles are computed in the standard way (R's default, `type = 7`).
xx <- sort(xx)
pp <- seq(0, 1, length = n)
plot(pp, xx); abline(v = pp, col = 8, lty = 2)
The standard quantile computation is a linear interpolation problem. However, when data have ties, we can clearly see that there are "runs" (of the same value) and "jumps" (between two values) in the plot. Linear interpolation is only needed on "jumps", while on "runs" the quantiles are just the run values.
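To make the interpolation view concrete, the sketch below checks on toy data (made up here) that linearly interpolating the sorted data over an equally spaced grid on [0, 1] reproduces `quantile()` with its default `type = 7`. It relies on base R's `approx()`; the variable names `p`, `q_interp` and `q_ref` are illustrative only.

```r
xx <- sort(c(0.2, 0.2, 0.2, 0.5, 0.9, 0.9, 0.9, 0.9, 1.4, 1.4))  ## toy data with ties
n  <- length(xx)
pp <- seq(0, 1, length.out = n)      ## probability attached to each order statistic
p  <- c(0.1, 0.3, 0.35, 0.5, 0.95)   ## probabilities to evaluate

q_interp <- approx(pp, xx, xout = p)$y      ## linear interpolation of (pp, xx)
q_ref    <- quantile(xx, p, names = FALSE)  ## default type = 7

all.equal(q_interp, q_ref)
#[1] TRUE
```

Here `p = 0.1` falls on the run of 0.2 values, so the quantile is 0.2 itself, while `p = 0.3` falls on the jump from 0.2 to 0.5 and is interpolated to 0.41.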
The following function finds quantiles using only `x` and `k`. For demonstration purposes it has an argument `verbose`. If `TRUE`, it produces a plot and prints a data frame containing information about the "runs" (and "jumps").
find_quantile <- function (x, k, prob = seq(0, 1, length = 5), verbose = FALSE) {
  if (is.unsorted(x)) {
    ind <- order(x); x <- x[ind]; k <- k[ind]
  }
  m <- length(x) ## number of unique values
  n <- sum(k) ## total number of observations
  d <- 1 / (n - 1) ## break [0, 1] into (n - 1) intervals
  ## the right and left end of each run
  r <- (cumsum(k) - 1) * d
  l <- r - (k - 1) * d
  if (verbose) {
    breaks <- seq(0, 1, d)
    plot(r, x, "n", xlab = "prob (p)", ylab = "quantile (xq)", xlim = c(0, 1))
    abline(v = breaks, col = 8, lty = 2)
    ## sketch each run
    segments(l, x, r, x, lwd = 3)
    ## sketch each jump
    segments(r[-m], x[-m], l[-1], x[-1], lwd = 3, col = 2)
    ## sketch `prob`
    abline(v = prob, col = 3)
    print( data.frame(x, k, l, r) )
  }
  ## initialize the vector of quantiles
  xq <- numeric(length(prob))
  ## interleave run ends as l[1], r[1], l[2], r[2], ... (non-decreasing)
  run <- rbind(l, r)
  i <- findInterval(prob, run, rightmost.closed = TRUE)
  ## odd integers in `i` mean that `prob` lies on runs
  ## quantiles on runs are just run values
  on_run <- (i %% 2) != 0
  run_id <- (i[on_run] + 1) / 2
  xq[on_run] <- x[run_id]
  ## even integers in `i` mean that `prob` lies on jumps
  ## quantiles on jumps are linear interpolations
  on_jump <- !on_run
  jump_id <- i[on_jump] / 2
  xl <- x[jump_id] ## x-value to the left of the jump
  xr <- x[jump_id + 1] ## x-value to the right of the jump
  pl <- r[jump_id] ## percentile to the left of the jump
  pr <- l[jump_id + 1] ## percentile to the right of the jump
  p <- prob[on_jump] ## probability on the jump
  ## evaluate the line `(pl, xl) -- (pr, xr)` at `p`
  xq[on_jump] <- (xr - xl) / (pr - pl) * (p - pl) + xl
  xq
}
Applying the function to the example data above with `verbose = TRUE` gives:
result <- find_quantile(x, k, prob = seq(0, 1, length = 5), TRUE)
#           x k         l         r
#1  0.2016819 4 0.0000000 0.1111111
#2  0.2655087 2 0.1481481 0.1851852
#3  0.3721239 1 0.2222222 0.2222222
#4  0.5728534 4 0.2592593 0.3703704
#5  0.6291140 2 0.4074074 0.4444444
#6  0.6607978 5 0.4814815 0.6296296
#7  0.8966972 1 0.6666667 0.6666667
#8  0.8983897 3 0.7037037 0.7777778
#9  0.9082078 2 0.8148148 0.8518519
#10 0.9446753 4 0.8888889 1.0000000
Each row of the data frame is a "run". `x` gives the run value, `k` is the run length, and `l` and `r` are the left and right percentiles of the run. In the figure, "runs" are drawn as black horizontal lines.
Information about the "jumps" is implied by the `r`, `x` values of one row and the `l`, `x` values of the next row. In the figure, "jumps" are drawn as red lines.
The vertical green lines mark the `prob` values we supplied.
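As a small worked example of how `l` and `r` arise, consider toy run lengths 2, 1 and 3 (made up here, not the data above). There are n = 6 data points, so [0, 1] is split into 5 intervals of width d = 0.2. The sketch reuses the two formulas from `find_quantile`:

```r
k <- c(2L, 1L, 3L)           ## toy run lengths (made up for illustration)
n <- sum(k)                  ## 6 data points
d <- 1 / (n - 1)             ## 0.2
r <- (cumsum(k) - 1) * d     ## right end of each run
l <- r - (k - 1) * d         ## left end of each run
data.frame(k, l, r)
#  k   l   r
#1 2 0.0 0.2
#2 1 0.4 0.4
#3 3 0.6 1.0
```

The second run has length 1, so its left and right ends coincide at 0.4. The two jumps span 0.2 to 0.4 and 0.4 to 0.6, each exactly d wide.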
The computed quantiles are
result
#[1] 0.2016819 0.5226710 0.6607978 0.8983897 0.9446753
which are identical to
quantile(xx, names = FALSE)
#[1] 0.2016819 0.5226710 0.6607978 0.8983897 0.9446753
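For comparison, the same `type = 7` quantiles can also be obtained by locating the two neighbouring order statistics directly through the cumulative counts. The following is only a sketch of that alternative, not the method used above; `quantile_freq` is a hypothetical helper name, and it assumes non-negative integer counts and `prob` values within [0, 1].

```r
## Hypothetical helper (not part of the answer above): type-7 quantiles
## from unique values `x` and counts `k`, without expanding the data.
quantile_freq <- function(x, k, prob = seq(0, 1, length.out = 5)) {
  ord <- order(x)
  x <- x[ord]; k <- k[ord]
  n   <- sum(k)
  cum <- cumsum(k)
  h  <- (n - 1) * prob + 1   ## fractional position among the order statistics
  lo <- floor(h)
  hi <- ceiling(h)
  ## order statistic j equals x[i] where cum[i - 1] < j <= cum[i]
  x_lo <- x[findInterval(lo - 1, cum) + 1]
  x_hi <- x[findInterval(hi - 1, cum) + 1]
  ## linear interpolation between the two neighbouring order statistics
  x_lo + (h - lo) * (x_hi - x_lo)
}

x <- c(0.9, 0.2, 1.4, 0.5)   ## toy unique values, deliberately unsorted
k <- c(4L, 3L, 2L, 1L)       ## toy frequencies
all.equal(quantile_freq(x, k), quantile(rep.int(x, k), names = FALSE))
#[1] TRUE
```

When `lo` and `hi` fall in the same run, `x_lo` equals `x_hi` and the run value is returned unchanged, which mirrors the "run" versus "jump" distinction made by `find_quantile`.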
Source: https://stackoverflow.com/questions/52225843/compute-sample-statistics-for-a-data-vector-with-ties-which-is-stored-as-a-frequ