How to calculate the Kolmogorov-Smirnov statistic between two weighted samples

Question

Suppose we have two samples, data1 and data2, with respective weights weight1 and weight2, and we want to compute the Kolmogorov-Smirnov statistic between the two weighted samples.

This is how we currently do it in Python:

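import numpy as np

def ks_w(data1, data2, wei1, wei2):
    # Sort each sample and reorder its weights to match.
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    data1 = data1[ix1]
    data2 = data2[ix2]
    d = 0.     # largest CDF difference seen so far
    fn1 = 0.   # weighted empirical CDF of sample 1 at the current point
    fn2 = 0.   # weighted empirical CDF of sample 2 at the current point
    j1 = 0
    j2 = 0
    j1w = 0.   # cumulative weight consumed from sample 1
    j2w = 0.   # cumulative weight consumed from sample 2
    # Walk through both sorted samples in parallel until one is exhausted.
    while j1 < len(data1) and j2 < len(data2):
        d1 = data1[j1]
        d2 = data2[j2]
        w1 = wei1[j1]
        w2 = wei2[j2]
        if d1 <= d2:
            j1 += 1
            j1w += w1
            fn1 = j1w / sum(wei1)  # total weight is recomputed on every step
        if d2 <= d1:
            j2 += 1
            j2w += w2
            fn2 = j2w / sum(wei2)
        if abs(fn2 - fn1) > d:
            d = abs(fn2 - fn1)
    return d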

Here we simply adapt the classical two-sample KS statistic as implemented in Press, Flannery, Teukolsky, Vetterling, Numerical Recipes in C, Cambridge University Press, 1992, p. 626.
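For reference, the quantity this function computes can be written out explicitly. The notation below (x_{k,i} for the i-th value of sample k, w_{k,i} for its weight, n_k for the sample size) is introduced here for illustration and does not come from the original post.

```latex
F_k(x) = \frac{\sum_{i=1}^{n_k} w_{k,i}\,\mathbf{1}[x_{k,i} \le x]}{\sum_{i=1}^{n_k} w_{k,i}},
\quad k = 1, 2,
\qquad
D = \sup_x \left| F_1(x) - F_2(x) \right|
```

With all weights equal, F_k reduces to the ordinary empirical CDF and D to the classical two-sample KS statistic.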

Our questions are:

Is anybody aware of any other way to do it?
Is there any library in Python/R/* that does this?
What about the test? Does one exist, or should we use a reshuffling procedure to evaluate the statistic?

Answer 1:

By studying the scipy.stats.ks_2samp code, we found a more efficient Python solution. We share it below in case anyone is interested:

from __future__ import division  # (for python 2/3 support)

import numpy as np

def ks_w2(data1, data2, wei1, wei2):
    # Sort each sample and reorder its weights to match.
    ix1 = np.argsort(data1)
    ix2 = np.argsort(data2)
    data1 = data1[ix1]
    data2 = data2[ix2]
    wei1 = wei1[ix1]
    wei2 = wei2[ix2]
    # Evaluate both CDFs at every observed point.
    data = np.concatenate([data1, data2])
    # Normalized cumulative weights, with a leading 0 for points below the minimum.
    cwei1 = np.hstack([0, np.cumsum(wei1) / np.sum(wei1)])
    cwei2 = np.hstack([0, np.cumsum(wei2) / np.sum(wei2)])
    # side='right' counts every observation <= the evaluation point.
    cdf1we = cwei1[np.searchsorted(data1, data, side='right')]
    cdf2we = cwei2[np.searchsorted(data2, data, side='right')]
    return np.max(np.abs(cdf1we - cdf2we))
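A quick way to sanity-check this kind of implementation is to verify that integer weights behave like repeated observations. The sketch below is self-contained and uses its own helper, weighted_ks (a hypothetical name that follows the same searchsorted approach as above), on a tiny made-up data set.

```python
import numpy as np

def weighted_ks(x1, x2, w1, w2):
    """Maximum distance between two weighted empirical CDFs."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    w1, w2 = np.asarray(w1, dtype=float), np.asarray(w2, dtype=float)
    o1, o2 = np.argsort(x1), np.argsort(x2)
    x1, w1 = x1[o1], w1[o1]
    x2, w2 = x2[o2], w2[o2]
    grid = np.concatenate([x1, x2])
    # Normalized cumulative weights with a leading 0.
    c1 = np.concatenate([[0.0], np.cumsum(w1) / w1.sum()])
    c2 = np.concatenate([[0.0], np.cumsum(w2) / w2.sum()])
    f1 = c1[np.searchsorted(x1, grid, side="right")]
    f2 = c2[np.searchsorted(x2, grid, side="right")]
    return np.max(np.abs(f1 - f2))

# Illustrative data: integer weights, so repetition is well defined.
x1 = np.array([0.1, 0.4, 0.7])
w1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.2, 0.5, 0.9, 1.3])
w2 = np.array([3.0, 1.0, 1.0, 1.0])

d_weighted = weighted_ks(x1, x2, w1, w2)

# Same samples with each value repeated "weight" times and unit weights.
r1 = np.repeat(x1, w1.astype(int))
r2 = np.repeat(x2, w2.astype(int))
d_repeated = weighted_ks(r1, r2, np.ones(len(r1)), np.ones(len(r2)))

print(d_weighted, d_repeated)  # both are 1/3 up to floating-point rounding
```

The largest gap occurs at x = 0.7, where the first CDF has reached 1 and the second is at 4/6.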

To evaluate the performance, we ran the following test:

ds1 = np.random.rand(10000)
ds2 = np.random.randn(40000) + .2
we1 = np.random.rand(10000) + 1.
we2 = np.random.rand(40000) + 1.

On our machine, ks_w2(ds1, ds2, we1, we2) took 11.7 ms, while ks_w(ds1, ds2, we1, we2) took 1.43 s.
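The question also asks about a reshuffling procedure for the test. One possible version is sketched below. It rests on an assumption that is not from the original post: under the null hypothesis, the (value, weight) pairs are exchangeable between the two samples, so the pairs are pooled and randomly reassigned. The names weighted_ks and permutation_pvalue are hypothetical, and whether this exchangeability holds depends on how the weights were produced.

```python
import numpy as np

def weighted_ks(x1, x2, w1, w2):
    """Maximum distance between two weighted empirical CDFs."""
    o1, o2 = np.argsort(x1), np.argsort(x2)
    x1, w1 = x1[o1], w1[o1]
    x2, w2 = x2[o2], w2[o2]
    grid = np.concatenate([x1, x2])
    c1 = np.concatenate([[0.0], np.cumsum(w1) / w1.sum()])
    c2 = np.concatenate([[0.0], np.cumsum(w2) / w2.sum()])
    f1 = c1[np.searchsorted(x1, grid, side="right")]
    f2 = c2[np.searchsorted(x2, grid, side="right")]
    return np.max(np.abs(f1 - f2))

def permutation_pvalue(x1, x2, w1, w2, n_perm=999, seed=0):
    """Permutation p-value for the weighted KS statistic.

    Assumes (value, weight) pairs are exchangeable under the null.
    """
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    w1, w2 = np.asarray(w1, dtype=float), np.asarray(w2, dtype=float)
    rng = np.random.default_rng(seed)
    x = np.concatenate([x1, x2])
    w = np.concatenate([w1, w2])
    n1 = len(x1)
    observed = weighted_ks(x1, x2, w1, w2)
    exceed = 0
    for _ in range(n_perm):
        # Reassign pooled (value, weight) pairs to two groups of the original sizes.
        perm = rng.permutation(len(x))
        a, b = perm[:n1], perm[n1:]
        if weighted_ks(x[a], x[b], w[a], w[b]) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic itself, so p is never exactly 0.
    return observed, (exceed + 1) / (n_perm + 1)
```

The cost is n_perm evaluations of the statistic, so a vectorized statistic like the one above matters far more here than for a single call.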

Answer 2:

This is an R version of a two-tailed weighted KS statistic, following the suggestion in Numerical Methods of Statistics by Monahan (p. 334 in the 1st edition, p. 358 in the 2nd edition).

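ks_weighted <- function(vector_1,vector_2,weights_1,weights_2){
    # ewcdf() (weighted empirical CDF) comes from the spatstat family of packages,
    # which must be loaded beforehand.
    # With normalise=FALSE the weights are used as given, so each weight vector
    # is assumed to already sum to 1.
    F_vec_1 <- ewcdf(vector_1, weights = weights_1, normalise=FALSE)
    F_vec_2 <- ewcdf(vector_2, weights = weights_2, normalise=FALSE)
    # Evaluate both CDFs at every observed point and take the largest gap.
    xw <- c(vector_1,vector_2)
    d <- max(abs(F_vec_1(xw) - F_vec_2(xw)))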

    ## P-VALUE with NORMAL SAMPLE 
    # n_vector_1 <- length(vector_1)                                                           
    # n_vector_2<- length(vector_2)        
    # n <- n_vector_1 * n_vector_2/(n_vector_1 + n_vector_2)

    # P-VALUE EFFECTIVE SAMPLE SIZE as suggested by Monahan
    n_vector_1 <- sum(weights_1)^2/sum(weights_1^2)
    n_vector_2 <- sum(weights_2)^2/sum(weights_2^2)
    n <- n_vector_1 * n_vector_2/(n_vector_1 + n_vector_2)


    pkstwo <- function(x, tol = 1e-06) {
                if (is.numeric(x)) 
                    x <- as.double(x)
                else stop("argument 'x' must be numeric")
                p <- rep(0, length(x))
                p[is.na(x)] <- NA
                IND <- which(!is.na(x) & (x > 0))
                if (length(IND)) 
                    p[IND] <- .Call(stats:::C_pKS2, p = x[IND], tol)
                p
            }

    pval <- 1 - pkstwo(sqrt(n) * d)

    out <- c(KS_Stat=d, P_value=pval)
    return(out)
}
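For readers working in Python, the p-value step above could be approximated as follows. This is a sketch rather than a port of R's internal C_pKS2 routine: it uses the standard asymptotic Kolmogorov series Q(λ) = 2 Σ (-1)^(k-1) exp(-2 k² λ²) together with the same effective sample size formula. The function names are introduced here for illustration.

```python
import math

def effective_size(weights):
    """Effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def kolmogorov_sf(lam, terms=100):
    """Asymptotic survival function of the Kolmogorov distribution."""
    # Below about 0.2 the survival probability is 1 to within rounding,
    # and the alternating series converges slowly there.
    if lam < 0.2:
        return 1.0
    s = 0.0
    for k in range(1, terms + 1):
        s += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    # Clamp to [0, 1] to absorb rounding error.
    return min(1.0, max(0.0, 2.0 * s))

def weighted_ks_pvalue(d, weights_1, weights_2):
    """Asymptotic p-value for a weighted KS statistic d."""
    n1 = effective_size(weights_1)
    n2 = effective_size(weights_2)
    n = n1 * n2 / (n1 + n2)
    return kolmogorov_sf(math.sqrt(n) * d)

# Equal weights give back the plain sample size.
print(effective_size([1.0, 1.0, 1.0, 1.0]))  # 4.0
```

With unequal weights the effective size drops below the actual count, which makes the test more conservative than one that uses the raw sample sizes.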

Source: https://stackoverflow.com/questions/40044375/how-to-calculate-the-kolmogorov-smirnov-statistic-between-two-weighted-samples

Tags

python

scipy

kolmogorov-smirnov