Measures of association in R — Kendall's tau-b and tau-c


Are there any R packages for the calculation of Kendall's tau-b and tau-c, and their associated standard errors? My searches on Google and Rseek have turned up nothing, but su

9 Answers
  •  栀梦 (thread starter)
    2021-01-30 10:17

    There are three Kendall tau statistics (tau-a, tau-b, and tau-c).

    They are not interchangeable, and none of the answers posted so far deals with the last two, which are the subject of the OP's question.

    I was unable to find functions to calculate tau-b or tau-c, either in the R standard library (stats et al.) or in any of the packages available on CRAN or other repositories. I used the excellent R package sos to search, so I believe the results returned were reasonably thorough.

    So that's the short answer to the OP's Question: no built-in or Package function for tau-b or tau-c.

    But it's easy to roll your own.

    Writing R functions for the Kendall statistics is just a matter of translating these equations into code:

    Kendall_tau_a = (P - Q) / (n * (n - 1) / 2)
    
    Kendall_tau_b = (P - Q) / ( (P + Q + Y0) * (P + Q + X0) ) ^ 0.5 
    
    Kendall_tau_c = (P - Q) * (2 * m) / (n ^ 2 * (m - 1))
    

    tau-a: equal to concordant minus discordant pairs, divided by the total number of pairs, n * (n - 1) / 2, where n is the sample size.

    tau-b: explicitly accounts for ties, i.e., pairs whose two members have the same value on one of the variables. It equals concordant minus discordant pairs divided by the geometric mean of the number of pairs not tied on y (P + Q + X0) and the number not tied on x (P + Q + Y0), where X0 is the number of pairs tied only on x and Y0 the number tied only on y.

    tau-c: a variant better suited to larger and non-square tables; equal to concordant minus discordant pairs multiplied by a factor that adjusts for table size.

    # Number of concordant pairs.
    P = function(t) {
      r_ndx = row(t)
      c_ndx = col(t)
      sum(t * mapply(function(r, c){sum(t[(r_ndx > r) & (c_ndx > c)])},
        r = r_ndx, c = c_ndx))
    }
    
    # Number of discordant pairs.
    Q = function(t) {
      r_ndx = row(t)
      c_ndx = col(t)
      sum(t * mapply( function(r, c){
          sum(t[(r_ndx > r) & (c_ndx < c)])
      },
        r = r_ndx, c = c_ndx) )
    }
    
    # Sample size (total number of observations in the table).
    n = sum(t)
    
    # The lesser of number of rows or columns.
    m = min(dim(t))
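
    As a quick sanity check of the helpers above, here is a minimal sketch on a made-up 2x2 table. The table `tab` and its counts are my own illustration, not data from the question; P and Q are repeated so the block runs on its own.

```r
# Hypothetical 2x2 table of counts, used only for illustration.
tab = matrix(c(3, 1,
               1, 3), nrow = 2, byrow = TRUE)

# Concordant pairs: each cell times everything below and to its right.
P = function(t) {
  r_ndx = row(t)
  c_ndx = col(t)
  sum(t * mapply(function(r, c) sum(t[(r_ndx > r) & (c_ndx > c)]),
                 r = r_ndx, c = c_ndx))
}

# Discordant pairs: each cell times everything below and to its left.
Q = function(t) {
  r_ndx = row(t)
  c_ndx = col(t)
  sum(t * mapply(function(r, c) sum(t[(r_ndx > r) & (c_ndx < c)]),
                 r = r_ndx, c = c_ndx))
}

n = sum(tab)        # 8 observations
m = min(dim(tab))   # 2

P(tab)  # 3 * 3 = 9 concordant pairs
Q(tab)  # 1 * 1 = 1 discordant pair
```

    Each pair is counted once here, which is why the tau-c formula carries the factor of 2.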
    

    So these four parameters are all you need to calculate tau-a, tau-b, and tau-c:

    • P

    • Q

    • m

    • n

    (plus X0 & Y0 for tau-b)


    For instance, the code for tau-c is:

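    kendall_tau_c = function(t){
        t = as.matrix(t)
        # The lesser of number of rows or columns.
        m = min(dim(t))
        # Sample size.
        n = sum(t)
        # Return the value visibly rather than as the result of an assignment.
        (m * 2 * (P(t) - Q(t))) / ((n ^ 2) * (m - 1))
    }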
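
    tau-a and tau-b can be sketched the same way. This is only a sketch under stated assumptions: the helper `tied_pairs` is a name I made up, it treats x as the row variable and y as the column variable, and it takes X0 and Y0 to be the pairs tied on x only and on y only. P and Q are repeated so the block runs on its own.

```r
# Concordant and discordant pair counts (each pair counted once).
P = function(t) {
  r_ndx = row(t)
  c_ndx = col(t)
  sum(t * mapply(function(r, c) sum(t[(r_ndx > r) & (c_ndx > c)]),
                 r = r_ndx, c = c_ndx))
}
Q = function(t) {
  r_ndx = row(t)
  c_ndx = col(t)
  sum(t * mapply(function(r, c) sum(t[(r_ndx > r) & (c_ndx < c)]),
                 r = r_ndx, c = c_ndx))
}

# Hypothetical helper: pairs tied on x only (X0) and on y only (Y0).
# Assumption: x is the row variable, y is the column variable.
tied_pairs = function(t) {
  both = sum(choose(t, 2))                 # same cell: tied on both
  X0 = sum(choose(rowSums(t), 2)) - both   # same row, different column
  Y0 = sum(choose(colSums(t), 2)) - both   # same column, different row
  list(X0 = X0, Y0 = Y0)
}

kendall_tau_a = function(t) {
  t = as.matrix(t)
  n = sum(t)
  (P(t) - Q(t)) / (n * (n - 1) / 2)
}

kendall_tau_b = function(t) {
  t = as.matrix(t)
  p = P(t)
  q = Q(t)
  ties = tied_pairs(t)
  (p - q) / sqrt((p + q + ties$X0) * (p + q + ties$Y0))
}

# Made-up 2x2 table, for illustration only.
tab = matrix(c(3, 1,
               1, 3), nrow = 2, byrow = TRUE)

kendall_tau_a(tab)  # 8 / 28, about 0.286
kendall_tau_b(tab)  # 8 / sqrt(16 * 16) = 0.5
```

    On this table P = 9, Q = 1, and X0 = Y0 = 6, so the five kinds of pairs (9 + 1 + 6 + 6 + 6 tied on both) add up to the 28 possible pairs among 8 observations.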
    

    So how are Kendall's tau statistics related to the other statistical tests used in categorical data analysis?

    All three Kendall tau statistics, along with Goodman and Kruskal's gamma, measure association between ordinal (or binary) variables. (The Kendall tau statistics are more sophisticated alternatives to the gamma statistic, which is just (P - Q) / (P + Q) and ignores tied pairs entirely.)

    And so Kendall's tau and gamma are the ordinal counterparts to the simple chi-square and Fisher's exact tests, both of which are (as far as I know) suitable only for nominal data.
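
    For comparison, gamma can be sketched from the same two counts. This uses the standard definition, gamma = (P - Q) / (P + Q); the function name `goodman_kruskal_gamma` and the example table are my own, and P and Q are repeated so the block runs on its own.

```r
# Concordant and discordant pair counts (each pair counted once).
P = function(t) {
  r_ndx = row(t)
  c_ndx = col(t)
  sum(t * mapply(function(r, c) sum(t[(r_ndx > r) & (c_ndx > c)]),
                 r = r_ndx, c = c_ndx))
}
Q = function(t) {
  r_ndx = row(t)
  c_ndx = col(t)
  sum(t * mapply(function(r, c) sum(t[(r_ndx > r) & (c_ndx < c)]),
                 r = r_ndx, c = c_ndx))
}

# Hypothetical name; tied pairs do not enter the calculation at all.
goodman_kruskal_gamma = function(t) {
  t = as.matrix(t)
  p = P(t)
  q = Q(t)
  (p - q) / (p + q)
}

# Made-up 2x2 table, for illustration only.
tab = matrix(c(3, 1,
               1, 3), nrow = 2, byrow = TRUE)

goodman_kruskal_gamma(tab)  # (9 - 1) / (9 + 1) = 0.8
```

    Because ties are dropped from the denominator, gamma is never smaller in magnitude than tau-b on the same table.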

    Example:

    cpa_group = c(4, 2, 4, 3, 2, 2, 3, 2, 1, 5, 5, 1)
    revenue_per_customer_group = c(3, 3, 1, 3, 4, 4, 4, 3, 5, 3, 2, 2)
    weight = c(1, 3, 3, 2, 2, 4, 0, 4, 3, 0, 1, 1)
    
    dfx = data.frame(CPA=cpa_group, LCV=revenue_per_customer_group, freq=weight)
    
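    # Reshape the data frame so there is 1 row for each event
    # (prerequisite step for building the contingency table).
    dfx2 = data.frame(lapply(dfx, function(x) { rep(x, dfx$freq)}))
    
    # Cross-tabulate the expanded data; the columns are named LCV and CPA.
    t = xtabs(~ LCV + CPA, dfx2)
    
    kc = kendall_tau_c(t)
    
    # Returns about -0.53. (Tabulating the unexpanded dfx instead, which
    # ignores freq, gives about -0.35.)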
    
