Optimal way to compute pairwise mutual information using numpy

一个人的身影 2020-12-07 09:43

For an m x n matrix, what's the optimal (fastest) way to compute the mutual information for all pairs of columns (n x n)?

By mutual information, I mean I(X; Y) = H(X) + H(Y) - H(X, Y), where H(·) is the Shannon entropy estimated from binned histograms, using base-2 logarithms (so the result is in bits).
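
A baseline implementation along these lines might look like the following (a sketch; the binned-histogram approach and base-2 logs match what the answer below refers to, and the helper name shan_entropy is illustrative):

import numpy as np

def shan_entropy(counts):
    # Shannon entropy (in bits) of a histogram of counts
    p = counts / float(np.sum(counts))
    p = p[p > 0]  # drop empty bins so log2 is defined
    return -np.sum(p * np.log2(p))

def calc_MI(x, y, bins):
    # MI(x, y) = H(x) + H(y) - H(x, y), estimated from binned histograms
    c_xy = np.histogram2d(x, y, bins)[0]
    c_x = np.histogram(x, bins)[0]
    c_y = np.histogram(y, bins)[0]
    return shan_entropy(c_x) + shan_entropy(c_y) - shan_entropy(c_xy)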

1 Answer
  • 2020-12-07 10:29

    I can't suggest a faster calculation for the outer loop over the n*(n-1)/2 pairs of columns (there is a sketch of that loop at the end of this answer), but your implementation of calc_MI(x, y, bins) can be simplified if you can use scipy version 0.13 or scikit-learn.

    In scipy 0.13, the lambda_ argument was added to scipy.stats.chi2_contingency. This argument controls the statistic that is computed by the function. If you use lambda_="log-likelihood" (or lambda_=0), the log-likelihood ratio is returned. This statistic is also often called the G or G² statistic. Apart from a factor of 2*n (where n is the total number of samples in the contingency table), this is the mutual information. So you could implement calc_MI as:

    import numpy as np
    from scipy.stats import chi2_contingency

    def calc_MI(x, y, bins):
        # joint histogram of x and y serves as the contingency table
        c_xy = np.histogram2d(x, y, bins)[0]
        # g is the log-likelihood-ratio (G) statistic: g = 2 * n * MI
        g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
        mi = 0.5 * g / c_xy.sum()
        return mi
    

    The only difference between this and your implementation is that this implementation uses the natural logarithm instead of the base-2 logarithm (so it is expressing the information in "nats" instead of "bits"). If you really prefer bits, just divide mi by log(2).
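
    For example, to get bits from this natural-log version (a small sketch with synthetic data; bins=10 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    y = x + rng.normal(size=1000)  # correlated with x, so MI > 0

    mi_nats = calc_MI(x, y, bins=10)  # natural-log version from above
    mi_bits = mi_nats / np.log(2)     # convert nats to bits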

    If you have (or can install) sklearn (i.e. scikit-learn), you can use sklearn.metrics.mutual_info_score, and implement calc_MI as:

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def calc_MI(x, y, bins):
        # joint histogram of x and y serves as the contingency table
        c_xy = np.histogram2d(x, y, bins)[0]
        # labels are ignored when a precomputed contingency table is given
        mi = mutual_info_score(None, None, contingency=c_xy)
        return mi
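
    With either version of calc_MI in hand, the outer loop mentioned above is just a pass over the n*(n-1)/2 column pairs. A sketch of such a driver (pairwise_MI and bins=10 are illustrative names and defaults, not part of the question's code):

    import numpy as np

    def pairwise_MI(A, bins=10):
        # n x n matrix of MI values for all pairs of columns of the
        # (m, n) array A; MI is symmetric, so each pair is computed once
        n = A.shape[1]
        mi = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                mi[i, j] = mi[j, i] = calc_MI(A[:, i], A[:, j], bins)
        return mi

    A = np.random.rand(100, 5)
    M = pairwise_MI(A)  # 5 x 5 matrix; diagonal is left at zero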
    