Using pandas, calculate Cramér's coefficient matrix

名媛妹妹 2020-12-23 10:18

I have a pandas DataFrame containing metrics calculated on Wikipedia articles, along with two categorical variables: nation (which nation the article is about) and lang (the language of the Wikipedia edition).

4 Answers
  • 2020-12-23 10:57

    There is a far simpler answer. The question is about Cramér's V, so I will stick to answering that.

    For your pandas DataFrame data, if you are only interested in the language and nation columns, you can easily get a heatmap of Cramér's V using the few lines below:

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # first choose your category columns of interest
    df = data[['nation', 'lang']]
    
    # now change these to dummy variables, one-hot encoded;
    # dtype=int keeps the indicator columns numeric for .corr()
    DataMatrix = pd.get_dummies(df, dtype=int)
    
    # plot as simply as:
    plt.figure(figsize=(15, 12))  # for large datasets
    plt.title("Cramer's V comparing nation and language")
    sns.heatmap(DataMatrix.corr('pearson'), cmap='coolwarm', center=0)
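Minus the plotting, the core of this approach can be sanity-checked on a toy frame (the data below is made up purely for illustration):

```python
import pandas as pd

# made-up stand-in for the Wikipedia metrics frame
data = pd.DataFrame({
    'nation': ['us', 'us', 'fr', 'fr', 'de', 'de'],
    'lang':   ['en', 'en', 'fr', 'fr', 'de', 'en'],
})

# one-hot encode the two categorical columns
DataMatrix = pd.get_dummies(data[['nation', 'lang']], dtype=int)

# pairwise Pearson correlations between the indicator columns;
# this matrix is what the heatmap visualises
corr = DataMatrix.corr('pearson')
print(corr.shape)
```

Each nation and each language gets its own indicator column, so the resulting matrix has one row and column per category value.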
    
    

    Alternatives I can recommend are: 2 by 2 chi-squared tests of proportions, or asymmetric normalised mutual information (NMI or Theil's U).

  • 2020-12-23 11:05

    Cramér's V seems pretty over-optimistic in a few tests that I ran. Wikipedia recommends a corrected version.

    import numpy as np
    import scipy.stats as ss
    
    def cramers_corrected_stat(confusion_matrix):
        """ Calculate Cramér's V statistic for categorical-categorical association.
            Uses correction from Bergsma and Wicher,
            Journal of the Korean Statistical Society 42 (2013): 323-328
        """
        chi2 = ss.chi2_contingency(confusion_matrix)[0]
        n = confusion_matrix.sum().sum()  # works for both ndarray and DataFrame input
        phi2 = chi2/n
        r, k = confusion_matrix.shape
        phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
        rcorr = r - ((r-1)**2)/(n-1)
        kcorr = k - ((k-1)**2)/(n-1)
        return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
    

    Also note that for categorical columns the confusion matrix can be calculated with a built-in pandas method:

    import pandas as pd
    confusion_matrix = pd.crosstab(df[column1], df[column2])
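Putting the two snippets together, a full Cramér's V matrix over any set of categorical columns can be sketched as below (the `cramers_matrix` helper and the toy frame are my own illustration, not part of the original answer):

```python
import itertools
import numpy as np
import pandas as pd
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """Bergsma-Wicher corrected Cramér's V (same formula as above)."""
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.to_numpy().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))

def cramers_matrix(df, columns):
    """Symmetric DataFrame of pairwise corrected Cramér's V values."""
    out = pd.DataFrame(np.eye(len(columns)), index=columns, columns=columns)
    for a, b in itertools.combinations(columns, 2):
        v = cramers_corrected_stat(pd.crosstab(df[a], df[b]))
        out.loc[a, b] = out.loc[b, a] = v
    return out

# made-up example frame
df = pd.DataFrame({
    'nation': ['us', 'us', 'fr', 'fr', 'de', 'de'],
    'lang':   ['en', 'en', 'fr', 'fr', 'de', 'en'],
})
vmatrix = cramers_matrix(df, ['nation', 'lang'])
print(vmatrix)
```

The diagonal is 1 by construction, and each off-diagonal cell holds the corrected Cramér's V for that pair of columns.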
    
  • 2020-12-23 11:05

    A slightly modified version of the function from Ziggy Eunicien's answer, with two changes: 1) checking whether either variable is constant, and 2) passing correction=False to ss.chi2_contingency(conf_matrix, correction=correct) when the confusion matrix is 2x2.

    import scipy.stats as ss
    import pandas as pd
    import numpy as np
    
    def cramers_corrected_stat(x, y):
        """ Calculate Cramér's V statistic for categorical-categorical association.
            Uses correction from Bergsma and Wicher,
            Journal of the Korean Statistical Society 42 (2013): 323-328
        """
        result = -1
        if len(x.value_counts()) == 1:
            print("First variable is constant")
        elif len(y.value_counts()) == 1:
            print("Second variable is constant")
        else:
            conf_matrix = pd.crosstab(x, y)
    
            # disable Yates' continuity correction for 2x2 tables
            if conf_matrix.shape[0] == 2:
                correct = False
            else:
                correct = True
    
            chi2 = ss.chi2_contingency(conf_matrix, correction=correct)[0]
    
            n = sum(conf_matrix.sum())
            phi2 = chi2/n
            r, k = conf_matrix.shape
            phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
            rcorr = r - ((r-1)**2)/(n-1)
            kcorr = k - ((k-1)**2)/(n-1)
            result = np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
        return round(result, 6)
    
  • 2020-12-23 11:10

    Cramér's V measures the association between two categorical features in a single data set, so it fits your case.

    To calculate the Cramér's V statistic you need a confusion matrix, so the solution steps are:
    1. Filter the data for a single metric
    2. Calculate the confusion matrix
    3. Calculate the Cramér's V statistic

    Of course, you can do those steps in the loop nest provided in your post. But your opening paragraph mentions only the metrics as an outer parameter, so I am not sure you need both loops. I will provide code for steps 2-3, since the filtering is simple and, as mentioned, I am not sure exactly what you need.

    Step 2. In the code below, data is a pandas.DataFrame filtered by whatever you want in step 1.

    import numpy as np
    
    confusions = []
    for nation in list_of_nations:
        for language in list_of_languages:
            # element-wise & (not `and`) combines the two boolean Series
            cond = (data['nation'] == nation) & (data['lang'] == language)
            confusions.append(cond.sum())
    confusion_matrix = np.array(confusions).reshape(len(list_of_nations), len(list_of_languages))
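As an aside, the double loop above can be collapsed into a single pd.crosstab call (a sketch with a made-up frame standing in for the filtered data):

```python
import pandas as pd

# made-up frame standing in for the filtered data
data = pd.DataFrame({
    'nation': ['us', 'us', 'fr', 'de'],
    'lang':   ['en', 'fr', 'fr', 'de'],
})

# rows = nations, columns = languages, cells = co-occurrence counts
confusion_matrix = pd.crosstab(data['nation'], data['lang']).to_numpy()
print(confusion_matrix)
```

crosstab also guarantees that every observed nation and language appears exactly once, so no explicit category lists are needed.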
    

    Step 3. In the code below, confusion_matrix is the numpy.ndarray obtained in step 2.

    import numpy as np
    import scipy.stats as ss
    
    def cramers_stat(confusion_matrix):
        # plain (uncorrected) Cramér's V
        chi2 = ss.chi2_contingency(confusion_matrix)[0]
        n = confusion_matrix.sum()
        return np.sqrt(chi2 / (n * (min(confusion_matrix.shape) - 1)))
    
    result = cramers_stat(confusion_matrix)
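As a quick sanity check on synthetic counts (my own example, not from the answer): a perfectly associated table should give V = 1.

```python
import numpy as np
import scipy.stats as ss

def cramers_stat(confusion_matrix):
    # plain (uncorrected) Cramér's V, as defined above
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    return np.sqrt(chi2 / (n * (min(confusion_matrix.shape) - 1)))

# a perfectly associated 3x3 table: each nation maps to exactly one language
perfect = np.diag([10, 10, 10])
print(cramers_stat(perfect))
```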
    

    This code was tested on my data set, but I hope you can use it without changes in your case.
