List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?

心在旅途 2020-12-22 17:45

How do you find the top correlations in a correlation matrix with Pandas? There are many answers on how to do this with R (Show correlations as an ordered list, not as a lar

13 answers
  • 2020-12-22 18:09

    The following function should do the trick. This implementation

    • Removes self correlations
    • Removes duplicates
    • Enables the selection of top N highest correlated features

    and it is also configurable, so you can keep the self correlations, the duplicates, or both. You can also report as many feature pairs as you wish.


    def get_feature_correlation(df, top_n=None, corr_method='spearman',
                                remove_duplicates=True, remove_self_correlations=True):
        """
        Compute the feature correlation and sort feature pairs based on their correlation
    
        :param df: The dataframe with the predictor variables
        :type df: pandas.core.frame.DataFrame
        :param top_n: Top N feature pairs to be reported (if None, all of the pairs will be returned)
        :param corr_method: Correlation computation method
        :type corr_method: str
        :param remove_duplicates: Indicates whether duplicate feature pairs must be removed
        :type remove_duplicates: bool
        :param remove_self_correlations: Indicates whether self correlations will be removed
        :type remove_self_correlations: bool
    
        :return: pandas.core.frame.DataFrame
        """
        corr_matrix_abs = df.corr(method=corr_method).abs()
        corr_matrix_abs_us = corr_matrix_abs.unstack()
        sorted_correlated_features = corr_matrix_abs_us \
            .sort_values(kind="quicksort", ascending=False) \
            .reset_index()
    
        # Remove comparisons of the same feature
        if remove_self_correlations:
            sorted_correlated_features = sorted_correlated_features[
                (sorted_correlated_features.level_0 != sorted_correlated_features.level_1)
            ]
    
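        # Remove duplicates: keep the first row of each unordered pair, so that
        # (A, B) and (B, A) count as one pair. Taking every second row instead
        # only works if mirrored pairs are always adjacent, which ties can break.
        if remove_duplicates:
            seen_pairs = set()
            keep_rows = []
            for pair in zip(sorted_correlated_features.level_0,
                            sorted_correlated_features.level_1):
                key = frozenset(pair)
                keep_rows.append(key not in seen_pairs)
                seen_pairs.add(key)
            sorted_correlated_features = sorted_correlated_features.loc[keep_rows]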
    
        # Create meaningful names for the columns
        sorted_correlated_features.columns = ['Feature 1', 'Feature 2', 'Correlation (abs)']
    
        if top_n:
            return sorted_correlated_features[:top_n]
    
        return sorted_correlated_features
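
    The function above defaults to corr_method='spearman', which is passed straight to DataFrame.corr. As a rough illustration of why that choice matters, the sketch below compares Pearson and Spearman on a small made-up frame (the data and variable names are assumptions, not part of the answer) where y grows monotonically but not linearly with x:

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3

# Pearson measures linear association, Spearman measures rank (monotonic) association
pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"pearson:  {pearson:.3f}")
print(f"spearman: {spearman:.3f}")
```

    Spearman reports a perfect correlation here because the ranks of x and y agree exactly, while Pearson comes out lower, so the ranking of "top" pairs can differ between the two methods.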
    
    
  • 2020-12-22 18:11

    I tried some of the solutions here but ended up writing my own, so I'm sharing it in case it is useful to someone else. It reorders the rows and columns of the correlation matrix by their absolute correlation with the first column, strongest first:

    def sort_correlation_matrix(correlation_matrix):
        cor = correlation_matrix.abs()
        top_col = cor[cor.columns[0]][1:]
        top_col = top_col.sort_values(ascending=False)
        ordered_columns = [cor.columns[0]] + top_col.index.tolist()
        return correlation_matrix[ordered_columns].reindex(ordered_columns)
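
    A short usage sketch may help show what the reordering does. The function is repeated so the example runs on its own, and the sample frame is made up for illustration:

```python
import pandas as pd

def sort_correlation_matrix(correlation_matrix):
    # Same logic as the function above, repeated so the example is self-contained
    cor = correlation_matrix.abs()
    top_col = cor[cor.columns[0]][1:]
    top_col = top_col.sort_values(ascending=False)
    ordered_columns = [cor.columns[0]] + top_col.index.tolist()
    return correlation_matrix[ordered_columns].reindex(ordered_columns)

# Hypothetical sample data
df = pd.DataFrame({
    "x1": [1, 4, 4, 5, 6],
    "x2": [0, 0, 8, 2, 4],
    "x3": [2, 8, 8, 10, 12],
    "x4": [-1, -4, -4, -4, -5],
})

sorted_corr = sort_correlation_matrix(df.corr())
print(sorted_corr.columns.tolist())
print(sorted_corr)
```

    With this data the columns come back in the order x1, x3, x4, x2. The values keep their sign; only the ordering uses absolute values. Note that this ranks features against the first column only, so it does not list the top pairs across the whole matrix.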
    
  • 2020-12-22 18:13

    @HYRY's answer is perfect. This builds on it by adding a bit more logic to avoid duplicate and self correlations and to sort the result properly:

    import pandas as pd
    d = {'x1': [1, 4, 4, 5, 6], 
         'x2': [0, 0, 8, 2, 4], 
         'x3': [2, 8, 8, 10, 12], 
         'x4': [-1, -4, -4, -4, -5]}
    df = pd.DataFrame(data = d)
    print("Data Frame")
    print(df)
    print()
    
    print("Correlation Matrix")
    print(df.corr())
    print()
    
    def get_redundant_pairs(df):
        '''Get diagonal and lower triangular pairs of correlation matrix'''
        pairs_to_drop = set()
        cols = df.columns
        for i in range(0, df.shape[1]):
            for j in range(0, i+1):
                pairs_to_drop.add((cols[i], cols[j]))
        return pairs_to_drop
    
    def get_top_abs_correlations(df, n=5):
        au_corr = df.corr().abs().unstack()
        labels_to_drop = get_redundant_pairs(df)
        au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
        return au_corr[0:n]
    
    print("Top Absolute Correlations")
    print(get_top_abs_correlations(df, 3))
    

    That gives the following output:

    Data Frame
       x1  x2  x3  x4
    0   1   0   2  -1
    1   4   0   8  -4
    2   4   8   8  -4
    3   5   2  10  -4
    4   6   4  12  -5
    
    Correlation Matrix
              x1        x2        x3        x4
    x1  1.000000  0.399298  1.000000 -0.969248
    x2  0.399298  1.000000  0.399298 -0.472866
    x3  1.000000  0.399298  1.000000 -0.969248
    x4 -0.969248 -0.472866 -0.969248  1.000000
    
    Top Absolute Correlations
    x1  x3    1.000000
    x3  x4    0.969248
    x1  x4    0.969248
    dtype: float64
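
    For a wide frame, building a Python set of every redundant pair grows quadratically with the number of columns. One possible alternative, sketched here under the assumption that NumPy is available, is to pick the strict upper triangle by position with np.triu_indices. The function name top_abs_correlations_triu is hypothetical and not part of the answer above:

```python
import numpy as np
import pandas as pd

def top_abs_correlations_triu(df, n=5):
    """Hypothetical variant: select the strict upper triangle by position."""
    corr = df.corr().abs()
    # Row/column positions of the entries strictly above the diagonal
    rows, cols = np.triu_indices(corr.shape[0], k=1)
    pairs = pd.Series(
        corr.to_numpy()[rows, cols],
        index=pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]]),
    )
    return pairs.sort_values(ascending=False).head(n)

df = pd.DataFrame({
    "x1": [1, 4, 4, 5, 6],
    "x2": [0, 0, 8, 2, 4],
    "x3": [2, 8, 8, 10, 12],
    "x4": [-1, -4, -4, -4, -5],
})

top3 = top_abs_correlations_triu(df, 3)
print(top3)
```

    This should select the same three pairs as the output above, although the order of the two tied pairs (x1, x4) and (x3, x4) is not guaranteed.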
    
  • 2020-12-22 18:16

    Use the code below to view the correlations in descending order.

    # See the correlations in descending order
    
    corr = df.corr() # df is the pandas dataframe
    c1 = corr.abs().unstack()
    c1.sort_values(ascending = False)
    
  • 2020-12-22 18:17

    Lots of good answers here. The easiest way I found was a combination of some of the answers above.

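    import numpy as np

    corr = df.corr()  # df is the pandas DataFrame
    # Keep only the strict upper triangle: each pair once, diagonal masked to NaN
    corr = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    # Flatten to a Series, drop the masked entries, and sort
    corr = corr.unstack().dropna().sort_values(ascending=False)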
    
  • 2020-12-22 18:21

    I liked Addison Klinke's post the most for its simplicity, but used Wojciech Moszczyńsk’s suggestion for filtering and charting. I extended the filter so that it keeps the signed values instead of taking absolute values. So, given a large correlation matrix: filter it, chart it, and then flatten it:

    Created, Filtered and Charted

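    import matplotlib.pyplot as plt
    import seaborn as sn

    dfCorr = df.corr()  # df is the pandas DataFrame
    # Keep entries with |r| >= 0.5 and mask entries equal to 1 (the diagonal); the rest become NaN
    filteredDf = dfCorr[((dfCorr >= .5) | (dfCorr <= -.5)) & (dfCorr != 1.000)]
    plt.figure(figsize=(30, 10))
    sn.heatmap(filteredDf, annot=True, cmap="Reds")
    plt.show()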
    

    Function

    In the end, I created a small function to create the correlation matrix, filter it, and then flatten it. It could easily be extended, for example with asymmetric upper and lower bounds.

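    import numpy as np
    import pandas as pd

    def corrFilter(x: pd.DataFrame, bound: float):
        xCorr = x.corr()
        # Mask the diagonal and lower triangle so each pair is listed once
        # (drop_duplicates() on the values would also discard distinct pairs with equal correlations)
        xUpper = xCorr.where(np.triu(np.ones(xCorr.shape, dtype=bool), k=1))
        xFiltered = xUpper[(xUpper >= bound) | (xUpper <= -bound)]
        xFlattened = xFiltered.unstack().dropna().sort_values()
        return xFlattened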
    
    corrFilter(df, .7)
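
    Because this approach keeps the sign, a plain sort puts the strong negative correlations at one end and the strong positive ones at the other. If you would rather rank by strength while still showing the sign, one option is the key argument of Series.sort_values (assuming pandas 1.1 or later). The sketch below is illustrative only; the sample frame and the 0.4 bound are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "x1": [1, 4, 4, 5, 6],
    "x2": [0, 0, 8, 2, 4],
    "x3": [2, 8, 8, 10, 12],
    "x4": [-1, -4, -4, -4, -5],
})

corr = df.corr()
# Strict upper triangle only, so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
flat = upper.unstack().dropna()

# Keep pairs with |r| >= 0.4, then order by magnitude while retaining the sign
bound = 0.4
strong = flat[flat.abs() >= bound]
ranked = strong.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

    Here the negative correlations with x4 stay negative in the result but are ranked by how strong they are.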
    
