List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?

心在旅途 2020-12-22 17:45

How do you find the top correlations in a correlation matrix with Pandas? There are many answers on how to do this with R (Show correlations as an ordered list, not as a lar

13 Answers
  • 2020-12-22 18:24

    You can use DataFrame.values to get a NumPy array of the data and then use NumPy functions such as argsort() to find the most correlated pairs.

    But if you want to do this in pandas, you can unstack and sort the DataFrame:

    import pandas as pd
    import numpy as np
    
    shape = (50, 4460)
    
    data = np.random.normal(size=shape)
    
    data[:, 1000] += data[:, 2000]
    
    df = pd.DataFrame(data)
    
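    # absolute correlation matrix (4460 x 4460)
    c = df.corr().abs()
    
    # flatten to a Series indexed by pairs of column labels, then sort ascending
    s = c.unstack()
    so = s.sort_values(kind="quicksort")
    
    # the last 4460 entries are the self-correlations (the diagonal), so skip them
    print(so[-4470:-4460])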
    

    Here is the output:

    2192  1522    0.636198
    1522  2192    0.636198
    3677  2027    0.641817
    2027  3677    0.641817
    242   130     0.646760
    130   242     0.646760
    1171  2733    0.670048
    2733  1171    0.670048
    1000  2000    0.742340
    2000  1000    0.742340
    dtype: float64
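
    The NumPy route mentioned at the start of this answer is not spelled out there. A minimal sketch of what it might look like follows; the helper name top_pairs_numpy and the small planted test data are hypothetical, and the code assumes the correlation matrix contains no NaN values (for example, no constant columns).

```python
import numpy as np
import pandas as pd


def top_pairs_numpy(df: pd.DataFrame, n: int = 5):
    """Return the n most correlated column pairs as (col_a, col_b, |corr|).

    Hypothetical helper; assumes the correlation matrix has no NaN entries.
    """
    corr = df.corr().abs().values                 # plain NumPy array
    rows, cols = np.triu_indices_from(corr, k=1)  # upper triangle, no diagonal
    vals = corr[rows, cols]
    order = np.argsort(vals)[::-1][:n]            # positions of the n largest values
    return [(df.columns[rows[i]], df.columns[cols[i]], vals[i]) for i in order]


# Small illustrative data set with one strongly correlated pair planted in it
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 20))
data[:, 3] = data[:, 7] + 0.01 * rng.normal(size=50)

result = top_pairs_numpy(pd.DataFrame(data), n=3)
print(result)
```

    Working on the upper triangle only means each pair appears once and the diagonal never has to be sliced off afterwards.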
    
  • 2020-12-22 18:24

    Use itertools.combinations to get all unique correlation pairs from pandas' own correlation matrix (.corr()), build a list of lists, and feed it back into a DataFrame so that you can use .sort_values. Set ascending=True to display the lowest correlations on top.

    corrank takes a DataFrame as argument because it requires .corr().

      import itertools
      import pandas as pd

      def corrank(X: pd.DataFrame):
            corr = X.corr()  # compute the correlation matrix only once
            # one row per unique column pair (i, j) with its correlation
            df = pd.DataFrame([[(i, j), corr.loc[i, j]] for i, j in itertools.combinations(corr.columns, 2)],
                              columns=['pairs', 'corr'])
            print(df.sort_values(by='corr', ascending=False))
    
      corrank(X) # prints a descending list of correlation pairs (max on top)
    
  • 2020-12-22 18:24

    I didn't want to unstack or over-complicate this issue, since I just wanted to drop some highly correlated features as part of a feature selection phase.

    So I ended up with the following simplified solution:

    # map features to their absolute correlation values
    corr = features.corr().abs()
    
    # set equality (self correlation) as zero
    corr[corr == 1] = 0
    
    # for each feature, find the max correlation
    # and sort the resulting Series in descending order
    corr_cols = corr.max().sort_values(ascending=False)
    
    # display the highly correlated features
    display(corr_cols[corr_cols > 0.8])
    

    In this case, if you want to drop correlated features, you may map through the filtered corr_cols array and remove the odd-indexed (or even-indexed) ones.
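
    For the dropping step itself, one possible sketch is shown below. It swaps in a different technique from the odd/even-index idea above: it looks only at the upper triangle of the correlation matrix and drops the later column of each highly correlated pair. The function name drop_correlated, the 0.8 threshold, and the sample data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_correlated(features: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Drop the later column of every pair whose |corr| exceeds threshold.

    Hypothetical helper, not the method from the answer above.
    """
    corr = features.corr().abs()
    # keep only the entries above the diagonal; everything else becomes NaN
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


# Illustrative data: "y" is almost a copy of "x", "z" is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
features = pd.DataFrame({
    "x": x,
    "y": x + 0.01 * rng.normal(size=200),
    "z": rng.normal(size=200),
})

reduced = drop_correlated(features, threshold=0.8)
print(list(reduced.columns))
```

    Because only the upper triangle is inspected, the first column of a correlated pair is always kept, so the result depends on column order.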

  • 2020-12-22 18:29

    Combining some features of @HYRY and @arun's answers, you can print the top correlations for dataframe df in a single line using:

    df.corr().unstack().sort_values().drop_duplicates()
    

    Note: one downside is that drop_duplicates() compares values only and collapses all entries with the same value into one, so a 1.0 correlation between two different variables can be dropped along with the self-correlations.
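
    One way to sidestep that downside is to deduplicate by label instead of by value. The sketch below is an assumption-laden illustration, not part of the original one-liner: the frame demo and its columns are made up, and it assumes the column labels are mutually comparable (all strings or all numbers).

```python
import numpy as np
import pandas as pd

# Illustrative data: "b" is perfectly correlated with "a"
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=100), "c": rng.normal(size=100)})
demo["b"] = demo["a"] * 2.0

s = demo.corr().unstack()
first = s.index.get_level_values(0)
second = s.index.get_level_values(1)

# Keep each unordered pair once by comparing labels rather than values;
# this also removes the self-correlations, where both labels are equal.
pairs = s[first < second].sort_values(ascending=False)
print(pairs)
```

    Here the ("a", "b") pair survives even though its correlation equals the self-correlation value, because no value-based deduplication is involved.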

  • 2020-12-22 18:33

    A few-line solution without redundant pairs of variables:

    corr_matrix = df.corr().abs()
    
    # the matrix is symmetric, so extract only the upper triangle without the diagonal (k=1)
    # (the builtin bool is used because the np.bool alias was removed in NumPy 1.24)
    
    sol = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                      .stack()
                      .sort_values(ascending=False))
    
    #first element of sol series is the pair with the biggest correlation
    

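    Then you can iterate over the variable pairs (the pandas.Series has a MultiIndex, so each index is a tuple of two names) and their values like this:

    for index, value in sol.items():
      # do some stuff; index is a (name_a, name_b) tuple, value is the correlation
      print(index, value)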
    
  • 2020-12-22 18:33

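    You can also do this graphically with the following simple code, substituting your own data.

    import matplotlib.pyplot as plt
    import seaborn as sns
    
    corr = df.corr()
    
    # keep only correlations >= 0.9; everything else becomes NaN and is left blank
    kot = corr[corr>=.9]
    plt.figure(figsize=(12,8))
    sns.heatmap(kot, cmap="Greens")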
    
