How to calculate p-values for pairwise correlation of columns in Pandas?

执念已碎 2021-02-09 11:01

Pandas has a very handy function for pairwise correlation of columns, df.corr(). That means it is possible to compare correlations between columns of any length. For in

4 Answers
  • 2021-02-09 11:35

    Probably just loop. It's basically what pandas does in the source code to generate the correlation matrix anyway:

    import pandas as pd
    import numpy as np
    from scipy import stats
    
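    # df is your numeric DataFrame (assumed to have no missing values)
    cols = df.columns
    df_corr = pd.DataFrame(index=cols, columns=cols, dtype=float)  # Correlation matrix
    df_p = pd.DataFrame(index=cols, columns=cols, dtype=float)  # Matrix of p-values
    for x in cols:
        for y in cols:
            r, p = stats.pearsonr(df[x], df[y])
            df_corr.loc[x, y] = r
            df_p.loc[x, y] = p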
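
    The question mentions columns of different lengths, which in a DataFrame usually means missing values. df.corr() drops missing values pair by pair, while pearsonr does not drop them for you. A minimal sketch of the same loop with pairwise dropping, assuming Pearson correlation is wanted; corr_with_pvalues and the toy data are hypothetical names made up for this illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats


def corr_with_pvalues(df):
    """Return (r, p) DataFrames, using only rows where both columns are present."""
    cols = df.columns
    r = pd.DataFrame(np.nan, index=cols, columns=cols)
    p = pd.DataFrame(np.nan, index=cols, columns=cols)
    for x in cols:
        for y in cols:
            ok = df[x].notna() & df[y].notna()  # rows usable for this pair
            if ok.sum() < 2:
                continue  # not enough data; leave NaN
            r.loc[x, y], p.loc[x, y] = stats.pearsonr(df.loc[ok, x], df.loc[ok, y])
    return r, p


# Toy data (made up): columns "b" and "c" each miss one value
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 1.0, 4.0, 3.0, np.nan],
    "c": [5.0, 3.0, np.nan, 2.0, 1.0],
})
r, p = corr_with_pvalues(df)
```

    With this masking, r should agree with df.corr(), and p holds the matching two-sided p-values.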
    

    The matrix is symmetric, so you only need to compute the upper triangle (roughly half of the pairs) and mirror it:

    mat = df.values.T  # one row per column of df
    K = len(df.columns)
    correl = np.empty((K, K), dtype=float)
    p_vals = np.empty((K, K), dtype=float)
    
    for i, ac in enumerate(mat):
        for j, bc in enumerate(mat):
            if i > j:
                continue  # lower triangle is filled by mirroring below
            corr = stats.pearsonr(ac, bc)
            #corr = stats.kendalltau(ac, bc)
    
            correl[i, j] = correl[j, i] = corr[0]
            p_vals[i, j] = p_vals[j, i] = corr[1]
    
    # keep the column names as row and column labels
    df_p = pd.DataFrame(p_vals, index=df.columns, columns=df.columns)
    df_corr = pd.DataFrame(correl, index=df.columns, columns=df.columns)
    #pd.concat([df_corr, df_p], keys=['corr', 'p_val'])
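
    If the frame has no missing values, the Pearson p-values can probably be obtained without a Python loop at all, because the two-sided p-value follows from r and the sample size n through the t statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. This is a different technique from the loop above, sketched under those assumptions; pearson_pvalues is a hypothetical helper name and the random data is only for illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats


def pearson_pvalues(df):
    """Two-sided Pearson p-values for every column pair (assumes no NaNs)."""
    dof = len(df) - 2  # degrees of freedom of the t test
    r = df.corr()
    rv = np.clip(r.to_numpy(), -1.0, 1.0)  # guard against rounding past +/-1
    with np.errstate(divide="ignore"):
        # r = +/-1 gives t = +/-inf, which maps to p = 0
        t = rv * np.sqrt(dof / (1.0 - rv**2))
    p = 2.0 * stats.t.sf(np.abs(t), dof)
    return pd.DataFrame(p, index=r.index, columns=r.columns)


# Illustration on random data (made up)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 4)), columns=list("abcd"))
p_fast = pearson_pvalues(df)
```

    The result should match what stats.pearsonr reports pair by pair; with missing values the sample size differs per pair, so the loop is the safer choice there.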
    
  • 2021-02-09 11:42

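    This gives one row per pair of columns, with the coefficient and the p-value:

    from itertools import combinations
    import pandas as pd
    from scipy.stats import pearsonr
    
    # every unordered pair of columns of df
    pairs = list(combinations(df.columns, 2))
    
    # pearsonr returns (coefficient, p-value) for each pair
    df_result = pd.DataFrame(
        [pearsonr(df[a], df[b])[:2] for a, b in pairs],
        index=pd.MultiIndex.from_tuples(pairs),
        columns=['Correlation_coefficent', 'P-value'],
    )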
    
  • 2021-02-09 11:50

    Does this work for you?

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy.stats import pearsonr
    
    #call the correlation function on your data (df), you could round the values if needed
    rho = df.corr()
    df_c = rho.round(1)
    #get the p values; corr() puts 1 on the diagonal, so subtract the identity to zero it
    pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*rho.shape)
    #set the significance stars: *** for p <= 0.001, ** for p <= 0.01, * for p <= 0.05
    #(applymap is deprecated in favor of DataFrame.map from pandas 2.1 on)
    p = pval.applymap(lambda x: ''.join(['*' for t in [0.001,0.01,0.05] if x<=t]))
    #df_c2 below will give you the dataframe with correlation coefficients and significance stars
    df_c2 = df_c.astype(str) + p
    
    #you could also plot the correlation matrix using sns.heatmap if you want
    #mask the upper triangle (including the diagonal) so only the lower triangle is drawn
    matrix = np.triu(np.ones_like(rho, dtype=bool))
    #convert the labels to an array for the heatmap annotations
    df_c3 = df_c2.to_numpy()
    
    #plot the heatmap
    plt.figure(figsize=(13,8))
    sns.heatmap(df_c, annot = df_c3, fmt='', vmin=-1, vmax=1, center= 0, cmap= 'coolwarm', mask = matrix)
    
  • 2021-02-09 11:53

    Why not use the "method" argument of pandas.DataFrame.corr()? It accepts:

    • pearson : standard correlation coefficient.
    • kendall : Kendall Tau correlation coefficient.
    • spearman : Spearman rank correlation.
    • callable: callable with input two 1d ndarrays and returning a float.

    Since a callable is allowed, you can pass a function that returns the p-value instead of the coefficient:

    from scipy.stats import kendalltau, pearsonr, spearmanr
    
    def kendall_pval(x,y):
        return kendalltau(x,y)[1]
    
    def pearsonr_pval(x,y):
        return pearsonr(x,y)[1]
    
    def spearmanr_pval(x,y):
        return spearmanr(x,y)[1]
    

    and then

    corr = df.corr(method=pearsonr_pval)
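
    One caveat with this approach: corr() fills the diagonal with 1.0 without calling the function there, so the diagonal entries are not real p-values. A small sketch on made-up random data that shows this and overwrites the diagonal with 0; the variable names are only illustrative:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def pearsonr_pval(x, y):
    return pearsonr(x, y)[1]


# Made-up data, only to illustrate the behaviour
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

pvals = df.corr(method=pearsonr_pval)  # diagonal is 1.0 here

# Work on a copy of the values and set the diagonal to 0
arr = pvals.to_numpy(copy=True)
np.fill_diagonal(arr, 0.0)
pvals_fixed = pd.DataFrame(arr, index=pvals.index, columns=pvals.columns)
```

    The off-diagonal entries are untouched, so pvals_fixed.loc["a", "b"] is still the p-value that pearsonr gives for those two columns.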
    