How to calculate p-values for pairwise correlation of columns in Pandas?

Pandas has a very handy function for pairwise correlation of columns, DataFrame.corr(). That means it is possible to compare correlations between columns of any length. For in

4 Answers

自闭症患者

2021-02-09 11:35

Probably just loop. It's basically what pandas does in the source code to generate the correlation matrix anyway:

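import pandas as pd
import numpy as np
from scipy import stats

# df is assumed to be an existing DataFrame of numeric columns without NaNs
cols = df.columns
df_corr = pd.DataFrame(index=cols, columns=cols, dtype=float)  # Correlation matrix
df_p = pd.DataFrame(index=cols, columns=cols, dtype=float)  # Matrix of p-values
for x in cols:
    for y in cols:
        r, p = stats.pearsonr(df[x], df[y])
        df_corr.loc[x, y] = r
        df_p.loc[x, y] = p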

The matrix is symmetric, so you only need to compute roughly half of the pairs and mirror each result:

mat = df.values.T  # one row per column of df
K = len(df.columns)
correl = np.empty((K, K), dtype=float)
p_vals = np.empty((K, K), dtype=float)

for i, ac in enumerate(mat):
    for j, bc in enumerate(mat):
        if i > j:
            continue  # the lower triangle is filled by symmetry below
        r, p = stats.pearsonr(ac, bc)
        # r, p = stats.kendalltau(ac, bc)

        correl[i, j] = correl[j, i] = r
        p_vals[i, j] = p_vals[j, i] = p

# Label rows and columns with the original column names
df_p = pd.DataFrame(p_vals, index=df.columns, columns=df.columns)
df_corr = pd.DataFrame(correl, index=df.columns, columns=df.columns)
# pd.concat([df_corr, df_p], keys=['corr', 'p_val'])
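
If the nested loops become slow for many columns, the Pearson p-values can also be derived from the correlation matrix in one step. The following is a minimal sketch, assuming complete (NaN-free) numeric data and the usual two-sided t-test with n - 2 degrees of freedom. The function name corr_pvalues and the sample data are hypothetical, made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def corr_pvalues(df):
    """Return (r, p) DataFrames for all column pairs of a numeric, NaN-free df."""
    r = df.corr()
    rv = r.to_numpy()
    n = len(df)
    # t statistic for each coefficient; the diagonal (r == 1) divides by zero,
    # so warnings are silenced here and the diagonal is overwritten below
    with np.errstate(divide="ignore", invalid="ignore"):
        t = rv * np.sqrt((n - 2) / (1 - rv**2))
    pv = 2 * stats.t.sf(np.abs(t), n - 2)  # two-sided p-value
    np.fill_diagonal(pv, 0.0)  # a column is perfectly correlated with itself
    p = pd.DataFrame(pv, index=r.index, columns=r.columns)
    return r, p


# Synthetic data, for illustration only
rng = np.random.default_rng(1)
sample = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
r, p = corr_pvalues(sample)
print(p.round(4))
```

Off the diagonal this should agree with stats.pearsonr up to floating-point error. With missing values the effective n differs per pair, so the loop versions above are the safer choice there.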

栀梦

2021-02-09 11:42

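This builds a long table with one row per pair of columns:

import pandas as pd
from itertools import combinations
from scipy.stats import pearsonr

column_values = df.columns.tolist()

# One row per unordered pair of columns: both names, coefficient, p-value
rows = [(a, b, *pearsonr(df[a], df[b])) for a, b in combinations(column_values, 2)]
df_result = pd.DataFrame(rows, columns=['Column_1', 'Column_2', 'Correlation_coefficient', 'P-value'])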

有刺的猬

2021-02-09 11:50

Does this work for you?

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

# df_c is the original data; compute the correlation matrix, rounding if needed
rho = df_c.corr().round(1)
# get the p-values; corr() puts 1 on the diagonal, so subtract the identity
pval = df_c.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*rho.shape)
# mark the p-values: *** for <= 0.001, ** for <= 0.01, * for <= 0.05
# (on pandas >= 2.1, DataFrame.map replaces the deprecated applymap)
p = pval.applymap(lambda x: ''.join(['*' for t in [0.001, 0.01, 0.05] if x <= t]))
# df_c2 holds the correlation coefficients with their significance stars
df_c2 = rho.astype(str) + p

# optionally plot the correlation matrix with sns.heatmap
# mask the upper triangle (including the diagonal)
matrix = np.triu(np.ones_like(rho, dtype=bool))
# convert the annotations to an array for the heatmap
df_c3 = df_c2.to_numpy()

# plot the heatmap
plt.figure(figsize=(13, 8))
sns.heatmap(rho, annot=df_c3, fmt='', vmin=-1, vmax=1, center=0, cmap='coolwarm', mask=matrix)
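
The star annotation used above can be checked in isolation. This is a small sketch of the same threshold logic; the helper name significance_stars is hypothetical and not part of pandas or SciPy.

```python
def significance_stars(p, thresholds=(0.001, 0.01, 0.05)):
    """Return one '*' for each threshold that p does not exceed."""
    return ''.join('*' for t in thresholds if p <= t)


# A few hand-picked p-values, for illustration only
for p_value in (0.0004, 0.003, 0.03, 0.2):
    print(p_value, repr(significance_stars(p_value)))
```

A p-value of 0.0004 clears all three thresholds and gets three stars, while 0.2 clears none and gets an empty string.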

栀梦

2021-02-09 11:53
Why not use the "method" argument of pandas.DataFrame.corr()? It accepts:
- pearson : standard correlation coefficient.
- kendall : Kendall Tau correlation coefficient.
- spearman : Spearman rank correlation.
- callable: callable with input two 1d ndarrays and returning a float.
```
from scipy.stats import kendalltau, pearsonr, spearmanr

def kendall_pval(x, y):
    return kendalltau(x, y)[1]

def pearsonr_pval(x, y):
    return pearsonr(x, y)[1]

def spearmanr_pval(x, y):
    return spearmanr(x, y)[1]
```
and then
```
corr = df.corr(method=pearsonr_pval)
```
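
One caveat with this approach: corr() always writes 1.0 on the diagonal, whatever the callable returns, so the diagonal of the result is not a real p-value. A short sketch on synthetic data (the DataFrame and its column names are made up for illustration) shows the call and one way to zero the diagonal:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def pearsonr_pval(x, y):
    return pearsonr(x, y)[1]


# Synthetic data, for illustration only
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df["c"] = df["a"] + 0.5 * df["c"]  # make columns a and c strongly correlated

pvals = df.corr(method=pearsonr_pval)
# Subtract the identity so the diagonal reads 0 instead of the placeholder 1.0
pvals_fixed = pvals - np.eye(len(pvals))
print(pvals_fixed.round(4))
```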