Question
I have two CSV files with hundreds of columns, and I want to calculate the Pearson correlation coefficient and p-value for each pair of same-numbered columns across the two files. The problem is that when a column contains missing data ("NaN"), I get an error. When .dropna removes the NaN values from a column, the shapes of X and Y are sometimes no longer equal (depending on how many NaN values were removed), and I receive this error:
"ValueError: operands could not be broadcast together with shapes (1020,) (1016,)"
Question: if row #8 in one CSV is "nan", is there any way to remove the same row from the other CSV too, so that for every column the analysis is based only on rows that have values in both CSV files?
import pandas as pd
import scipy
import csv
import numpy as np
from scipy import stats
df = pd.read_csv("D:/Insitu-Daily.csv", header=None)
dg = pd.read_csv("D:/Model-Daily.csv", header=None)
pearson_corr_set = []
pearson_p_set = []

for i in range(1, df.shape[1]):
    X = df[i].dropna(axis=0, how='any')
    Y = dg[i].dropna(axis=0, how='any')
    [pearson_corr, pearson_p] = scipy.stats.stats.pearsonr(X, Y)
    pearson_corr_set = np.append(pearson_corr_set, pearson_corr)
    pearson_p_set = np.append(pearson_p_set, pearson_p)

with open('D:/Results.csv', 'wb') as file:
    str1 = ",".join(str(i) for i in np.asarray(pearson_corr_set))
    file.write(str1)
    file.write('\n')
    str1 = ",".join(str(i) for i in np.asarray(pearson_p_set))
    file.write(str1)
    file.write('\n')
Answer 1:
Here is one solution. First compute a mask that marks the "bad" (NaN) positions across your two NumPy arrays, then use it to keep only the positions where both arrays have values.
x = np.array([5, 1, 6, 9, 10, np.nan, 1, 1, np.nan])
y = np.array([4, 4, 5, np.nan, 6, 2, 1, 8, 1])
bad = ~np.logical_or(np.isnan(x), np.isnan(y))  # True where BOTH x and y are valid (not NaN)
np.compress(bad, x) # array([ 5., 1., 6., 10., 1., 1.])
np.compress(bad, y) # array([ 4., 4., 5., 6., 1., 8.])
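To feed the filtered arrays straight into pearsonr, boolean indexing with the same mask works as well (a minimal sketch using the toy arrays above; not part of the original answer):
from scipy import stats
r, p = stats.pearsonr(x[bad], y[bad])  # equivalent to compressing first, then correlating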
Answer 2:
Instead of dropna, try using isnan and boolean indexing:
for i in range(1, df.shape[1]):
    df_sub = df[i]
    dg_sub = dg[i]
    mask = ~np.isnan(df_sub) & ~np.isnan(dg_sub)
    # mask is now True where the ith rows of df and dg are both NOT NaN.
    X = df_sub[mask]  # this returns a 1D array of length mask.sum()
    Y = dg_sub[mask]
    # ... your code continues.
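For completeness, the continuation inside the loop could look like this (a sketch mirroring the question's code, not part of the original answer):
    pearson_corr, pearson_p = stats.pearsonr(X, Y)
    pearson_corr_set = np.append(pearson_corr_set, pearson_corr)
    pearson_p_set = np.append(pearson_p_set, pearson_p)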
Hope that helps!
Answer 3:
Why not combine them into one single DataFrame and just use dropna on it? Every row that has a NaN in either file will be removed.
newdf = pd.concat([df, dg], axis=1, sort=False)
newdf = newdf.dropna()
I suggest getting a list of the column names of both DataFrames and using those in the for loop:
dfnames = list(df.columns.values)
dgnames = list(dg.columns.values)

for i in range(len(dfnames)):
    X = newdf[dfnames[i]].dropna(axis=0, how='any')
    Y = newdf[dgnames[i]].dropna(axis=0, how='any')
    [pearson_corr, pearson_p] = scipy.stats.stats.pearsonr(X, Y)
    pearson_corr_set = np.append(pearson_corr_set, pearson_corr)
    pearson_p_set = np.append(pearson_p_set, pearson_p)
Also, you can write the results to CSV without that manual string joining; see pandas.DataFrame.to_csv.
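For example, the two result rows could be written with DataFrame.to_csv (a sketch only; the row labels are an illustrative assumption):
results = pd.DataFrame([pearson_corr_set, pearson_p_set],
                       index=['pearson_corr', 'pearson_p'])  # two rows: correlations and p-values
results.to_csv('D:/Results.csv', header=False)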
Source: https://stackoverflow.com/questions/48591713/pearson-correlation-and-nan-values