numpy corrcoef - compute correlation matrix while ignoring missing data

故里飘歌 2020-12-29 06:15

I am trying to compute a correlation matrix of several values. These values include some 'nan' values. I'm using numpy.corrcoef. For element(i,j) of the output correla

3 answers
  •  小鲜肉 (OP)
    2020-12-29 06:57

    One of the main features of pandas is that it is NaN-friendly. To calculate the correlation matrix, simply call df_counties.corr(), which excludes missing values pair by pair when computing each coefficient. The example below demonstrates that df.corr() is NaN-tolerant whereas np.corrcoef is not.

    import pandas as pd
    import numpy as np
    
    # data
    # ==============================
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(100,5), columns=list('ABCDE'))
    df[df < 0] = np.nan
    df
    
             A       B       C       D       E
    0   1.7641  0.4002  0.9787  2.2409  1.8676
    1      NaN  0.9501     NaN     NaN  0.4106
    2   0.1440  1.4543  0.7610  0.1217  0.4439
    3   0.3337  1.4941     NaN  0.3131     NaN
    4      NaN  0.6536  0.8644     NaN  2.2698
    5      NaN  0.0458     NaN  1.5328  1.4694
    6   0.1549  0.3782     NaN     NaN     NaN
    7   0.1563  1.2303  1.2024     NaN     NaN
    8      NaN     NaN     NaN  1.9508     NaN
    9      NaN     NaN  0.7775     NaN     NaN
    ..     ...     ...     ...     ...     ...
    90     NaN  0.8202  0.4631  0.2791  0.3389
    91  2.0210     NaN     NaN  0.1993     NaN
    92     NaN     NaN     NaN  0.1813     NaN
    93  2.4125     NaN     NaN     NaN  0.2515
    94     NaN     NaN     NaN     NaN  1.7389
    95  0.9944  1.3191     NaN  1.1286  0.4960
    96  0.7714  1.0294     NaN     NaN  0.8626
    97     NaN  1.5133  0.5531     NaN  0.2205
    98     NaN     NaN  1.1003  1.2980  2.6962
    99     NaN     NaN     NaN     NaN     NaN
    
    [100 rows x 5 columns]
    
    # calculations
    # ================================
    df.corr()
    
            A       B       C       D       E
    A  1.0000  0.2718  0.2678  0.2822  0.1016
    B  0.2718  1.0000 -0.0692  0.1736 -0.1432
    C  0.2678 -0.0692  1.0000 -0.3392  0.0012
    D  0.2822  0.1736 -0.3392  1.0000  0.1562
    E  0.1016 -0.1432  0.0012  0.1562  1.0000
    
    
    np.corrcoef(df, rowvar=False)
    
    array([[ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan]])
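
If pulling in pandas is not an option, the same pairwise-complete idea can be sketched in plain NumPy: for each pair of columns, keep only the rows where both values are present and call np.corrcoef on that subset. The helper name pairwise_corr is hypothetical (it is not a NumPy function), and the sketch assumes a 2-D float array with variables in columns. On the same data it should agree with df.corr(), though that is worth verifying on your own input.

```python
import numpy as np

def pairwise_corr(data):
    """Pearson correlation between columns, using for each pair only
    the rows where both columns are non-NaN (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    n_cols = data.shape[1]
    out = np.full((n_cols, n_cols), np.nan)
    for i in range(n_cols):
        for j in range(i, n_cols):
            # Rows where both column i and column j have a value
            valid = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
            if valid.sum() < 2:
                continue  # too few overlapping rows: leave NaN
            x, y = data[valid, i], data[valid, j]
            if x.std() == 0 or y.std() == 0:
                continue  # constant column: correlation undefined
            out[i, j] = out[j, i] = np.corrcoef(x, y)[0, 1]
    return out

# Example data built like the frame above: negatives replaced by NaN
rng = np.random.default_rng(0)
values = rng.standard_normal((100, 5))
values[values < 0] = np.nan

corr = pairwise_corr(values)
print(np.round(corr, 4))
```

The double loop is fine for a handful of columns. For wide data, df.corr() is likely the more practical choice.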
    
