numpy corrcoef - compute correlation matrix while ignoring missing data

故里飘歌 2020-12-29 06:15

I am trying to compute a correlation matrix of several values. These values include some 'nan' values. I'm using numpy.corrcoef. For element (i, j) of the output correla

3 answers
  • 2020-12-29 06:50

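    This works with NumPy's masked array module, numpy.ma: masked_invalid masks the NaN entries, and ma.corrcoef leaves masked values out of the calculation.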

    import numpy as np
    import numpy.ma as ma
    
    # np.nan is used instead of np.NaN, which was removed in NumPy 2.0
    A = [1, 2, 3, 4, 5, np.nan]
    B = [2, 3, 4, 5.25, np.nan, 100]
    
    print(ma.corrcoef(ma.masked_invalid(A), ma.masked_invalid(B)))
    

    It outputs:

    [[1.0 0.99838143945703]
     [0.99838143945703 1.0]]
    

    Read more here: https://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html
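
As a small illustration of what ma.masked_invalid does before the data reaches ma.corrcoef, the sketch below masks the NaN and infinite entries of a made-up list and shows that reductions such as count and mean skip them. The `values` list is an assumption for illustration only and is not taken from the answer.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical sample data containing one NaN and one infinity
values = [1.0, np.nan, 3.0, np.inf]

# masked_invalid masks every entry that is NaN or +/-inf
masked = ma.masked_invalid(values)

# getmaskarray always returns a full boolean array, one flag per element
mask = ma.getmaskarray(masked)

print(mask)            # True marks the entries that will be ignored
print(masked.count())  # number of unmasked (valid) entries
print(masked.mean())   # mean over the valid entries only
```

Only the two finite entries take part in the reductions, which is the same mechanism that lets ma.corrcoef leave the NaN positions out.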

  • 2020-12-29 06:57

    One of the main features of pandas is that it is NaN-friendly. To calculate the correlation matrix, simply call df_counties.corr(); it excludes NaN values pair by pair. The example below demonstrates that df.corr() tolerates NaN whereas np.corrcoef does not.

    import pandas as pd
    import numpy as np
    
    # data
    # ==============================
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(100,5), columns=list('ABCDE'))
    df[df < 0] = np.nan
    df
    
             A       B       C       D       E
    0   1.7641  0.4002  0.9787  2.2409  1.8676
    1      NaN  0.9501     NaN     NaN  0.4106
    2   0.1440  1.4543  0.7610  0.1217  0.4439
    3   0.3337  1.4941     NaN  0.3131     NaN
    4      NaN  0.6536  0.8644     NaN  2.2698
    5      NaN  0.0458     NaN  1.5328  1.4694
    6   0.1549  0.3782     NaN     NaN     NaN
    7   0.1563  1.2303  1.2024     NaN     NaN
    8      NaN     NaN     NaN  1.9508     NaN
    9      NaN     NaN  0.7775     NaN     NaN
    ..     ...     ...     ...     ...     ...
    90     NaN  0.8202  0.4631  0.2791  0.3389
    91  2.0210     NaN     NaN  0.1993     NaN
    92     NaN     NaN     NaN  0.1813     NaN
    93  2.4125     NaN     NaN     NaN  0.2515
    94     NaN     NaN     NaN     NaN  1.7389
    95  0.9944  1.3191     NaN  1.1286  0.4960
    96  0.7714  1.0294     NaN     NaN  0.8626
    97     NaN  1.5133  0.5531     NaN  0.2205
    98     NaN     NaN  1.1003  1.2980  2.6962
    99     NaN     NaN     NaN     NaN     NaN
    
    [100 rows x 5 columns]
    
    # calculations
    # ================================
    df.corr()
    
            A       B       C       D       E
    A  1.0000  0.2718  0.2678  0.2822  0.1016
    B  0.2718  1.0000 -0.0692  0.1736 -0.1432
    C  0.2678 -0.0692  1.0000 -0.3392  0.0012
    D  0.2822  0.1736 -0.3392  1.0000  0.1562
    E  0.1016 -0.1432  0.0012  0.1562  1.0000
    
    
    np.corrcoef(df, rowvar=False)
    
    array([[ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan]])
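
If you would rather stay in NumPy for a whole matrix, the same pairwise exclusion can be sketched by hand: for each pair of columns, keep only the rows where both values are present and pass those to np.corrcoef. The function name `pairwise_nan_corr`, the sample array and the choice to leave NaN when fewer than two complete pairs exist are assumptions made for this sketch, not part of the answer above.

```python
import numpy as np

def pairwise_nan_corr(data):
    """Correlation matrix of the columns of `data`, dropping NaNs pair by pair.

    Hypothetical helper: entries stay NaN when a pair of columns has
    fewer than two rows in which both values are present.
    """
    data = np.asarray(data, dtype=float)
    n_cols = data.shape[1]
    valid = ~np.isnan(data)                  # True where a value is present
    out = np.full((n_cols, n_cols), np.nan)
    for i in range(n_cols):
        for j in range(i, n_cols):
            keep = valid[:, i] & valid[:, j]  # rows complete for this pair
            if keep.sum() < 2:
                continue                      # not enough data; leave NaN
            r = np.corrcoef(data[keep, i], data[keep, j])[0, 1]
            out[i, j] = out[j, i] = r
    return out

# Made-up data in the same spirit as the example above: negatives become NaN
rng = np.random.default_rng(0)
sample = rng.standard_normal((100, 5))
sample[sample < 0] = np.nan

corr = pairwise_nan_corr(sample)
print(np.round(corr, 4))
```

The double loop is quadratic in the number of columns, which is fine for a handful of variables; for wide data, df.corr() is likely the more practical choice.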
    
  • 2020-12-29 07:07

    If the NaNs fall at different positions in each array, you can take the logical AND of the two non-NaN masks so that only positions valid in both arrays are kept.

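    import numpy as np
    import numpy.ma as ma
    
    # Same sample data as in the first answer
    A = [1, 2, 3, 4, 5, np.nan]
    B = [2, 3, 4, 5.25, np.nan, 100]
    
    a = ma.masked_invalid(A)
    b = ma.masked_invalid(B)
    
    # True only where both a and b hold a valid (unmasked) value
    msk = ~ma.getmaskarray(a) & ~ma.getmaskarray(b)
    
    print(ma.corrcoef(a[msk], b[msk]))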
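
The same logical-AND idea can also be sketched without masked arrays, using np.isnan directly. The helper name `nan_corrcoef_pair` is hypothetical; the data are the A and B lists from the first answer.

```python
import numpy as np

def nan_corrcoef_pair(x, y):
    """Correlate x and y using only positions where neither value is NaN.

    Hypothetical helper for illustration; it is not a NumPy function.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~np.isnan(x) & ~np.isnan(y)  # logical AND of the non-NaN masks
    return np.corrcoef(x[keep], y[keep])

A = [1, 2, 3, 4, 5, np.nan]
B = [2, 3, 4, 5.25, np.nan, 100]

result = nan_corrcoef_pair(A, B)
print(result)
```

Only the first four positions survive the mask, so the off-diagonal value should agree with the 0.99838143945703 reported in the first answer.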
    