Pandas Correlation Groupby

灰色年华 2020-12-02 19:04

Assuming I have a dataframe similar to the one below, how would I get the correlation between two specific columns, computed separately for each value of the 'ID' column? I believe the Pandas 'cor

3 Answers
  • 2020-12-02 19:29

    Building on the answer above: since ix has been deprecated (and was removed in pandas 1.0), use iloc instead, with a few other minor changes:

    df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2][['Val2']] # to get pandas DataFrame
    

    or

    df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2]['Val2'] # to get pandas Series
    
  • 2020-12-02 19:35

    One more simple solution:

    df.groupby('ID')[['Val1','Val2']].corr().unstack().iloc[:,1]
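
A minimal sketch of what that one-liner selects, using made-up sample data (the question's actual dataframe is not shown, so the values here are an assumption). After unstack(), the columns form a two-level index of (row variable, column variable) pairs, and position 1 is the ('Val1', 'Val2') pair. Selecting that pair by label is a possible alternative that does not depend on column order:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'ID':   ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Val1': [1, 2, 3, 4, 1, 2, 3, 4],
    'Val2': [2, 4, 6, 8, 4, 3, 2, 1],
})

# One row per ID; columns are (variable, variable) pairs
wide = df.groupby('ID')[['Val1', 'Val2']].corr().unstack()

by_position = wide.iloc[:, 1]         # second column, as in the answer
by_label = wide[('Val1', 'Val2')]     # the same column, selected by name

print(by_label)
```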
    
  • 2020-12-02 19:38

    You pretty much figured out all the pieces; you just need to combine them:

    >>> df.groupby('ID')[['Val1','Val2']].corr()
    
                 Val1      Val2
    ID                         
    A  Val1  1.000000  0.500000
       Val2  0.500000  1.000000
    B  Val1  1.000000  0.385727
       Val2  0.385727  1.000000
    

    In your case, printing out a 2x2 for each ID is excessively verbose. I don't see an option to print a scalar correlation instead of the whole matrix, but you can do something simple like this if you only have two variables:

    >>> df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2,-1]
    
    ID       
    A   Val1    0.500000
    B   Val1    0.385727
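
If a single scalar per ID is all that is needed, one possible alternative is to call Series.corr inside a groupby apply, which skips the 2x2 matrix entirely. This is only a sketch with made-up sample data (an assumption, since the question's dataframe is not shown):

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'ID':   ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Val1': [1, 2, 3, 4, 1, 2, 3, 4],
    'Val2': [2, 4, 6, 8, 4, 3, 2, 1],
})

# For each ID, compute the Pearson correlation between the two columns
scalar_corr = (
    df.groupby('ID')[['Val1', 'Val2']]
      .apply(lambda g: g['Val1'].corr(g['Val2']))
)

print(scalar_corr)
```

The result is a Series indexed by ID, with one correlation value per group.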
    

    For the more general case of 3+ variables

    For 3 or more variables, it is not straightforward to create concise output but you could do something like this:

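    import pandas as pd

    groups = ['Val1', 'Val2', 'Val3', 'Val4']
    # Stack the per-ID correlation matrices into a Series indexed by (ID, v1, v2)
    corr = df.groupby('ID')[groups].corr().stack()
    pieces = []
    for i in range(len(groups) - 1):
        # Keep only the pairs where v2 comes after v1 (the upper triangle)
        pieces.append(corr.loc[:, groups[i], groups[i+1]:].reset_index())

    # pd.concat replaces DataFrame.append, which pandas 2.0 removed
    df2 = pd.concat(pieces, ignore_index=True)
    df2.columns = ['ID', 'v1', 'v2', 'corr']
    df2.set_index(['ID','v1','v2']).sort_index()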
    

    Note that if we didn't have the groupby element, it would be straightforward to use an upper or lower triangle function from numpy. But since that element is present, it is not so easy to produce concise output in a more elegant manner as far as I can tell.
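
For what it's worth, the numpy upper-triangle idea can still be applied one group at a time. The sketch below is not the answer's method; pairwise_corr is a hypothetical helper and the sample data is made up. It extracts the upper triangle of each group's correlation matrix with np.triu_indices and concatenates the results:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'ID':   ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Val1': [1, 2, 3, 4, 1, 2, 3, 4],
    'Val2': [2, 4, 6, 8, 4, 3, 2, 1],
    'Val3': [4, 3, 2, 1, 1, 2, 3, 4],
})
cols = ['Val1', 'Val2', 'Val3']

def pairwise_corr(group, cols):
    """Return the upper-triangle correlations of one group as a Series."""
    matrix = group[cols].corr().to_numpy()
    # Positions strictly above the diagonal (k=1 skips the diagonal itself)
    rows, columns = np.triu_indices(len(cols), k=1)
    index = pd.MultiIndex.from_arrays(
        [[cols[r] for r in rows], [cols[c] for c in columns]],
        names=['v1', 'v2'],
    )
    return pd.Series(matrix[rows, columns], index=index, name='corr')

# One Series per ID, stacked under an outer 'ID' index level
result = pd.concat(
    {key: pairwise_corr(group, cols) for key, group in df.groupby('ID')},
    names=['ID'],
)

print(result)
```

Each ID contributes n*(n-1)/2 rows for n variables, so the output stays concise as more columns are added.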
