How to perform three variable correlation with Python Pandas

轻奢々 2020-12-19 17:58

The pandas corr() function only computes pairwise correlations. But how do you calculate the correlation of three variables in a data frame using salary as the d

1 Answer
  • 2020-12-19 18:23

    You can calculate the correlation of a dependent variable with two independent variables by first getting the pairwise correlation coefficients with pandas, then combining them with the multiple correlation coefficient formula to obtain R-squared. R-squared is slightly biased upward, so you may prefer the adjusted R-squared value, which corrects for the sample size and the number of independent variables. The equation can also be extended to take more independent variables into account. The following is a Python adaptation of an excellent article by Mr. Charles Zaiontz: http://www.real-statistics.com/correlation/multiple-correlation/
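
    For reference, the two quantities used in this answer can be written out as below. Here r_xz, r_yz and r_xy are the pairwise Pearson correlations, n is the number of rows and k is the number of independent variables; the notation is an assumption chosen to mirror the variable names in the code, not something taken from the linked article.

    ```latex
    R_{z,xy}^2 = \frac{r_{xz}^2 + r_{yz}^2 - 2\, r_{xz}\, r_{yz}\, r_{xy}}{1 - r_{xy}^2},
    \qquad
    R_{\text{adj}}^2 = 1 - \frac{(1 - R_{z,xy}^2)(n - 1)}{n - k - 1}
    ```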

    import math
    import pandas as pd
    
    df = pd.DataFrame({
        'IQ':[100,140,90,85,120,110,95], 
        'GPA':[3.2,4.0,2.9,2.5,3.6,3.4,3.0],
        'SALARY':[45e3,150e3,30e3,25e3,75e3,60e3,38e3]
        })
    
    # Get pairwise correlation coefficients
    cor = df.corr()
    
    # Independent variables
    x = 'IQ'
    y = 'GPA'
    
    # Dependent variable
    z = 'SALARY'
    
    # Pairings
    xz = cor.loc[ x, z ]
    yz = cor.loc[ y, z ]
    xy = cor.loc[ x, y ]
    
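    # Multiple correlation coefficient of z on x and y
    # (squared values are already non-negative, so abs() is not needed)
    Rxyz = math.sqrt((xz**2 + yz**2 - 2*xz*yz*xy) / (1 - xy**2))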
    R2 = Rxyz**2
    
    # Calculate adjusted R-squared
    n = len(df) # Number of rows
    k = 2       # Number of independent variables
    R2_adj = 1 - ( ((1-R2)*(n-1)) / (n-k-1) )
    

    R2, R2_adj = 0.958, 0.937

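    The results show that IQ and GPA together explain roughly 96% of the variance in salary (R-squared). This measures how strongly salary is associated with the two variables; it does not by itself show that salary depends on them causally.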
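
For more than two independent variables, one way to generalize the calculation is to work with the correlation matrix directly: R-squared equals c' Rxx^-1 c, where Rxx holds the correlations among the independent variables and c holds their correlations with the dependent variable. The sketch below assumes that matrix form; the helper names `multiple_r_squared` and `adjusted_r_squared` are hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def multiple_r_squared(df, predictors, target):
    """R-squared of `target` on `predictors`, using only the correlation matrix.

    Hypothetical helper: `predictors` is a list of column names,
    `target` is a single column name.
    """
    cor = df[predictors + [target]].corr()
    r_xx = cor.loc[predictors, predictors].to_numpy()  # predictor-predictor correlations
    c = cor.loc[predictors, target].to_numpy()         # predictor-target correlations
    # R^2 = c' Rxx^-1 c; solve() avoids forming the inverse explicitly
    return float(c @ np.linalg.solve(r_xx, c))


def adjusted_r_squared(r2, n, k):
    """Adjust R-squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


df = pd.DataFrame({
    'IQ': [100, 140, 90, 85, 120, 110, 95],
    'GPA': [3.2, 4.0, 2.9, 2.5, 3.6, 3.4, 3.0],
    'SALARY': [45e3, 150e3, 30e3, 25e3, 75e3, 60e3, 38e3],
})

r2 = multiple_r_squared(df, ['IQ', 'GPA'], 'SALARY')
r2_adj = adjusted_r_squared(r2, n=len(df), k=2)
print(round(r2, 3), round(r2_adj, 3))  # approximately 0.958 and 0.937 for this data
```

With two independent variables this reduces to the pairwise formula, so it should reproduce the same R-squared; adding a third column name to the list is all that is needed to include another variable, provided the independent variables are not perfectly correlated with each other.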
