Variance Inflation Factor in Python

星月不相逢 2020-12-22 23:04

I'm trying to calculate the variance inflation factor (VIF) for each column in a simple dataset in Python:

a b c d
1 2 4 4
1 2 6 3
2 3 7 4
3 2 8 5
4 1 9 4
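
A minimal sketch of one common attempt, assuming the data above is placed in a pandas DataFrame and statsmodels' variance_inflation_factor is evaluated column by column (the exact code being used is not shown in the question):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# the sample data above
df = pd.DataFrame({'a': [1, 1, 2, 3, 4],
                   'b': [2, 2, 3, 2, 1],
                   'c': [4, 6, 7, 8, 9],
                   'd': [4, 3, 4, 5, 4]})

# VIF of each column regressed on the others (note: no intercept is added here)
vifs = pd.Series([variance_inflation_factor(df.values, i) for i in range(df.shape[1])],
                 index=df.columns)
print(vifs)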
8 Answers
  • 2020-12-22 23:35

    I believe the reason for this is a difference in statsmodels' OLS. OLS, which is used in the Python variance inflation factor calculation, does not add an intercept by default. You definitely want an intercept in there, however.

    What you'd want to do is add one more column to your matrix, ck, filled with ones to represent a constant. This will be the intercept term of the regression. Once this is done, your VIF values should match up properly.

    Edited: replaced zeroes with ones
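
    A minimal sketch of that fix, assuming the sample data from the question is in a DataFrame df; add_constant and variance_inflation_factor are the standard statsmodels functions:

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    df = pd.DataFrame({'a': [1, 1, 2, 3, 4],
                       'b': [2, 2, 3, 2, 1],
                       'c': [4, 6, 7, 8, 9],
                       'd': [4, 3, 4, 5, 4]})

    # prepend the intercept column of ones (named 'const')
    X = add_constant(df)

    # VIF for each column, including the added constant
    vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                     index=X.columns)
    print(vifs)  # the entry for 'const' itself can be ignored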

  • In case you don't want to deal with variance_inflation_factor and add_constant, please consider the following two functions.

    1. Use the formula API in statsmodels:

    import pandas as pd
    import statsmodels.formula.api as smf
    
    def get_vif(exogs, data):
        '''Return VIF (variance inflation factor) DataFrame
    
        Args:
        exogs (list): list of exogenous/independent variables
        data (DataFrame): the df storing all variables
    
        Returns:
        VIF and Tolerance DataFrame for each exogenous variable
    
        Notes:
        Assume we have a list of exogenous variable [X1, X2, X3, X4].
        To calculate the VIF and Tolerance for each variable, we regress
        each of them against other exogenous variables. For instance, the
        regression model for X3 is defined as:
                            X3 ~ X1 + X2 + X4
        And then we extract the R-squared from the model to calculate:
                        VIF = 1 / (1 - R-squared)
                        Tolerance = 1 - R-squared
        The cutoff to detect multicollinearity:
                        VIF > 10 or Tolerance < 0.1
        '''
    
        # initialize dictionaries
        vif_dict, tolerance_dict = {}, {}
    
        # create formula for each exogenous variable
        for exog in exogs:
            not_exog = [i for i in exogs if i != exog]
            formula = f"{exog} ~ {' + '.join(not_exog)}"
    
            # extract r-squared from the fit
            r_squared = smf.ols(formula, data=data).fit().rsquared
    
            # calculate VIF
            vif = 1/(1 - r_squared)
            vif_dict[exog] = vif
    
            # calculate tolerance
            tolerance = 1 - r_squared
            tolerance_dict[exog] = tolerance
    
        # return VIF DataFrame
        df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})
    
        return df_vif
    
    

    2. Use LinearRegression in sklearn:

    # import warnings
    # warnings.simplefilter(action='ignore', category=FutureWarning)
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    
    def sklearn_vif(exogs, data):
    
        # initialize dictionaries
        vif_dict, tolerance_dict = {}, {}
    
        # form input data for each exogenous variable
        for exog in exogs:
            not_exog = [i for i in exogs if i != exog]
            X, y = data[not_exog], data[exog]
    
            # extract r-squared from the fit
            r_squared = LinearRegression().fit(X, y).score(X, y)
    
            # calculate VIF
            vif = 1/(1 - r_squared)
            vif_dict[exog] = vif
    
            # calculate tolerance
            tolerance = 1 - r_squared
            tolerance_dict[exog] = tolerance
    
        # return VIF DataFrame
        df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})
    
        return df_vif
    
    

    Example:

    import seaborn as sns
    
    df = sns.load_dataset('car_crashes')
    exogs = ['alcohol', 'speeding', 'no_previous', 'not_distracted']
    
    [In] %%timeit -n 100
    get_vif(exogs=exogs, data=df)
    
    [Out]
                          VIF   Tolerance
    alcohol          3.436072   0.291030
    no_previous      3.113984   0.321132
    not_distracted   2.668456   0.374749
    speeding         1.884340   0.530690
    
    69.6 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    [In] %%timeit -n 100
    sklearn_vif(exogs=exogs, data=df)
    
    [Out]
                          VIF   Tolerance
    alcohol          3.436072   0.291030
    no_previous      3.113984   0.321132
    not_distracted   2.668456   0.374749
    speeding         1.884340   0.530690
    
    15.7 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    