Variance Inflation Factor in Python

星月不相逢 2020-12-22 23:04

I'm trying to calculate the variance inflation factor (VIF) for each column in a simple dataset in Python:

a b c d
1 2 4 4
1 2 6 3
2 3 7 4
3 2 8 5
4 1 9 4
         


        
8 Answers
  • 2020-12-22 23:22

    As mentioned by others and in this post by Josef Perktold, the function's author, variance_inflation_factor expects the presence of a constant in the matrix of explanatory variables. One can use add_constant from statsmodels to add the required constant to the dataframe before passing its values to the function.

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant
    
    df = pd.DataFrame(
        {'a': [1, 1, 2, 3, 4],
         'b': [2, 2, 3, 2, 1],
         'c': [4, 6, 7, 8, 9],
         'd': [4, 3, 4, 5, 4]}
    )
    
    X = add_constant(df)
    >>> pd.Series([variance_inflation_factor(X.values, i) 
                   for i in range(X.shape[1])], 
                  index=X.columns)
    const    136.875
    a         22.950
    b          3.000
    c         12.950
    d          3.000
    dtype: float64
    

    You could also add the constant as the rightmost column of the dataframe using assign; the VIFs do not depend on the column order:

    X = df.assign(const=1)
    >>> pd.Series([variance_inflation_factor(X.values, i) 
                   for i in range(X.shape[1])], 
                  index=X.columns)
    a         22.950
    b          3.000
    c         12.950
    d          3.000
    const    136.875
    dtype: float64
    

    The source code itself is rather concise:

    def variance_inflation_factor(exog, exog_idx):
        """
        exog : ndarray, (nobs, k_vars)
            design matrix with all explanatory variables, as for example used in
            regression
        exog_idx : int
            index of the exogenous variable in the columns of exog
        """
        k_vars = exog.shape[1]
        x_i = exog[:, exog_idx]
        mask = np.arange(k_vars) != exog_idx
        x_noti = exog[:, mask]
        r_squared_i = OLS(x_i, x_noti).fit().rsquared
        vif = 1. / (1. - r_squared_i)
        return vif
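
    To make the role of the constant concrete, here is a minimal NumPy-only sketch of the same auxiliary-regression idea. The name vif_numpy and the add_intercept flag are hypothetical and not part of statsmodels. With the intercept, R-squared is computed against the centered total sum of squares; without it, the sketch uses the uncentered total sum of squares, which is my understanding of how statsmodels reports R-squared for a model without a constant.

```python
import numpy as np

def vif_numpy(X, add_intercept=True):
    """VIF of every column of X from auxiliary least-squares regressions.

    Hypothetical helper for illustration; assumes no column is constant
    and no column is an exact linear combination of the others.
    """
    X = np.asarray(X, dtype=float)
    n_obs, k_vars = X.shape
    vifs = np.empty(k_vars)
    for i in range(k_vars):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        if add_intercept:
            others = np.column_stack([np.ones(n_obs), others])
            tss = np.sum((y - y.mean()) ** 2)  # centered total sum of squares
        else:
            tss = np.sum(y ** 2)               # uncentered total sum of squares
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        ssr = np.sum((y - others @ coef) ** 2)  # residual sum of squares
        r_squared = 1.0 - ssr / tss
        vifs[i] = 1.0 / (1.0 - r_squared)
    return vifs

# Columns a, b, c, d from the question
X = np.array([[1, 2, 4, 4],
              [1, 2, 6, 3],
              [2, 3, 7, 4],
              [3, 2, 8, 5],
              [4, 1, 9, 4]])

with_const = vif_numpy(X)                          # matches the add_constant results for a, b, c, d
without_const = vif_numpy(X, add_intercept=False)  # what omitting the constant would give
print(with_const)
print(without_const)
```

    Comparing the two printed arrays shows how much the result depends on whether the constant is included.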
    

    It is also rather simple to modify the code to return all of the VIFs as a series:

    import pandas as pd
    from statsmodels.regression.linear_model import OLS
    from statsmodels.tools.tools import add_constant
    
    def variance_inflation_factors(exog_df):
        '''
        Parameters
        ----------
        exog_df : dataframe, (nobs, k_vars)
            design matrix with all explanatory variables, as for example used in
            regression.
    
        Returns
        -------
        vif : Series
            variance inflation factors
        '''
        exog_df = add_constant(exog_df)
        vifs = pd.Series(
            [1 / (1. - OLS(exog_df[col].values, 
                           exog_df.loc[:, exog_df.columns != col].values).fit().rsquared) 
             for col in exog_df],
            index=exog_df.columns,
            name='VIF'
        )
        return vifs
    
    >>> variance_inflation_factors(df)
    const    136.875
    a         22.950
    b          3.000
    c         12.950
    d          3.000
    Name: VIF, dtype: float64
    

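    Per the solution of @T_T, one can also compute the VIFs of the original columns directly from the inverse of the correlation matrix (no constant is needed, and no const entry is returned):

    import numpy as np

    vifs = pd.Series(np.linalg.inv(df.corr().to_numpy()).diagonal(), 
                     index=df.columns, 
                     name='VIF')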
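
    The correlation-matrix shortcut rests on a standard identity, sketched here under the assumptions that every auxiliary regression includes an intercept and that the correlation matrix is invertible. The symbols are notation introduced for this note: C is the correlation matrix of the explanatory variables, and R_i^2 is the R-squared from regressing variable i on all the others.

```latex
\mathrm{VIF}_i \;=\; \frac{1}{1 - R_i^2} \;=\; \left(\mathbf{C}^{-1}\right)_{ii}
```

    Because correlations are computed from centered data, the intercept is already accounted for, which is why these values agree with the add_constant results for a, b, c and d.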
    
  • 2020-12-22 23:28

    I wrote this function based on some other posts I saw on Stack Overflow and Cross Validated. It prints the features whose VIF is over the threshold and returns a new dataframe with those features removed.

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant
    
    def calculate_vif_(df, thresh=5):
        '''
        Calculates the VIF of each feature in a pandas dataframe.
        A constant must be added before calling variance_inflation_factor,
        or the results will be incorrect.
    
        :param df: the pandas dataframe containing only the predictor features, not the response variable
        :param thresh: the max VIF value before the feature is removed from the dataframe
        :return: dataframe with features removed
        '''
        const = add_constant(df)
        vif_df = pd.Series([variance_inflation_factor(const.values, i)
                            for i in range(const.shape[1])],
                           index=const.columns, name='VIF').to_frame()
    
        # Ignore the constant, then keep only the features above the threshold
        vif_df = vif_df.drop('const').sort_values(by='VIF', ascending=False)
        vif_df = vif_df[vif_df['VIF'] > thresh]
    
        print('Features above VIF threshold:\n')
        print(vif_df)
    
        col_to_drop = list(vif_df.index)
    
        for i in col_to_drop:
            print('Dropping: {}'.format(i))
            df = df.drop(columns=i)
    
        return df
    
  • 2020-12-22 23:30

    Example for the Boston housing data:

    VIF is computed from auxiliary regressions of each explanatory variable on the others, so it does not depend on the response variable or on the fitted model.

    See below:

    from patsy import dmatrices
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    import statsmodels.api as sm
    
    # Break into left and right hand side; y and X
    # (boston is assumed to be a DataFrame of the Boston housing data, already loaded)
    y, X = dmatrices("medv ~ crim + zn + nox + ptratio + black + rm", data=boston, return_type="dataframe")
    
    # For each Xi, calculate VIF
    vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    
    # Fit X to y
    result = sm.OLS(y, X).fit()
    
  • 2020-12-22 23:31

    I am late to this thread, but here is a modification of the earlier answers. If we drop every feature above the threshold at once, as in @Chef1075's solution, we lose all of the correlated variables, when only one of them needs to be removed. To get the best remaining set, I came up with the following solution based on @steve's answer. It drops the variable with the highest VIF, recomputes the VIFs, and repeats until none exceeds the threshold:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    
    def sklearn_vif(exogs, data):
        '''
        Calculates the variance inflation factor (VIF) and the tolerance of
        each variable using scikit-learn's LinearRegression.
        It is a comparatively fast approach.
        '''
        # initialize dictionaries
        vif_dict, tolerance_dict = {}, {}
    
        # form input data for each exogenous variable
        for exog in exogs:
            not_exog = [i for i in exogs if i != exog]
            X, y = data[not_exog], data[exog]
    
            # extract r-squared from the fit
            r_squared = LinearRegression().fit(X, y).score(X, y)
    
            # calculate VIF
            vif = 1/(1 - r_squared)
            vif_dict[exog] = vif
    
            # calculate tolerance
            tolerance = 1 - r_squared
            tolerance_dict[exog] = tolerance
    
        # return VIF DataFrame
        df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})
    
        return df_vif
    
    df = pd.DataFrame(
        {'a': [1, 1, 2, 3, 4, 1],
         'b': [2, 2, 3, 2, 1, 3],
         'c': [4, 6, 7, 8, 9, 5],
         'd': [4, 3, 4, 5, 4, 6],
         'e': [8, 8, 14, 15, 17, 20]}
    )
    
    # Drop the highest-VIF variable one at a time, recomputing after each drop
    df_vif = sklearn_vif(exogs=df.columns, data=df).sort_values(by='VIF', ascending=False)
    while (df_vif['VIF'] > 5).any():
        red_df_vif = df_vif.drop(df_vif.index[0])
        df = df[red_df_vif.index]
        df_vif = sklearn_vif(exogs=df.columns, data=df).sort_values(by='VIF', ascending=False)
    
    print(df)
    
       d  c  b
    0  4  4  2
    1  3  6  2
    2  4  7  3
    3  5  8  2
    4  4  9  1
    5  6  5  3
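
    The same one-at-a-time elimination can be sketched without scikit-learn. This is a minimal NumPy version under stated assumptions, not the answer's own code: aux_vifs and drop_high_vif are hypothetical names, each auxiliary regression includes an intercept (as LinearRegression does by default), and no column is constant.

```python
import numpy as np

def aux_vifs(X):
    """VIF of each column of X via auxiliary regressions with an intercept."""
    X = np.asarray(X, dtype=float)
    n_obs, k_vars = X.shape
    out = np.empty(k_vars)
    for i in range(k_vars):
        y = X[:, i]
        A = np.column_stack([np.ones(n_obs), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        ssr = np.sum((y - A @ coef) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        tolerance = ssr / tss  # equals 1 - R^2
        out[i] = np.inf if tolerance <= 0 else 1.0 / tolerance
    return out

def drop_high_vif(data, names, thresh=5.0):
    """Repeatedly drop the column with the largest VIF until all are <= thresh."""
    data = np.asarray(data, dtype=float)
    names = list(names)
    while data.shape[1] > 1:
        v = aux_vifs(data)
        worst = int(np.argmax(v))
        if v[worst] <= thresh:
            break
        data = np.delete(data, worst, axis=1)  # remove only the worst column
        del names[worst]
    return names, data

# Same six-row example as above
names = ['a', 'b', 'c', 'd', 'e']
data = np.column_stack([
    [1, 1, 2, 3, 4, 1],
    [2, 2, 3, 2, 1, 3],
    [4, 6, 7, 8, 9, 5],
    [4, 3, 4, 5, 4, 6],
    [8, 8, 14, 15, 17, 20],
])

kept, reduced = drop_high_vif(data, names, thresh=5.0)
print(kept)
print(aux_vifs(reduced))
```

    Recomputing the VIFs after every drop is the point of the design: removing one variable usually lowers the VIFs of the variables it was correlated with, so they may no longer need to be removed.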
    
  • 2020-12-22 23:34

    Here is the code using a pandas DataFrame.

    Create the data:

    import numpy as np

    a = [1, 1, 2, 3, 4]
    b = [2, 2, 3, 2, 1]
    c = [4, 6, 7, 8, 9]
    d = [4, 3, 4, 5, 4]

    Create the DataFrame:

    import pandas as pd
    data = pd.DataFrame()
    data["a"] = a
    data["b"] = b
    data["c"] = c
    data["d"] = d

    Calculate the VIFs:

    cc = np.corrcoef(data, rowvar=False)
    VIF = np.linalg.inv(cc)
    VIF.diagonal()

    Result:

    array([22.95, 3. , 12.95, 3. ])

  • 2020-12-22 23:34

    For future visitors to this thread (like me):

    import numpy as np
    
    a = [1, 1, 2, 3, 4]
    b = [2, 2, 3, 2, 1]
    c = [4, 6, 7, 8, 9]
    d = [4, 3, 4, 5, 4]
    
    ck = np.column_stack([a, b, c, d])
    # np.corrcoef replaces the old scipy.corrcoef alias, which newer SciPy versions no longer provide
    cc = np.corrcoef(ck, rowvar=False)
    VIF = np.linalg.inv(cc)
    VIF.diagonal()
    

    This code gives

    array([22.95,  3.  , 12.95,  3.  ])
    

    [EDIT]

    In response to a comment, I tried to use DataFrame as much as possible (numpy is required to invert a matrix).

    import pandas as pd
    import numpy as np
    
    a = [1, 1, 2, 3, 4]
    b = [2, 2, 3, 2, 1]
    c = [4, 6, 7, 8, 9]
    d = [4, 3, 4, 5, 4]
    
    df = pd.DataFrame({'a':a,'b':b,'c':c,'d':d})
    df_cor = df.corr()
    pd.DataFrame(np.linalg.inv(df.corr().values), index = df_cor.index, columns=df_cor.columns)
    

    The code gives

               a          b          c          d
    a  22.950000   6.453681 -16.301917  -6.453681
    b   6.453681   3.000000  -4.080441  -2.000000
    c -16.301917  -4.080441  12.950000   4.080441
    d  -6.453681  -2.000000   4.080441   3.000000
    

    The diagonal elements are the VIFs.
