How to drop insignificant categorical interaction terms Python StatsModel

Submitted by 别说谁变了你拦得住时间么 on 2020-01-04 12:56:32

Question


In statsmodels it is easy to add interaction terms. However, not all of the interactions are significant. My question is: how do I drop the ones that are insignificant? For example, the airports interaction for Kootenay.

# -*- coding: utf-8 -*-
import pandas as pd
import statsmodels.formula.api as sm


if __name__ == "__main__":

    # Read data
    census_subdivision_without_lower_mainland_and_van_island = pd.read_csv('../data/augmented/census_subdivision_without_lower_mainland_and_van_island.csv')

    # Fit all data
    fit = sm.ols(formula="instagram_posts ~ airports * C(CNMCRGNNM) + ports_and_ferry_terminals + railway_stations + accommodations + visitor_centers + festivals + attractions + C(CNMCRGNNM) + C(CNSSSBDVS3)", data=census_subdivision_without_lower_mainland_and_van_island).fit()
    print(fit.summary())


Answer 1


I tried to recreate some of the data, focusing on the variables in the interaction. I'm not sure whether the objective is only to extract the values or whether you need a specific format, but here is an example of how to solve it with pandas (since you're already importing pandas in the original post):

import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

np.random.seed(2)

df = pd.DataFrame()
df['instagram_posts'] = np.random.rand(50)
df['airports'] = np.random.rand(50)
df['CNMCRGNNM'] = np.random.choice(['Kootenay','Nechako','North Coast','Northeast','Thompson-Okanagan'],50)

fit = sm.ols(formula="instagram_posts ~ airports * C(CNMCRGNNM)",data=df).fit()
print(fit.summary())

This is the output:

==============================================================================================================
                                                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------
Intercept                                      0.4594      0.159      2.885      0.006       0.138       0.781
C(CNMCRGNNM)[T.Nechako]                       -0.2082      0.195     -1.067      0.292      -0.602       0.186
C(CNMCRGNNM)[T.North Coast]                   -0.1268      0.360     -0.352      0.726      -0.854       0.601
C(CNMCRGNNM)[T.Northeast]                      0.0930      0.199      0.468      0.642      -0.309       0.495
C(CNMCRGNNM)[T.Thompson-Okanagan]              0.1439      0.245      0.588      0.560      -0.351       0.638
airports                                      -0.1616      0.277     -0.583      0.563      -0.722       0.398
airports:C(CNMCRGNNM)[T.Nechako]               0.7870      0.343      2.297      0.027       0.094       1.480
airports:C(CNMCRGNNM)[T.North Coast]           0.3008      0.788      0.382      0.705      -1.291       1.893
airports:C(CNMCRGNNM)[T.Northeast]            -0.0104      0.348     -0.030      0.976      -0.713       0.693
airports:C(CNMCRGNNM)[T.Thompson-Okanagan]    -0.0311      0.432     -0.072      0.943      -0.904       0.842

Change alpha to your preferred level of significance:

alpha = 0.05
# keep only the rows of the coefficient table whose p-value (5th column) is below alpha
table = fit.summary().tables[1].data
df = pd.DataFrame(data=[row for row in table[1:] if float(row[4]) < alpha],
                  columns=table[0])

The data frame df holds the rows of the original table that are significant at level alpha. In this case, they are Intercept and airports:C(CNMCRGNNM)[T.Nechako].
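
If you only need the names (and coefficients) of the significant terms rather than the formatted summary table, a simpler alternative sketch, assuming the same fitted fit object, is to filter the pvalues Series that statsmodels exposes directly:

alpha = 0.05
# fit.pvalues is a pandas Series indexed by term name
significant_terms = fit.pvalues[fit.pvalues < alpha].index
print(list(significant_terms))        # e.g. ['Intercept', 'airports:C(CNMCRGNNM)[T.Nechako]']
print(fit.params[significant_terms])  # the corresponding coefficients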




Answer 2


You might also want to consider dropping the features one at a time, starting with the most insignificant one, because a feature can become significant (or insignificant) depending on the presence or absence of another. The code below does this for you (I'm assuming you've already defined your X and y):

import operator

import pandas as pd
import statsmodels.api as sm

def remove_most_insignificant(df, results):
    # find the name of the feature with the largest p-value
    max_p_value = max(results.pvalues.items(), key=operator.itemgetter(1))[0]
    # drop that feature from the design matrix
    df.drop(columns=max_p_value, inplace=True)
    return df

insignificant_feature = True
while insignificant_feature:
    model = sm.OLS(y, X)
    results = model.fit()
    significant = [p_value < 0.05 for p_value in results.pvalues]
    if all(significant):
        insignificant_feature = False
    else:
        if X.shape[1] == 1:  # only one (insignificant) feature left
            print('No significant features found')
            results = None
            insignificant_feature = False
        else:
            X = remove_most_insignificant(X, results)

if results is not None:
    print(results.summary())
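
To connect this back to the original question, one way to obtain X and y as pandas DataFrames is to build the design matrices with patsy (the library statsmodels uses for its formula interface). This is only a sketch under that assumption; the file path and formula are taken from the question:

import pandas as pd
from patsy import dmatrices

data = pd.read_csv('../data/augmented/census_subdivision_without_lower_mainland_and_van_island.csv')
formula = ("instagram_posts ~ airports * C(CNMCRGNNM) + ports_and_ferry_terminals"
           " + railway_stations + accommodations + visitor_centers + festivals"
           " + attractions + C(CNMCRGNNM) + C(CNSSSBDVS3)")
# return_type='dataframe' makes X a DataFrame, so X.drop(columns=...) in the loop above works
y, X = dmatrices(formula, data=data, return_type='dataframe')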


Source: https://stackoverflow.com/questions/44962286/how-to-drop-insignificant-categorical-interaction-terms-python-statsmodel
