How do I get the columns that a statsmodels / patsy formula depends on?

Submitted by 自古美人都是妖i on 2019-12-22 10:55:57

Question


Suppose I have a pandas dataframe:

import pandas as pd

df = pd.DataFrame({'x1': [0, 1, 2, 3, 4],
                   'x2': [10, 9, 8, 7, 6],
                   'x3': [.1, .1, .2, 4, 8],
                   'y': [17, 18, 19, 20, 21]})

Now I fit a statsmodels model using a formula (which uses patsy under the hood):

import statsmodels.formula.api as smf
fit = smf.ols(formula='y ~ x1:x2', data=df).fit()

What I want is a list of the columns of df that fit depends on, so that I can use fit.predict() on another dataset. If I try list(fit.params.index), for example, I get:

['Intercept', 'x1:x2']

I've tried recreating the patsy design matrix, and using design_info, but I still only ever get x1:x2. What I want is:

['x1', 'x2']

Or even:

['Intercept', 'x1', 'x2']

How can I get this from just the fit object?


Answer 1:


Simply test whether each column name appears in the string representation of the formula:

ols = smf.ols(formula='y ~ x1:x2', data=df)
fit = ols.fit()

# Keep every column whose name occurs in the formula string
print([c for c in df.columns if c in ols.formula])
# ['x1', 'x2', 'y']
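Note that the `in` check above is a substring test, so a column named `x` would also match `y ~ x1:x2`. A minimal sketch of a stricter variant, assuming plain identifier-style column names, compares whole tokens instead. The helper name `columns_in_formula` and the extra column `x` are made up for illustration:

```python
import re

def columns_in_formula(formula, columns):
    """Return the entries of `columns` that occur as whole identifiers in `formula`."""
    # Split the formula into identifier-like tokens (hypothetical helper, not a patsy API)
    tokens = set(re.findall(r"[A-Za-z_]\w*", formula))
    return [c for c in columns if c in tokens]

columns = ["x", "x1", "x2", "x3", "y"]  # "x" is a hypothetical extra column
formula = "y ~ x1:x2"

print([c for c in columns if c in formula])  # substring test: ['x', 'x1', 'x2', 'y']
print(columns_in_formula(formula, columns))  # whole-token test: ['x1', 'x2', 'y']
```

This is still only a heuristic: function names used inside the formula (for example `log`) are tokens too, so a column that shares such a name would be reported as well.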

A more verbose but more reliable approach is to reconstruct the patsy model description from the formula. It does not depend on the original data frame:

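import patsy

# Rebuild the model description from the formula string alone
md = patsy.ModelDesc.from_formula(ols.formula)
termlist = md.rhs_termlist + md.lhs_termlist

# Collect the name of every factor in every term
factors = []
for term in termlist:
    for factor in term.factors:
        factors.append(factor.name())

print(factors)
# ['x1', 'x2', 'y']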
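Since the question asks only for the predictor columns, the same idea can be wrapped so that the left-hand and right-hand sides are reported separately and repeated factors are listed once. This is a sketch rather than an official patsy recipe, and the helper name `formula_factors` is hypothetical:

```python
import patsy

def formula_factors(formula):
    """Return (lhs, rhs) lists of unique factor names for a patsy formula string."""
    desc = patsy.ModelDesc.from_formula(formula)

    def names(termlist):
        seen = []
        for term in termlist:
            for factor in term.factors:  # the intercept term has no factors
                name = factor.name()
                if name not in seen:     # x1 may appear in several terms
                    seen.append(name)
        return seen

    return names(desc.lhs_termlist), names(desc.rhs_termlist)

lhs, rhs = formula_factors("y ~ x1:x2 + x1")
print(lhs)  # ['y']
print(rhs)  # ['x1', 'x2']
```

Starting from a fitted result, the formula string should be reachable as `fit.model.formula` (the answer above reads it from `ols.formula`), so `formula_factors(fit.model.formula)` would be the call to try. Keep in mind that `factor.name()` returns the factor's code, so a transformed term such as `np.log(x1)` comes back as `'np.log(x1)'`, not as the bare column name.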



Answer 2:


predict accepts a data frame or a dictionary with the same structure as the original data, and a call to patsy converts it in a compatible way. To replicate this, see the code in statsmodels.base.model.Results.predict, the core of which is:

exog = dmatrix(self.model.data.design_info.builder,
               exog, return_type="dataframe")

The formula information itself is stored in the description of the terms in design_info. The variable names themselves are used in summary() and as the index of the returned pandas Series, for example in results.params.
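As a small illustration of what this means in practice, the sketch below refits the model from the question and calls predict on a new data frame that contains only x1 and x2. The values in `new_data` are made up, and the manual check at the end assumes the parameter labels shown in the question ('Intercept' and 'x1:x2'):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'x1': [0, 1, 2, 3, 4],
                   'x2': [10, 9, 8, 7, 6],
                   'x3': [.1, .1, .2, 4, 8],
                   'y': [17, 18, 19, 20, 21]})
fit = smf.ols(formula='y ~ x1:x2', data=df).fit()

# New data holding only the columns on the right-hand side of the formula
# (values are hypothetical; neither x3 nor y is present)
new_data = pd.DataFrame({'x1': [5, 6], 'x2': [5, 4]})
pred = fit.predict(new_data)
print(pred)

# The same prediction computed by hand from the fitted parameters
manual = fit.params['Intercept'] + fit.params['x1:x2'] * new_data['x1'] * new_data['x2']
print(manual)
```

The two printed Series should agree, which suggests that the formula's right-hand side alone determines the columns the new data set has to provide.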



Source: https://stackoverflow.com/questions/43378033/how-do-i-get-the-columns-that-a-statsmodels-patsy-formula-depends-on
