问题
I have a dataframe df
with the following fields: weight
, length
, and animal
. The first 2 are continuous variables, while animal
is a categorical variable with the values cat
, dog
, and snake
.
I'd like to estimate a relationship between weight and length, but this needs to be conditioned on the type of animal, so I interact the length variable with the animal
categorical variable.
model = ols(formula='weight ~ length * animal', data=df)
results = model.fit()
How can I programmatically extract the slope of the relationship between weight and length for e.g. snakes? I understand how to do this manually: add the coefficient for length
to the coefficient for animal[T.snake]:length
. But this is somewhat cumbersome and manual, and requires me to handle the base case specially, so I'd like to extract this information automatically.
Furthermore, I'd like to estimate the error on this slope. I believe I understand how to calculate this by combining the standard errors and covariances (more precisely, performing the calculation here). But this is even more cumbersome than the above, and I'm similarly wondering if there's a shortcut to extract this information.
My manual method to calculate these follows.
EDIT (06/22/2015): there seems to be an error in my original code below for calculating errors. The standard errors as calculated in user333700's answer are different from the ones I calculate, but I haven't invested any time in figuring out why.
def get_contained_animal(animals, p):
# This relies on parameters of the form animal[T.snake]:length.
for a in animals:
if a in p:
return a
return None
animals = ['cat', 'dog', 'snake']
slopes = {}
errors = {}
for animal in animals:
slope = 0.
params = []
# If this param is related to the length variable and
# the animal in question, add it to the slope.
for param, val in results.params.iteritems():
ac = get_contained_animal(animals, param)
if (param == 'length' or
('length' in param and
ac is None or ac == animal)):
params.append(param)
slope += val
# Calculate the overall error by adding standard errors and
# covariances.
tot_err = 0.
for i, p1 in enumerate(params):
tot_err += results.bse[p1]*results.bse[p1]
for j, p2 in enumerate(params[i:]):
# add covariance of these parameters
tot_err += 2*results.cov_params()[p1][p2]
slopes[animal] = slope
errors[animal] = tot_err**0.5
This code might seem like overkill, but in my real-world use case I have a continuous variable interacting with two separate categorical variables, each with a large number of categories (along with other terms in the model that I need to ignore for these purposes).
回答1:
Very brief background:
The general question for this is how does the prediction change if we change on of the explanatory variables, holding other explanatory variables fixed or averaging over those.
In the nonlinear discrete models, there is a special Margins method that calculates this, although it is not implemented for changes in categorical variables.
In the linear model, the prediction and change in prediction is just a linear function of the estimated parameters, and we can (mis)use t_test
to calculate the effect, its standard error and confidence interval for us.
(Aside: There are more helper methods in the works for statsmodels to make prediction and margin calculations like this easier and will be available most likely later in the year.)
As brief explanation of the following code:
- I make up a similar example.
- I define the explanatory variables for length = 1 or 2, for each animal type
- Then, I calculate the difference in these explanatory variables
- This defines linear combinations or contrast of parameters, that can be used in t_test.
Finally, I compare with the result from predict to check that I didn't make any obvious mistakes. (I assume this is correct but I had written it pretty fast.)
import numpy as np
import pandas as pd
from statsmodels.regression.linear_model import OLS
np.random.seed(2)
nobs = 20
animal_names = np.array(['cat', 'dog', 'snake'])
animal_idx = np.random.random_integers(0, 2, size=nobs)
animal = animal_names[animal_idx]
length = np.random.randn(nobs) + animal_idx
weight = np.random.randn(nobs) + animal_idx + length
data = pd.DataFrame(dict(length=length, weight=weight, animal=animal))
res = OLS.from_formula('weight ~ length * animal', data=data).fit()
print(res.summary())
data_predict1 = data = pd.DataFrame(dict(length=np.ones(3), weight=np.ones(3),
animal=animal_names))
data_predict2 = data = pd.DataFrame(dict(length=2*np.ones(3), weight=np.ones(3),
animal=animal_names))
import patsy
x1 = patsy.dmatrix('length * animal', data_predict1)
x2 = patsy.dmatrix('length * animal', data_predict2)
tt = res.t_test(x2 - x1)
print(tt.summary(xname=animal_names.tolist()))
The result of the last print is
Test for Constraints
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
cat 1.0980 0.280 3.926 0.002 0.498 1.698
dog 0.9664 0.860 1.124 0.280 -0.878 2.811
snake 1.5930 0.428 3.720 0.002 0.675 2.511
We can verify the results by using predict and compare the difference in predicted weight if the length for a given animal type increases from 1 to 2:
>>> [res.predict({'length': 2, 'animal':[an]}) - res.predict({'length': 1, 'animal':[an]}) for an in animal_names]
[array([ 1.09801656]), array([ 0.96641455]), array([ 1.59301594])]
>>> tt.effect
array([ 1.09801656, 0.96641455, 1.59301594])
Note: I forgot to add a seed for the random numbers and the numbers cannot be replicated.
来源:https://stackoverflow.com/questions/30948997/evaluate-slope-and-error-for-specific-category-for-statsmodels-ols-fit