Question
In the code for the main OLS class in Python Pandas, I am looking for help to clarify what conventions are used for the standard error and t-stats reported when weighted OLS is performed.
Here's my example data set, with some imports to use pandas and to call statsmodels' WLS directly:
import pandas
import numpy as np
from statsmodels.regression.linear_model import WLS
# Make some random data.
np.random.seed(42)
df = pandas.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])
# Add an intercept term for direct use in WLS
df['intercept'] = 1
# Add a number (I picked 10) to stabilize the weight proportions a little.
df['weights'] = df.weights + 10
# Fit the regression models.
pd_wls = pandas.ols(y=df.a, x=df.b, weights=df.weights)
sm_wls = WLS(df.a, df[['intercept','b']], weights=df.weights).fit()
I use %cpaste to execute this in IPython and then print the summaries of both regressions:
In [226]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import pandas
:import numpy as np
:from statsmodels.regression.linear_model import WLS
:
:# Make some random data.
:np.random.seed(42)
:df = pandas.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])
:
:# Add an intercept term for direct use in WLS
:df['intercept'] = 1
:
:# Add a number (I picked 10) to stabilize the weight proportions a little.
:df['weights'] = df.weights + 10
:
:# Fit the regression models.
:pd_wls = pandas.ols(y=df.a, x=df.b, weights=df.weights)
:sm_wls = WLS(df.a, df[['intercept','b']], weights=df.weights).fit()
:--
In [227]: pd_wls
Out[227]:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 10
Number of Degrees of Freedom: 2
R-squared: 0.2685
Adj R-squared: 0.1770
Rmse: 2.4125
F-stat (1, 8): 2.9361, p-value: 0.1250
Degrees of Freedom: model 1, resid 8
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 0.5768 1.0191 0.57 0.5869 -1.4206 2.5742
intercept 0.5227 0.9079 0.58 0.5806 -1.2567 2.3021
---------------------------------End of Summary---------------------------------
In [228]: sm_wls.summary()
Out[228]:
<class 'statsmodels.iolib.summary.Summary'>
"""
WLS Regression Results
==============================================================================
Dep. Variable: a R-squared: 0.268
Model: WLS Adj. R-squared: 0.177
Method: Least Squares F-statistic: 2.936
Date: Wed, 17 Jul 2013 Prob (F-statistic): 0.125
Time: 15:14:04 Log-Likelihood: -10.560
No. Observations: 10 AIC: 25.12
Df Residuals: 8 BIC: 25.72
Df Model: 1
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
intercept 0.5227 0.295 1.770 0.115 -0.158 1.204
b 0.5768 0.333 1.730 0.122 -0.192 1.346
==============================================================================
Omnibus: 0.967 Durbin-Watson: 1.082
Prob(Omnibus): 0.617 Jarque-Bera (JB): 0.622
Skew: 0.003 Prob(JB): 0.733
Kurtosis: 1.778 Cond. No. 1.90
==============================================================================
"""
Notice the mismatching standard errors: Pandas claims the standard errors are [0.9079, 1.0191]
while statsmodels says [0.295, 0.333].
Back in the code I linked at the top of the post, I tried to track down where the mismatch comes from. First, you can see that the standard errors are reported by this function:
def _std_err_raw(self):
    """Returns the raw standard err values."""
    return np.sqrt(np.diag(self._var_beta_raw))
So looking at self._var_beta_raw, I find:
def _var_beta_raw(self):
    """
    Returns the raw covariance of beta.
    """
    x = self._x.values
    y = self._y.values
    xx = np.dot(x.T, x)
    if self._nw_lags is None:
        return math.inv(xx) * (self._rmse_raw ** 2)
    else:
        resid = y - np.dot(x, self._beta_raw)
        m = (x.T * resid).T
        xeps = math.newey_west(m, self._nw_lags, self._nobs, self._df_raw,
                               self._nw_overlap)
        xx_inv = math.inv(xx)
        return np.dot(xx_inv, np.dot(xeps, xx_inv))
In my use case, self._nw_lags will always be None, so it's the first branch that's puzzling. Since xx is just the plain cross-product of the regressor matrix, x.T.dot(x), I'm wondering how the weights affect this. The term self._rmse_raw comes directly from the statsmodels regression fitted in the constructor of OLS, so it most definitely incorporates the weights.
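To make the difference concrete, here is a minimal NumPy sketch of the two conventions as I understand them (my own reproduction on the toy data above, not the pandas internals; the names X, y, w, and sigma2 are mine): statsmodels appears to use the weighted cross-product X'WX in the covariance, while pandas pairs the unweighted X'X with the weighted RMSE.
import numpy as np

X = df[['intercept', 'b']].values
y = df['a'].values
w = df['weights'].values

# Weighted normal equations: beta = inv(X'WX) X'W y
XtWX = X.T.dot(w[:, None] * X)
beta = np.linalg.solve(XtWX, X.T.dot(w * y))

# Weighted residual variance, which (as far as I can tell) is what
# pd_wls._rmse_raw ** 2 amounts to.
resid = y - X.dot(beta)
sigma2 = (w * resid ** 2).sum() / (len(y) - X.shape[1])

# statsmodels-style standard errors: sqrt(diag(sigma2 * inv(X'WX)))
print(np.sqrt(np.diag(sigma2 * np.linalg.inv(XtWX))))        # ~ [0.295, 0.333]

# pandas-style standard errors: sqrt(diag(sigma2 * inv(X'X)))
print(np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T.dot(X)))))  # ~ [0.908, 1.019]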
This prompts these questions:
- Why is the standard error reported with the weights applied in the RMSE part, but not to the regressor variables?
- Is this standard practice if you want the "non-transformed" variables (wouldn't you then also want the non-transformed RMSE?)? Is there a way to have pandas give back the fully weighted version of the standard error?
- Why all the misdirection? In the constructor, the full statsmodels fitted regression is computed. Why wouldn't absolutely every summary statistic come straight from there? Why is it mixed and matched so that some come from the statsmodels output and some come from Pandas home-cooked calculations?
It looks like I can reconcile the Pandas output by doing the following:
In [238]: xs = df[['intercept', 'b']]
In [239]: trans_xs = xs.values * np.sqrt(df.weights.values[:,None])
In [240]: trans_xs
Out[240]:
array([[ 3.26307961, -0.45116742],
[ 3.12503809, -0.73173821],
[ 3.08715494, 2.36918991],
[ 3.08776136, -1.43092325],
[ 2.87664425, -5.50382662],
[ 3.21158019, -3.25278836],
[ 3.38609639, -4.78219647],
[ 2.92835309, 0.19774643],
[ 2.97472796, 0.32996453],
[ 3.1158155 , -1.87147934]])
In [241]: np.sqrt(np.diag(np.linalg.inv(trans_xs.T.dot(trans_xs)) * (pd_wls._rmse_raw ** 2)))
Out[241]: array([ 0.29525952, 0.33344823])
I'm just very confused by this relationship. Is this common practice among statisticians: involving the weights in the RMSE part, but then choosing whether or not to weight the variables when calculating the standard error of the coefficients? If that's the case, why wouldn't the coefficients themselves also differ between pandas and statsmodels, since those are similarly derived from variables first transformed by statsmodels?
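As a sanity check on that last point, here's a quick sketch (again my own, reusing the sqrt-weight transformation from above) showing that plain least squares on the transformed variables reproduces exactly the coefficients both summaries report, so the divergence really does seem confined to the covariance step:
sw = np.sqrt(df.weights.values)
trans_X = df[['intercept', 'b']].values * sw[:, None]
trans_y = df.a.values * sw

# Ordinary least squares on the sqrt(weight)-transformed data
beta = np.linalg.lstsq(trans_X, trans_y)[0]
print(beta)  # ~ [0.5227, 0.5768], matching both the pandas and statsmodels output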
For reference, here is the full data set used in my toy example (in case np.random.seed isn't sufficient to make it reproducible):
In [242]: df
Out[242]:
a b weights intercept
0 0.496714 -0.138264 10.647689 1
1 1.523030 -0.234153 9.765863 1
2 1.579213 0.767435 9.530526 1
3 0.542560 -0.463418 9.534270 1
4 0.241962 -1.913280 8.275082 1
5 -0.562288 -1.012831 10.314247 1
6 -0.908024 -1.412304 11.465649 1
7 -0.225776 0.067528 8.575252 1
8 -0.544383 0.110923 8.849006 1
9 0.375698 -0.600639 9.708306 1
Answer 1:
Not directly answering your question here, but, in general, you should prefer the statsmodels code to pandas for modeling. There were some recently discovered problems with WLS in statsmodels that are now fixed. AFAIK, they were also fixed in pandas, but for the most part the pandas modeling code is not maintained, and the medium-term goal is to make sure everything available in pandas is deprecated and has been moved to statsmodels (the next statsmodels release, 0.6.0, should do it).
To be a little clearer, pandas is now a dependency of statsmodels. You can pass DataFrames to statsmodels or use formulas in statsmodels. This is the intended relationship going forward.
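For example, something along these lines should work (a sketch assuming a statsmodels version with the formula interface, i.e. 0.5.0+ with patsy installed); the formula adds the intercept automatically, so the manual intercept column isn't needed:
import statsmodels.formula.api as smf

# Fit WLS straight from the DataFrame via a formula
sm_wls_formula = smf.wls('a ~ b', data=df, weights=df['weights']).fit()
print(sm_wls_formula.summary())  # standard errors ~0.295 (Intercept) and ~0.333 (b), as above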
Source: https://stackoverflow.com/questions/17708643/unexpected-standard-errors-with-weighted-least-squares-in-python-pandas