Unexpected standard errors with weighted least squares in Python Pandas


Question


In the code for the main OLS class in pandas, I am looking for help clarifying which conventions are used for the standard errors and t-stats reported when weighted OLS is performed.

Here's my example data set, with some imports to use pandas and to use the statsmodels WLS class directly:

import pandas
import numpy as np
from statsmodels.regression.linear_model import WLS

# Make some random data.
np.random.seed(42)
df = pandas.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])

# Add an intercept term for direct use in WLS
df['intercept'] = 1 

# Add a number (I picked 10) to stabilize the weight proportions a little.
df['weights'] = df.weights + 10

# Fit the regression models.
pd_wls = pandas.ols(y=df.a, x=df.b, weights=df.weights)
sm_wls = WLS(df.a, df[['intercept','b']], weights=df.weights).fit()

I use %cpaste to execute this in IPython and then print the summaries of both regressions:

In [226]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import pandas
:import numpy as np
:from statsmodels.regression.linear_model import WLS
:
:# Make some random data.
:np.random.seed(42)
:df = pandas.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])
:
:# Add an intercept term for direct use in WLS
:df['intercept'] = 1
:
:# Add a number (I picked 10) to stabilize the weight proportions a little.
:df['weights'] = df.weights + 10
:
:# Fit the regression models.
:pd_wls = pandas.ols(y=df.a, x=df.b, weights=df.weights)
:sm_wls = WLS(df.a, df[['intercept','b']], weights=df.weights).fit()
:--

In [227]: pd_wls
Out[227]:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         10
Number of Degrees of Freedom:   2

R-squared:         0.2685
Adj R-squared:     0.1770

Rmse:              2.4125

F-stat (1, 8):     2.9361, p-value:     0.1250

Degrees of Freedom: model 1, resid 8

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.5768     1.0191       0.57     0.5869    -1.4206     2.5742
     intercept     0.5227     0.9079       0.58     0.5806    -1.2567     2.3021
---------------------------------End of Summary---------------------------------



In [228]: sm_wls.summary()
Out[228]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            WLS Regression Results
==============================================================================
Dep. Variable:                      a   R-squared:                       0.268
Model:                            WLS   Adj. R-squared:                  0.177
Method:                 Least Squares   F-statistic:                     2.936
Date:                Wed, 17 Jul 2013   Prob (F-statistic):              0.125
Time:                        15:14:04   Log-Likelihood:                -10.560
No. Observations:                  10   AIC:                             25.12
Df Residuals:                       8   BIC:                             25.72
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
intercept      0.5227      0.295      1.770      0.115        -0.158     1.204
b              0.5768      0.333      1.730      0.122        -0.192     1.346
==============================================================================
Omnibus:                        0.967   Durbin-Watson:                   1.082
Prob(Omnibus):                  0.617   Jarque-Bera (JB):                0.622
Skew:                           0.003   Prob(JB):                        0.733
Kurtosis:                       1.778   Cond. No.                         1.90
==============================================================================
"""

Notice the mismatching standard errors: Pandas claims the standard errors are [0.9079, 1.0191] while statsmodels says [0.295, 0.333].
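
Both sets of numbers are also available programmatically, which makes the side-by-side comparison easy (a quick sketch; I'm assuming the legacy pandas accessor is std_err and the statsmodels one is bse):

print(pd_wls.std_err)  # x ~ 1.0191, intercept ~ 0.9079 (the pandas numbers)
print(sm_wls.bse)      # intercept ~ 0.295, b ~ 0.333 (the statsmodels numbers)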

Going back to the pandas code I linked at the top of the post, I tried to track down where the mismatch comes from.

First, you can see that the standard errors are reported by this function:

def _std_err_raw(self):
    """Returns the raw standard err values."""
    return np.sqrt(np.diag(self._var_beta_raw))

So looking at self._var_beta_raw I find:

def _var_beta_raw(self):
    """
    Returns the raw covariance of beta.
    """
    x = self._x.values
    y = self._y.values

    xx = np.dot(x.T, x)

    if self._nw_lags is None:
        return math.inv(xx) * (self._rmse_raw ** 2)
    else:
        resid = y - np.dot(x, self._beta_raw)
        m = (x.T * resid).T

        xeps = math.newey_west(m, self._nw_lags, self._nobs, self._df_raw,
                               self._nw_overlap)

        xx_inv = math.inv(xx)
        return np.dot(xx_inv, np.dot(xeps, xx_inv))

In my use case, self._nw_lags will always be None, so it's the first branch that's puzzling. Since xx is just the plain cross-product of the regressor matrix, x.T.dot(x), I'm wondering how the weights enter into it. The term self._rmse_raw comes directly from the statsmodels regression fitted in the OLS constructor, so that most definitely incorporates the weights.
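
To confirm, replicating that first branch by hand (the unweighted cross-product x.T.dot(x) combined with the weighted _rmse_raw) reproduces the pandas numbers; a sketch mirroring _var_beta_raw above:

X = df[['intercept', 'b']].values  # unweighted design matrix
rmse = pd_wls._rmse_raw            # weighted RMSE taken from the statsmodels fit
np.sqrt(np.diag(np.linalg.inv(X.T.dot(X)) * rmse ** 2))
# -> array([ 0.9079,  1.0191])  (intercept, b): pandas' reported standard errors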

This prompts these questions:

  1. Why is the standard error reported with the weights applied in the RMSE part, but not in the regressor cross-product?
  2. Is this standard practice if you want the "non-transformed" variables (wouldn't you then also want the non-transformed RMSE)? Is there a way to have pandas give back the fully weighted version of the standard error?
  3. Why all the misdirection? In the constructor, the full statsmodels fitted regression is computed. Why doesn't every summary statistic come straight from there? Why are they mixed and matched, so that some come from the statsmodels output and some from pandas' home-cooked calculations?

It looks like I can reconcile the two outputs by doing the following:

In [238]: xs = df[['intercept', 'b']]

In [239]: trans_xs = xs.values * np.sqrt(df.weights.values[:,None])

In [240]: trans_xs
Out[240]:
array([[ 3.26307961, -0.45116742],
       [ 3.12503809, -0.73173821],
       [ 3.08715494,  2.36918991],
       [ 3.08776136, -1.43092325],
       [ 2.87664425, -5.50382662],
       [ 3.21158019, -3.25278836],
       [ 3.38609639, -4.78219647],
       [ 2.92835309,  0.19774643],
       [ 2.97472796,  0.32996453],
       [ 3.1158155 , -1.87147934]])

In [241]: np.sqrt(np.diag(np.linalg.inv(trans_xs.T.dot(trans_xs)) * (pd_wls._rmse_raw ** 2)))
Out[241]: array([ 0.29525952,  0.33344823])
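
The reason this reconciliation works is that scaling the rows of X by the square roots of the weights turns the plain cross-product into the weighted one, i.e. (W^(1/2) X)' (W^(1/2) X) = X' W X. A quick check of that identity with the session variables above:

W = np.diag(df.weights.values)
np.allclose(trans_xs.T.dot(trans_xs), xs.values.T.dot(W).dot(xs.values))  # True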

I'm just very confused by this relationship. Is this common practice among statisticians: involving the weights in the RMSE part, but then choosing whether or not to weight the variables when calculating the standard error of the coefficients? If that's the case, why aren't the coefficients themselves also different between pandas and statsmodels, since those are similarly derived from variables that statsmodels transforms first?
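
For what it's worth, a minimal sketch of the textbook WLS formulas shows why the coefficients agree while the standard errors don't: both libraries solve the same weighted normal equations, beta = (X'WX)^(-1) X'Wy, and only the covariance step diverges. (Assumption: statsmodels' mse_resid equals pd_wls._rmse_raw ** 2, which the reconciliation above suggests.)

X = df[['intercept', 'b']].values
y = df.a.values
W = np.diag(df.weights.values)

# Weighted normal equations: identical point estimates in both libraries.
beta = np.linalg.solve(X.T.dot(W).dot(X), X.T.dot(W).dot(y))
# -> [0.5227, 0.5768], matching both summaries

# Textbook WLS covariance: sigma^2 * inv(X' W X).
se = np.sqrt(np.diag(sm_wls.mse_resid * np.linalg.inv(X.T.dot(W).dot(X))))
# -> [0.295, 0.333], the statsmodels standard errors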

For reference, here is the full data set used in my toy example (in case np.random.seed isn't sufficient to make it reproducible):

In [242]: df
Out[242]:
          a         b    weights  intercept
0  0.496714 -0.138264  10.647689          1
1  1.523030 -0.234153   9.765863          1
2  1.579213  0.767435   9.530526          1
3  0.542560 -0.463418   9.534270          1
4  0.241962 -1.913280   8.275082          1
5 -0.562288 -1.012831  10.314247          1
6 -0.908024 -1.412304  11.465649          1
7 -0.225776  0.067528   8.575252          1
8 -0.544383  0.110923   8.849006          1
9  0.375698 -0.600639   9.708306          1

Answer 1:


Not directly answering your question here, but, in general, you should prefer the statsmodels code to pandas for modeling. There were some recently discovered problems with WLS in statsmodels that are now fixed. AFAIK they were also fixed in pandas, but for the most part the pandas modeling code is unmaintained, and the medium-term goal is for everything available in pandas to be deprecated and moved to statsmodels (the next statsmodels release, 0.6.0, should do it).

To be a little clearer, pandas is now a dependency of statsmodels. You can pass DataFrames to statsmodels or use formulas in statsmodels. This is the intended relationship going forward.
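
For example, a statsmodels-only version of the fit above could use the formula interface (a sketch; smf.wls passes the weights through to WLS and adds the intercept automatically):

import statsmodels.formula.api as smf

sm_wls_f = smf.wls('a ~ b', data=df, weights=df.weights).fit()
print(sm_wls_f.bse)  # the fully weighted standard errors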



Source: https://stackoverflow.com/questions/17708643/unexpected-standard-errors-with-weighted-least-squares-in-python-pandas
