Question
I'm able to use statsmodels' WLS (weighted least squares regression) fine when I have lots of data points. However, I seem to run into a problem with the numpy arrays when I try to use WLS on a single sample from the dataset.
What I mean is: if I have a dataset X, a 2D array with lots of rows, WLS works fine, but not if I try to run it on a single row. You'll see what I mean in the code below:
import sys
from sklearn.externals.six.moves import xrange
from sklearn.metrics import accuracy_score
import pylab as pl
from sklearn.externals.six.moves import zip
import numpy as np
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
# this is my dataset X, with 10 rows
X = np.array([[1,2,3],[1,2,3],[4,5,6],[1,2,3],[4,5,6],[1,2,3],[1,2,3],[4,5,6],[4,5,6],[1,2,3]])
# this is my response vector, y, also with 10 rows
y = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])
# weights, 10 rows
weights = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
# the line below, using all 10 rows of X, gives no errors but is commented out
# mod_wls = sm.WLS(y, X, weights)
# and this is the line I need, which is giving errors:
mod_wls = sm.WLS(np.array(y[0]), np.array([X[0]]),np.array([weights[0]]))
The last line above was initially just mod_wls = sm.WLS(y[0], X[0], weights[0]), but that gave me errors like object of type 'numpy.float64' has no len(), so I wrapped the arguments in arrays.
But now I keep getting this error:
Traceback (most recent call last):
  File "C:\Users\app\Documents\Python Scripts\test.py", line 53, in <module>
    mod_wls = sm.WLS(np.array(y[0]), np.array([X[0]]),np.array([weights[0]]))
  File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\regression\linear_model.py", line 383, in __init__
    weights=weights, hasconst=hasconst)
  File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\regression\linear_model.py", line 79, in __init__
    super(RegressionModel, self).__init__(endog, exog, **kwargs)
  File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\model.py", line 136, in __init__
    super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
  File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\model.py", line 52, in __init__
    self.data = handle_data(endog, exog, missing, hasconst, **kwargs)
  File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\data.py", line 401, in handle_data
    return klass(endog, exog=exog, missing=missing, hasconst=hasconst, **kwargs)
  File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\data.py", line 78, in __init__
    self._check_integrity()
  File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\data.py", line 249, in _check_integrity
    print len(self.endog)
TypeError: len() of unsized object
So in order to see what was wrong with the lengths, I did this:
print "y size: "
print len(np.array([y[0]]))
print "X size"
print len (np.array([X[0]]))
print "weights size"
print len(np.array([weights[0]]))
And got this output:
y size:
1
X size
1
weights size
1
I then tried this:
print "x shape"
print X[0].shape
print "y shape"
print y[0].shape
And the output was:
x shape
(3L,)
y shape
()
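For what it's worth, these shapes seem to come down to 0-d versus 1-d arrays. A minimal numpy-only sketch (reusing the y and weights defined above) reproduces both error messages:
weights[0].shape          # () -- a 0-d numpy scalar
np.array(y[0]).shape      # () -- np.array() around a scalar is still 0-d
np.array([y[0]]).shape    # (1,) -- wrapping in a list adds an axis, so len() is 1
y[[0]].shape              # (1,) -- indexing with a list keeps the axis as well
# len(weights[0])         -> TypeError: object of type 'numpy.float64' has no len()
# len(np.array(y[0]))     -> TypeError: len() of unsized object
# len(np.array([y[0]]))   -> 1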
Line 249 in data.py, which the traceback points to, is in this function, where I added a few print statements for the sizes to see what was happening:
def _check_integrity(self):
    if self.exog is not None:
        print "exog size: "
        print len(self.exog)
        print "endog size"
        print len(self.endog)  # <-- this, and the comparison below, are causing the error
        if len(self.exog) != len(self.endog):
            raise ValueError("endog and exog matrices are different sizes")
It appears there's something wrong with len(self.endog). Yet when I printed len(np.array([y[0]])), it simply gave 1. Somehow, when y goes into _check_integrity and becomes endog, it doesn't behave the same way... or is something else going on?
What should I do? I'm using an algorithm where I really do need to run WLS on each row of X separately.
Answer 1:
There's no such thing as WLS for one observation: the single weight simply becomes 1 once the weights are normalized to sum to 1. If you want to do this anyway (though I suspect you don't), just use OLS. The solution will be a consequence of the SVD, though, not of any actual relationship in the data.
OLS solution using pinv/svd
np.dot(np.linalg.pinv(X[[0]]), y[0])
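(For this X[0] = [1, 2, 3] and y[0] = 1, that works out to X[0] / 14, roughly [0.071, 0.143, 0.214]: the minimum-norm coefficients whose dot product with X[0] reproduces y[0].)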
Though you could just make up any answer that works and get the same result. I'm not sure offhand what exactly the properties of the SVD solution are vs. the other non-unique solutions.
In [26]: beta = [-.5, .25, 1/3.]

In [27]: np.dot(beta, X[0])
Out[27]: 1.0
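If each row still has to go through statsmodels so the rest of your code can stay unchanged, one way that at least gets past the length check is to index with a list, so endog stays 1-D and exog stays 2-D. A minimal sketch under that assumption, reusing the X, y and weights from the question (only params is meaningful here; with one observation the fit is exact and degenerate):
# y[[0]] has shape (1,) and X[[0]] has shape (1, 3), so _check_integrity passes
res = sm.OLS(y[[0]], X[[0]]).fit()
print(res.params)     # should match np.dot(np.linalg.pinv(X[[0]]), y[0]) above

# a single weight just rescales both sides, so WLS should give the same coefficients
res_w = sm.WLS(y[[0]], X[[0]], weights=weights[[0]]).fit()
print(res_w.params)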
Source: https://stackoverflow.com/questions/23345715/python-error-len-of-unsized-object-while-using-statsmodels-with-one-row-of-da