问题
I am modelling data for a logit model with 34 dependent variables,and it keep throwing in the singular matrix error , as below -:
Traceback (most recent call last):
File "<pyshell#1116>", line 1, in <module>
test_scores = smf.Logit(m['event'], train_cols,missing='drop').fit()
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 1186, in fit
disp=disp, callback=callback, **kwargs)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 164, in fit
disp=disp, callback=callback, **kwargs)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 357, in fit
hess=hess)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 405, in _fit_mle_newton
newparams = oldparams - np.dot(np.linalg.inv(H),
File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 445, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 328, in solve
raise LinAlgError, 'Singular matrix'
LinAlgError: Singular matrix
Which was when I stumpled on this method to reduce the matrix to its independent columns
def independent_columns(A, tol = 0):#1e-05):
"""
Return an array composed of independent columns of A.
Note the answer may not be unique; this function returns one of many
possible answers.
https://stackoverflow.com/q/13312498/190597 (user1812712)
http://math.stackexchange.com/a/199132/1140 (Gerry Myerson)
http://mail.scipy.org/pipermail/numpy-discussion/2008-November/038705.html
(Anne Archibald)
>>> A = np.array([(2,4,1,3),(-1,-2,1,0),(0,0,2,2),(3,6,2,5)])
2 4 1 3
-1 -2 1 0
0 0 2 2
3 6 2 5
# try with checking the rank of matrixs
>>> independent_columns(A)
np.array([[1, 4],
[2, 5],
[3, 6]])
"""
Q, R = linalg.qr(A)
independent = np.where(np.abs(R.diagonal()) > tol)[0]
#print independent
return A[:, independent], independent
A,independent_col_indexes=independent_columns(train_cols.as_matrix(columns=None))
#train_cols will not be converted back from a df to a matrix object,so doing this explicitly
A2=pd.DataFrame(A, columns=train_cols.columns[independent_col_indexes])
test_scores = smf.Logit(m['event'],A2,missing='drop').fit()
I still get the LinAlgError , though I was hoping I will have the reduced matrix rank now.
Also, I see np.linalg.matrix_rank(train_cols)
returns 33 (ie. before calling on the independent_columns function total "x" columns was 34(ie, len(train_cols.ix[0])=34
), meaning I don't have a full rank matrix), while np.linalg.matrix_rank(A2)
returns 33 (meaning I have dropped a columns, and yet I still see the LinAlgError , when I run test_scores = smf.Logit(m['event'],A2,missing='drop').fit()
, what am I missing ?
reference to the code above - How to find degenerate rows/columns in a covariance matrix
I tried to start building the model forward,by introducing each variable at a time, which doesn't give me the singular matrix error, but I would rather have a method that is deterministic, and lets me know, what am I doing wrong & how to eliminate these columns.
Edit (updated post the suggestions by @ user333700 below)
1. You are right, "A2" doesn't have the reduced rank of 33 . ie. len(A2.ix[0]) =34
-> meaning the possibly collinear columns are not dropped - should I increase the "tol", tolerance to get rank of A2 (and the numbers of columns thereof) , as 33. If I change the tol to "1e-05" above, then I do get len(A2.ix[0]) =33
, which suggests to me that tol >0 (strictly) is one indicator.
After this I just did the same, test_scores = smf.Logit(m['event'],A2,missing='drop').fit()
, without nm to get the convergence.
2. Errors post trying 'nm' method. Strange thing though is that if I take just 20,000 rows, I do get the results. Since it is not showing up Memory error, but "Inverting hessian failed, no bse or cov_params available
" - I am assuming, there are multiple nearly-similar records - what would you say ?
m = smf.Logit(data['event_custom'].ix[0:1000000] , train_cols.ix[0:1000000],missing='drop')
test_scores=m.fit(start_params=None,method='nm',maxiter=200,full_output=1)
Warning: Maximum number of iterations has been exceeded
Warning (from warnings module):
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 374
warn(warndoc, Warning)
Warning: Inverting hessian failed, no bse or cov_params available
test_scores.summary()
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
test_scores.summary()
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 2396, in summary
yname_list)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 2253, in summary
use_t=False)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/iolib/summary.py", line 826, in add_table_params
use_t=use_t)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/iolib/summary.py", line 447, in summary_params
std_err = results.bse
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/tools/decorators.py", line 95, in __get__
_cachedval = self.fget(obj)
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 1037, in bse
return np.sqrt(np.diag(self.cov_params()))
File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 1102, in cov_params
raise ValueError('need covariance of parameters for computing '
ValueError: need covariance of parameters for computing (unnormalized) covariances
Edit 2: (updated post the suggestions by @user333700 below)
Reiterating what I am trying to model - less than about 1% of total users "convert" (success outcomes) - so I took a balanced sample of 35(+ve) /65 (-ve)
I suspect the model is not robust, though it converges. So, will use "start_params" as the params from earlier iteration, from a different dataset. This edit is about confirming is the "start_params" can feed into the results as below -:
A,independent_col_indexes=independent_columns(train_cols.as_matrix(columns=None))
A2=pd.DataFrame(A, columns=train_cols.columns[independent_col_indexes])
m = smf.Logit(data['event_custom'], A2,missing='drop')
#m = smf.Logit(data['event_custom'], train_cols,missing='drop')#,method='nm').fit()#This doesnt work, so tried 'nm' which work, but used lasso, as nm did not converge.
test_scores=m.fit_regularized(start_params=None, method='l1', maxiter='defined_by_method', full_output=1, disp=1, callback=None, alpha=0, \
trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)
a_good_looking_previous_result.params=test_scores.params #storing the parameters of pass1 to feed into pass2
test_scores.params
bidfloor_Quartile_modified_binned_0 0.305765
connectiontype_binned_0 -0.436798
day_custom_binned_Fri -0.040269
day_custom_binned_Mon 0.138599
day_custom_binned_Sat -0.319997
day_custom_binned_Sun -0.236507
day_custom_binned_Thu -0.058922
user_agent_device_family_binned_iPad -10.793270
user_agent_device_family_binned_iPhone -8.483099
user_agent_masterclass_binned_apple 9.038889
user_agent_masterclass_binned_generic -0.760297
user_agent_masterclass_binned_samsung -0.063522
log_height_width 0.593199
log_height_width_ScreenResolution -0.520836
productivity -1.495373
games 0.706340
entertainment -1.806886
IAB24 2.531467
IAB17 0.650327
IAB14 0.414031
utilities 9.968253
IAB1 1.850786
social_networking -2.814148
IAB3 -9.230780
music 0.019584
IAB9 -0.415559
C(time_day_modified)[(6, 12]]:C(country)[AUS] -0.103003
C(time_day_modified)[(0, 6]]:C(country)[HKG] 0.769272
C(time_day_modified)[(6, 12]]:C(country)[HKG] 0.406882
C(time_day_modified)[(0, 6]]:C(country)[IDN] 0.073306
C(time_day_modified)[(6, 12]]:C(country)[IDN] -0.207568
C(time_day_modified)[(0, 6]]:C(country)[IND] 0.033370
... more params here
Now on a different dataset(pass2, for indexing), I model the same as below -: ie. I read a new dataframe, do all the variable transformation and then model via Logit as earlier .
m_pass2 = smf.Logit(data['event_custom'], A2_pass2,missing='drop')
test_scores_pass2=m_pass2.fit_regularized(start_params=a_good_looking_previous_result.params, method='l1', maxiter='defined_by_method', full_output=1, disp=1, callback=None, alpha=0, \
trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)
and, possibly keep iterating by picking up "start_params" from earlier passes.
回答1:
Several points to this:
You need tol > 0 to detect near perfect collinearity, which might also cause numerical problems in later calculations.
Check the number of columns of A2
to see whether a column has really be dropped.
Logit needs to do some non-linear calculations with the exog, so even if the design matrix is not very close to perfect collinearity, the transformed variables for the log-likelihood, derivative or Hessian calculations might still end up being with numerical problems, like singular Hessian.
(All these are floating point problems when we work near floating point precision, 1e-15, 1e-16. There are sometimes differences in the default thresholds for matrix_rank and similar linalg functions which can imply that in some edge cases one function identifies it as singular and another one doesn't.)
The default optimization method for the discrete models including Logit is a simple Newton method, which is fast in reasonably nice cases, but can fail in cases that are badly conditioned. You could try one of the other optimizers which will be one of those in scipy.optimize, method='nm'
is usually very robust but slow, method='bfgs'
works well in many cases but also can run into convergence problems.
Nevertheless, even when one of the other optimization methods succeeds, it is still necessary to inspect the results. More often than not, a failure with one method means that the model or estimation problem might not be well defined.
A good way to check whether it is just a problem with bad starting values or a specification problem is to run method='nm'
first and then run one of the more accurate methods like newton
or bfgs
using the nm
estimate as starting value, and see whether it succeeds from good starting values.
来源:https://stackoverflow.com/questions/23848003/detecting-mulicollinear-or-columns-that-have-linear-combinations-while-modelli