Detecting mulicollinear , or columns that have linear combinations while modelling in Python : LinAlgError

问题

I am modelling data for a logit model with 34 dependent variables,and it keep throwing in the singular matrix error , as below -:

Traceback (most recent call last):
  File "<pyshell#1116>", line 1, in <module>
    test_scores  = smf.Logit(m['event'], train_cols,missing='drop').fit()
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 1186, in fit
    disp=disp, callback=callback, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 164, in fit
    disp=disp, callback=callback, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 357, in fit
    hess=hess)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 405, in _fit_mle_newton
    newparams = oldparams - np.dot(np.linalg.inv(H),
  File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 445, in inv
    return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
  File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 328, in solve
    raise LinAlgError, 'Singular matrix'
LinAlgError: Singular matrix

Which was when I stumpled on this method to reduce the matrix to its independent columns

def independent_columns(A, tol = 0):#1e-05):
    """
    Return an array composed of independent columns of A.

    Note the answer may not be unique; this function returns one of many
    possible answers.

    https://stackoverflow.com/q/13312498/190597 (user1812712)
    http://math.stackexchange.com/a/199132/1140 (Gerry Myerson)
    http://mail.scipy.org/pipermail/numpy-discussion/2008-November/038705.html
        (Anne Archibald)

    >>> A = np.array([(2,4,1,3),(-1,-2,1,0),(0,0,2,2),(3,6,2,5)])
    2 4 1 3
    -1 -2 1 0
    0 0 2 2
    3 6 2 5
    # try with checking the rank of matrixs 
    >>> independent_columns(A)
    np.array([[1, 4],
              [2, 5],
              [3, 6]])
    """
    Q, R = linalg.qr(A)
    independent = np.where(np.abs(R.diagonal()) > tol)[0]
    #print independent
    return A[:, independent], independent


A,independent_col_indexes=independent_columns(train_cols.as_matrix(columns=None)) 
#train_cols will not be converted back from a df to a  matrix object,so doing this explicitly
A2=pd.DataFrame(A, columns=train_cols.columns[independent_col_indexes])

test_scores = smf.Logit(m['event'],A2,missing='drop').fit()

I still get the LinAlgError , though I was hoping I will have the reduced matrix rank now.

Also, I see np.linalg.matrix_rank(train_cols) returns 33 (ie. before calling on the independent_columns function total "x" columns was 34(ie, len(train_cols.ix[0])=34 ), meaning I don't have a full rank matrix), while np.linalg.matrix_rank(A2) returns 33 (meaning I have dropped a columns, and yet I still see the LinAlgError , when I run test_scores = smf.Logit(m['event'],A2,missing='drop').fit() , what am I missing ?

reference to the code above - How to find degenerate rows/columns in a covariance matrix

I tried to start building the model forward,by introducing each variable at a time, which doesn't give me the singular matrix error, but I would rather have a method that is deterministic, and lets me know, what am I doing wrong & how to eliminate these columns.

Edit (updated post the suggestions by @ user333700 below)

1. You are right, "A2" doesn't have the reduced rank of 33 . ie. len(A2.ix[0]) =34 -> meaning the possibly collinear columns are not dropped - should I increase the "tol", tolerance to get rank of A2 (and the numbers of columns thereof) , as 33. If I change the tol to "1e-05" above, then I do get len(A2.ix[0]) =33, which suggests to me that tol >0 (strictly) is one indicator. After this I just did the same, test_scores = smf.Logit(m['event'],A2,missing='drop').fit(), without nm to get the convergence.

2. Errors post trying 'nm' method. Strange thing though is that if I take just 20,000 rows, I do get the results. Since it is not showing up Memory error, but "Inverting hessian failed, no bse or cov_params available" - I am assuming, there are multiple nearly-similar records - what would you say ?

m  = smf.Logit(data['event_custom'].ix[0:1000000] , train_cols.ix[0:1000000],missing='drop')
test_scores=m.fit(start_params=None,method='nm',maxiter=200,full_output=1)
Warning: Maximum number of iterations has been exceeded

Warning (from warnings module):
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 374
    warn(warndoc, Warning)
Warning: Inverting hessian failed, no bse or cov_params available


test_scores.summary()

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    test_scores.summary()
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 2396, in summary
    yname_list)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 2253, in summary
    use_t=False)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/iolib/summary.py", line 826, in add_table_params
    use_t=use_t)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/iolib/summary.py", line 447, in summary_params
    std_err = results.bse
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/tools/decorators.py", line 95, in __get__
    _cachedval = self.fget(obj)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 1037, in bse
    return np.sqrt(np.diag(self.cov_params()))
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 1102, in cov_params
    raise ValueError('need covariance of parameters for computing '
ValueError: need covariance of parameters for computing (unnormalized) covariances

Edit 2: (updated post the suggestions by @user333700 below)

Reiterating what I am trying to model - less than about 1% of total users "convert" (success outcomes) - so I took a balanced sample of 35(+ve) /65 (-ve)

I suspect the model is not robust, though it converges. So, will use "start_params" as the params from earlier iteration, from a different dataset. This edit is about confirming is the "start_params" can feed into the results as below -:

A,independent_col_indexes=independent_columns(train_cols.as_matrix(columns=None))
A2=pd.DataFrame(A, columns=train_cols.columns[independent_col_indexes])
m  = smf.Logit(data['event_custom'], A2,missing='drop')
#m  = smf.Logit(data['event_custom'], train_cols,missing='drop')#,method='nm').fit()#This doesnt work, so tried 'nm' which work, but used lasso, as nm did not converge.
test_scores=m.fit_regularized(start_params=None, method='l1', maxiter='defined_by_method', full_output=1, disp=1, callback=None, alpha=0, \
trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)

a_good_looking_previous_result.params=test_scores.params #storing the parameters of pass1 to feed into pass2

test_scores.params
bidfloor_Quartile_modified_binned_0               0.305765
connectiontype_binned_0                          -0.436798
day_custom_binned_Fri                            -0.040269
day_custom_binned_Mon                             0.138599
day_custom_binned_Sat                            -0.319997
day_custom_binned_Sun                            -0.236507
day_custom_binned_Thu                            -0.058922
user_agent_device_family_binned_iPad            -10.793270
user_agent_device_family_binned_iPhone           -8.483099
user_agent_masterclass_binned_apple               9.038889
user_agent_masterclass_binned_generic            -0.760297
user_agent_masterclass_binned_samsung            -0.063522
log_height_width                                  0.593199
log_height_width_ScreenResolution                -0.520836
productivity                                     -1.495373
games                                             0.706340
entertainment                                    -1.806886
IAB24                                             2.531467
IAB17                                             0.650327
IAB14                                             0.414031
utilities                                         9.968253
IAB1                                              1.850786
social_networking                                -2.814148
IAB3                                             -9.230780
music                                             0.019584
IAB9                                             -0.415559
C(time_day_modified)[(6, 12]]:C(country)[AUS]    -0.103003
C(time_day_modified)[(0, 6]]:C(country)[HKG]      0.769272
C(time_day_modified)[(6, 12]]:C(country)[HKG]     0.406882
C(time_day_modified)[(0, 6]]:C(country)[IDN]      0.073306
C(time_day_modified)[(6, 12]]:C(country)[IDN]    -0.207568
C(time_day_modified)[(0, 6]]:C(country)[IND]      0.033370
... more params here

Now on a different dataset(pass2, for indexing), I model the same as below -: ie. I read a new dataframe, do all the variable transformation and then model via Logit as earlier .

m_pass2  = smf.Logit(data['event_custom'], A2_pass2,missing='drop')
test_scores_pass2=m_pass2.fit_regularized(start_params=a_good_looking_previous_result.params, method='l1', maxiter='defined_by_method', full_output=1, disp=1, callback=None, alpha=0, \
trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)

and, possibly keep iterating by picking up "start_params" from earlier passes.

回答1:

Several points to this:

You need tol > 0 to detect near perfect collinearity, which might also cause numerical problems in later calculations. Check the number of columns of A2 to see whether a column has really be dropped.

Logit needs to do some non-linear calculations with the exog, so even if the design matrix is not very close to perfect collinearity, the transformed variables for the log-likelihood, derivative or Hessian calculations might still end up being with numerical problems, like singular Hessian.

(All these are floating point problems when we work near floating point precision, 1e-15, 1e-16. There are sometimes differences in the default thresholds for matrix_rank and similar linalg functions which can imply that in some edge cases one function identifies it as singular and another one doesn't.)

The default optimization method for the discrete models including Logit is a simple Newton method, which is fast in reasonably nice cases, but can fail in cases that are badly conditioned. You could try one of the other optimizers which will be one of those in scipy.optimize, method='nm' is usually very robust but slow, method='bfgs' works well in many cases but also can run into convergence problems.

Nevertheless, even when one of the other optimization methods succeeds, it is still necessary to inspect the results. More often than not, a failure with one method means that the model or estimation problem might not be well defined.

A good way to check whether it is just a problem with bad starting values or a specification problem is to run method='nm' first and then run one of the more accurate methods like newton or bfgs using the nm estimate as starting value, and see whether it succeeds from good starting values.

来源：https://stackoverflow.com/questions/23848003/detecting-mulicollinear-or-columns-that-have-linear-combinations-while-modelli

标签

python-2.7

numpy

statsmodels

logistic-regression

singular