Question
I'm doing logistic regression in Python, following this example from Wikipedia (link to example).
Here's the code I have:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
Z = [[0.5], [0.75], [1.0], [1.25], [1.5], [1.75], [1.75], [2.0], [2.25], [2.5], [2.75], [3.0], [3.25], [3.5], [4.0], [4.25], [4.5], [4.75], [5.0], [5.5]] # number of hours spent studying
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1] # 0=failed, 1=pass
lr.fit(Z,y)
The results for this are:
lr.coef_
array([[ 0.61126347]])
lr.intercept_
array([-1.36550178])
while the Wikipedia article gets 1.5046 for the hours coefficient and -4.0777 for the intercept. Why are the results so different? Their predicted probability of passing after 1 hour of study is 0.07, while I get 0.32 with this model; these are drastically different results.
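For reference, here is roughly how I am comparing those probabilities; a minimal sketch that just applies the logistic function by hand to the two sets of numbers quoted above:
import numpy as np

def pass_probability(hours, coef, intercept):
    # logistic function: p = 1 / (1 + exp(-(coef * hours + intercept)))
    return 1.0 / (1.0 + np.exp(-(coef * hours + intercept)))

print(pass_probability(1.0, 0.61126347, -1.36550178))  # my fit, roughly 0.32
print(pass_probability(1.0, 1.5046, -4.0777))          # Wikipedia's fit, roughly 0.07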
Answer 1:
The "problem" is that LogisticRegression in scikit-learn uses L2-regularization (aka Tikhonov regularization, aka Ridge, aka normal prior). Please read sklearn user guide about logistic regression for implementational details.
In practice, it means that LogisticRegression has a parameter C, which by default equals 1. The smaller C, the more regularization there is: coef_ grows smaller and intercept_ grows larger, which increases numerical stability and reduces overfitting.
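You can see this effect on your own data by refitting with a few different values of C; a short sketch, assuming Z and y are the lists from the question:
from sklearn.linear_model import LogisticRegression

# refit the same data with different regularization strengths
for C in (0.01, 1.0, 100.0, 1e8):
    lr = LogisticRegression(C=C)
    lr.fit(Z, y)
    print(C, lr.coef_[0][0], lr.intercept_[0])
# smaller C pulls coef_ toward 0; larger C approaches the unregularized fit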
If you set C very large, the effect of regularization will vanish. With
lr = LogisticRegression(C=100500000)
you get coef_ and intercept_ of, respectively,
[[ 1.50464535]]
[-4.07771322]
just like in the Wikipedia article.
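Put together, a complete snippet along these lines (again assuming Z and y from the question) also reproduces the article's prediction for 1 hour of study via predict_proba:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=100500000)    # C large enough to be effectively unregularized
lr.fit(Z, y)                            # Z, y as defined in the question
print(lr.coef_, lr.intercept_)          # approximately [[1.5046]] and [-4.0777]
print(lr.predict_proba([[1.0]])[0][1])  # pass probability for 1 hour, roughly 0.07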
Some more theory. Overfitting is a problem that arises when there are many features but not many examples. A simple rule of thumb: use a small C if n_obs/n_features is less than 10. In the wiki example, there is one feature and 20 observations, so simple logistic regression would not overfit even with a large C.
Another use case for a small C is convergence problems. They may happen if positive and negative examples can be perfectly separated, or in the case of multicollinearity (which again is more likely when n_obs/n_features is small), and they lead to infinite growth of the coefficients in the non-regularized case.
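A small illustration of that last point, using a made-up, perfectly separable toy dataset (not the data from the question): the fitted coefficient keeps growing as C increases, since the unregularized likelihood is only maximized in the limit.
from sklearn.linear_model import LogisticRegression

# toy, perfectly separable data: every x below 3 fails, every x of 3 or more passes
X_sep = [[1.0], [2.0], [2.5], [3.0], [3.5], [4.0]]
y_sep = [0, 0, 0, 1, 1, 1]

for C in (1.0, 100.0, 10000.0):
    clf = LogisticRegression(C=C, max_iter=10000).fit(X_sep, y_sep)
    print(C, clf.coef_[0][0])  # the coefficient grows as regularization weakens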
Answer 2:
I think the problem arises from the fact that you have
Z = [[0.5], [0.75], [1.0], [1.25], [1.5], [1.75], [1.75], [2.0], [2.25], [2.5], [2.75], [3.0], [3.25], [3.5], [4.0], [4.25], [4.5], [4.75], [5.0], [5.5]]
but instead it should be
Z = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25 ...]
Try this
Source: https://stackoverflow.com/questions/47248587/confusing-results-with-logistic-regression-in-python