LogisticRegression (Logistic Regression) Performance Analysis: Learning Curves

Submitted by 岁酱吖の on 2020-01-14 02:32:42


L2 regularization

# A detailed analysis of LogisticRegression on the breast cancer dataset
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)
logreg = LogisticRegression().fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))

# The default C=1 gives quite good performance, with about 95% accuracy on both
# the training and the test set. But since training and test performance are so
# close, the model is likely underfitting. Let's try increasing C to fit a more
# flexible model:
Training set score: 0.946
Test set score: 0.958
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:939: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
logreg100 = LogisticRegression(C=100).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg100.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg100.score(X_test, y_test)))

# Using C=100 gives a somewhat higher test set accuracy, confirming our
# intuition that a more complex model should perform better. (In this run the
# training score is unchanged at 0.946, likely because lbfgs stopped at the
# iteration limit, as the ConvergenceWarning indicates.)
Training set score: 0.946
Test set score: 0.965
(same ConvergenceWarning as above)
# We can also investigate what happens if we use an even more strongly
# regularized model, by setting C=0.01:
logreg001 = LogisticRegression(C=0.01).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg001.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg001.score(X_test, y_test)))
Training set score: 0.934
Test set score: 0.930
(same ConvergenceWarning as above)
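# The ConvergenceWarning printed above suggests increasing max_iter or scaling
# the data. As a minimal sketch of the second option (scaled_logreg is an
# illustrative name, not from the original code), standardizing the features in
# a Pipeline typically lets lbfgs converge on this dataset:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logreg.fit(X_train, y_train)
print("Scaled training set score: {:.3f}".format(scaled_logreg.score(X_train, y_train)))
print("Scaled test set score: {:.3f}".format(scaled_logreg.score(X_test, y_test)))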
# Finally, let's look at the coefficients learned by the models for the three
# different settings of the regularization parameter C:

plt.plot(logreg.coef_.T, 'o', label="C=1")
plt.plot(logreg100.coef_.T, '^', label="C=100")
plt.plot(logreg001.coef_.T, 'v', label="C=0.01")
plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
xlims = plt.xlim()
plt.hlines(0, xlims[0], xlims[1])
plt.xlim(xlims)
plt.ylim(-5, 5)
plt.xlabel("Feature")
plt.ylabel("Coefficient magnitude")
plt.legend()

# Because LogisticRegression applies an L2 penalty by default, the result looks
# similar to that of Ridge in Figure 2-12: stronger regularization pushes the
# coefficients closer to zero, though they never become exactly zero. Looking
# at the plot more closely, there is an interesting effect in the third
# coefficient, "mean perimeter". For C=100 and C=1 this coefficient is
# negative, while for C=0.01 it is positive, with an absolute value even larger
# than at C=1. When interpreting a model like this, one might think the
# coefficients tell us which class a feature is associated with. For example,
# one might think that a high "texture error" feature is associated with
# "malignant" samples.

[Figure: coefficient magnitudes learned by logistic regression on the breast cancer dataset for C=100, C=1, and C=0.01]
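# As a quick sanity check of the sign flip described above, here is a minimal
# sketch that prints the "mean perimeter" coefficient of each fitted model
# (coef_ has shape (1, n_features) for binary problems):
import numpy as np

idx = list(cancer.feature_names).index("mean perimeter")
for label, model in [("C=100", logreg100), ("C=1", logreg), ("C=0.01", logreg001)]:
    print("{}: mean perimeter coefficient = {:.3f}".format(label, model.coef_[0, idx]))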

Examining the effect of L1 regularization

# If we desire a more interpretable model, using L1 regularization might help,
# since it limits the model to using only a few features. Here are the
# coefficient plot and the classification accuracies for L1 regularization:
for C, marker in zip([0.001, 1, 100], ['o', '^', 'v']):
    # penalty selects the regularization type; solver="liblinear" is needed in
    # newer scikit-learn, since the default lbfgs solver does not support "l1"
    lr_l1 = LogisticRegression(C=C, penalty="l1", solver="liblinear").fit(X_train, y_train)
    print("Training accuracy of l1 logreg with C={:.3f}: {:.2f}".format(
          C, lr_l1.score(X_train, y_train)))
    print("Test accuracy of l1 logreg with C={:.3f}: {:.2f}".format(
          C, lr_l1.score(X_test, y_test)))
    plt.plot(lr_l1.coef_.T, marker, label="C={:.3f}".format(C))
plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
xlims = plt.xlim()
plt.hlines(0, xlims[0], xlims[1])
plt.xlim(xlims)
plt.xlabel("Feature")
plt.ylabel("Coefficient magnitude")
plt.ylim(-5, 5)
plt.legend(loc=3)
Training accuracy of l1 logreg with C=0.001: 0.91
Test accuracy of l1 logreg with C=0.001: 0.92
Training accuracy of l1 logreg with C=1.000: 0.96
Test accuracy of l1 logreg with C=1.000: 0.96
Training accuracy of l1 logreg with C=100.000: 0.99
Test accuracy of l1 logreg with C=100.000: 0.98
<matplotlib.legend.Legend at 0x7f072edbdd68>

[Figure: coefficients learned by logistic regression with L1 penalty on the breast cancer dataset for C=0.001, C=1, and C=100]
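# Since the point of L1 regularization is sparsity, a minimal sketch (assuming
# the same liblinear-based models as above, refit here for each C) can count
# how many features each model actually uses by checking for nonzero
# coefficients:
import numpy as np

for C in [0.001, 1, 100]:
    lr_l1 = LogisticRegression(C=C, penalty="l1", solver="liblinear").fit(X_train, y_train)
    n_used = np.sum(lr_l1.coef_ != 0)
    print("C={:.3f}: {} of {} features have nonzero coefficients".format(
          C, n_used, cancer.data.shape[1]))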
