LogisticRegression performance analysis: learning curves
L2 regularization
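For reference, scikit-learn's L2-penalized logistic regression (with labels y_i ∈ {-1, +1}) minimizes

\min_{w,\,b}\;\; \frac{1}{2}\lVert w \rVert_2^2 \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\,(w^\top x_i + b)\right)\right)

so C is the inverse of the regularization strength: smaller C means stronger regularization and coefficients pushed closer to zero.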
# Let's analyze LogisticRegression in more detail on the Breast Cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, stratify=cancer.target, random_state=42)
logreg = LogisticRegression().fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))
# The default of C=1 gives quite good performance, around 95% accuracy on both the
# training and the test set. But because training and test performance are so close,
# the model is likely underfitting. Let's try increasing C to fit a more flexible
# model:
Training set score: 0.946
Test set score: 0.958
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:939: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
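The warning itself points at the two standard remedies: give the solver more iterations, or scale the data. A minimal sketch of both routes (the pipeline variable names are illustrative; scores are not shown here):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Route 1: standardize the 30 features so lbfgs converges quickly
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print("Scaled training set score: {:.3f}".format(pipe.score(X_train, y_train)))
print("Scaled test set score: {:.3f}".format(pipe.score(X_test, y_test)))

# Route 2: keep the raw features but allow more iterations
logreg_iter = LogisticRegression(max_iter=5000).fit(X_train, y_train)

Note that scaling changes the coefficient values, so the unscaled models are kept below to stay comparable with the coefficient plots.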
logreg100 = LogisticRegression(C=100).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg100.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg100.score(X_test, y_test)))
# With C=100 we get a slightly higher test set accuracy (the training set accuracy is
# unchanged here, likely because lbfgs again stopped before converging), consistent
# with the intuition that a more complex model should perform better.
Training set score: 0.946
Test set score: 0.965
(The same ConvergenceWarning as for the first fit is emitted here as well.)
# Let's also investigate what happens if we use a more strongly regularized model,
# setting C=0.01:
logreg001 = LogisticRegression(C=0.01).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg001.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg001.score(X_test, y_test)))
Training set score: 0.934
Test set score: 0.930
(Again the same ConvergenceWarning.)
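Since the title of this section mentions learning curves, it is worth sweeping C systematically instead of trying three values by hand. A minimal sketch of such a validation curve (C_values and the score lists are illustrative names):

import numpy as np
C_values = np.logspace(-3, 3, 13)
train_scores, test_scores = [], []
for C in C_values:
    # max_iter raised so lbfgs converges at every C
    model = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    train_scores.append(model.score(X_train, y_train))
    test_scores.append(model.score(X_test, y_test))
plt.semilogx(C_values, train_scores, label="training accuracy")
plt.semilogx(C_values, test_scores, label="test accuracy")
plt.xlabel("C")
plt.ylabel("Accuracy")
plt.legend()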
# Look at the coefficients learned by the models for three different values of the
# regularization parameter C:
plt.plot(logreg.coef_.T, 'o', label="C=1")
plt.plot(logreg100.coef_.T, '^', label="C=100")
plt.plot(logreg001.coef_.T, 'v', label="C=0.01")
plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
xlims = plt.xlim()
plt.hlines(0, xlims[0], xlims[1])
plt.xlim(xlims)
plt.ylim(-5, 5)
plt.xlabel("Feature")
plt.ylabel("Coefficient magnitude")
plt.legend()
# Because LogisticRegression applies L2 regularization by default, the result looks
# similar to that of Ridge in Figure 2-12. Stronger regularization pushes the
# coefficients closer to zero, though they never become exactly zero. Looking at the
# plot more closely, there is something interesting about the third coefficient,
# "mean perimeter": for C=100 and C=1 it is negative, while for C=0.01 it is
# positive, with a larger magnitude than for C=1. When interpreting a model like
# this, it is tempting to conclude that a coefficient tells us which class a feature
# is associated with. For example, one might think that a high "texture error"
# feature is related to a sample being malignant.
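The sign flip for "mean perimeter" can also be verified directly rather than read off the plot; a quick sketch using the three fitted models:

idx = list(cancer.feature_names).index("mean perimeter")
for name, model in [("C=100", logreg100), ("C=1", logreg), ("C=0.01", logreg001)]:
    # coef_ has shape (1, 30) for this binary problem
    print("{}: mean perimeter coefficient = {:.3f}".format(name, model.coef_[0, idx]))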
Examining the effect of L1 regularization
# If we want a more interpretable model, using L1 regularization might help, because
# it limits the model to using only a few features. Here is the coefficient plot and
# classification accuracies for L1 regularization:
for C, marker in zip([0.001, 1, 100], ['o', '^', 'v']):
    # choose the regularization type via penalty; the l1 penalty requires a
    # solver that supports it, such as liblinear
    lr_l1 = LogisticRegression(C=C, penalty="l1", solver="liblinear").fit(X_train, y_train)
    print("Training accuracy of l1 logreg with C={:.3f}: {:.2f}".format(
        C, lr_l1.score(X_train, y_train)))
    print("Test accuracy of l1 logreg with C={:.3f}: {:.2f}".format(
        C, lr_l1.score(X_test, y_test)))
    plt.plot(lr_l1.coef_.T, marker, label="C={:.3f}".format(C))
plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
xlims = plt.xlim()
plt.hlines(0, xlims[0], xlims[1])
plt.xlim(xlims)
plt.xlabel("Feature")
plt.ylabel("Coefficient magnitude")
plt.ylim(-5, 5)
plt.legend(loc=3)
Training accuracy of l1 logreg with C=0.001: 0.91
Test accuracy of l1 logreg with C=0.001: 0.92
Training accuracy of l1 logreg with C=1.000: 0.96
Test accuracy of l1 logreg with C=1.000: 0.96
Training accuracy of l1 logreg with C=100.000: 0.99
Test accuracy of l1 logreg with C=100.000: 0.98
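To confirm that L1 regularization really zeroes out features, we can count the nonzero coefficients for each C. A minimal sketch, refitting the same three models:

import numpy as np
for C in [0.001, 1, 100]:
    lr_l1 = LogisticRegression(C=C, penalty="l1", solver="liblinear").fit(X_train, y_train)
    # number of features with a nonzero coefficient, out of all 30
    print("C={:7.3f}: {} of {} features used".format(
        C, int(np.sum(lr_l1.coef_ != 0)), cancer.data.shape[1]))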
Source: CSDN
Author: 御剑归一
Link: https://blog.csdn.net/wj1298250240/article/details/103805082