K-fold cross-validation:
sklearn.model_selection.KFold(n_splits=3, shuffle=False, random_state=None)
Idea: the dataset is partitioned into n_splits mutually exclusive subsets (folds). In each round, one fold serves as the validation set and the remaining n_splits-1 folds form the training set; training and evaluation are repeated n_splits times, yielding n_splits results.
Note: when the dataset cannot be divided evenly, each of the first n_samples % n_splits folds contains n_samples // n_splits + 1 samples, and each remaining fold contains n_samples // n_splits samples.
Parameters:
n_splits: the number of folds
shuffle: whether to shuffle the samples before splitting
①If False, the splits are deterministic: every run produces the same folds (random_state is then ignored)
②If True, the samples are shuffled before splitting, so the folds differ between runs unless random_state is fixed to an integer
random_state: the random seed (only takes effect when shuffle=True)
Methods:
①get_n_splits(X=None, y=None, groups=None): returns the value of n_splits
②split(X, y=None, groups=None): splits the data into training and test sets, returning a generator of index arrays
Using an example that cannot be divided evenly, we set different parameter values and observe the results:
①With shuffle=False, two runs produce identical splits
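The example code itself is not shown in the post; the following is a minimal sketch of what it likely looked like, using 7 samples and n_splits=3 (so 7 % 3 = 1 fold gets 7 // 3 + 1 = 3 samples and the rest get 2):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(14).reshape(7, 2)  # 7 samples, cannot be divided evenly into 3 folds

kf = KFold(n_splits=3, shuffle=False)
for train_idx, test_idx in kf.split(X):
    print("train:", train_idx, "test:", test_idx)
# With shuffle=False the split is deterministic: running this loop twice
# prints identical index pairs, and the first fold holds 3 samples.

# With shuffle=True and random_state=None, each run reshuffles the samples,
# so the folds differ between runs; fixing random_state makes them repeatable.
kf_shuffled = KFold(n_splits=3, shuffle=True, random_state=1)
```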
# lgb and xgb_model are assumed to be defined earlier in the post
# (a LightGBM base model and an XGBoost model used as the meta-learner)
from mlxtend.classifier import StackingClassifier
sclf = StackingClassifier(classifiers=[lgb], meta_classifier=xgb_model)
sclf.fit(train, target)
test_predict = sclf.predict(test)
from sklearn.metrics import r2_score

def online_score(pred):
    print("prediction max: {}, prediction min: {}".format(pred.max(), pred.min()))
    # score against a previous (A-board) submission, not the ground truth
    conmbine1 = pd.read_csv(r'C:\Users\lxc\Desktop\featurecup\sub_b_919.csv', engine="python", header=None)
    score1 = r2_score(pred, conmbine1)
    print("R2 vs the 919 submission: {}".format(score1))
    return score1

score = online_score(test_predict)
prediction max: 19051.067151217972, prediction min: 1199.97082591554
R2 vs the 919 submission: 0.981891385946527
Stacking
#!pip install mlxtend
import warnings
warnings.filterwarnings('ignore')
import itertools
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from mlxtend.plotting import plot_learning_curves
from mlxtend.plotting import plot_decision_regions
# Example using the iris dataset that ships with scikit-learn
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          meta_classifier=lr)

labels = ['KNN', 'Random Forest', 'Naive Bayes', 'Stacking Classifier']
clf_list = [clf1, clf2, clf3, sclf]

fig = plt.figure(figsize=(10, 8))
gs = gridspec.GridSpec(2, 2)
grid = itertools.product([0, 1], repeat=2)

clf_cv_mean = []
clf_cv_std = []
for clf, label, grd in zip(clf_list, labels, grid):
    scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: %.2f (+/- %.2f) [%s]" % (scores.mean(), scores.std(), label))
    clf_cv_mean.append(scores.mean())
    clf_cv_std.append(scores.std())
    # refit on the full data to draw the decision regions
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf)
    plt.title(label)
plt.show()
Blending
Blending is an ensembling approach distinct from bagging and boosting.
Given a set of already-trained weak learners, Blending is about how to combine their predictions into a single, better prediction; unlike Stacking's cross-validation, it uses a simple holdout split of the training data.
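Before the full function, the holdout idea can be shown with a minimal, self-contained sketch (model choices and variable names here are illustrative, not from the original post):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# split the training data in half: d1 trains the base models,
# d2 receives their predictions as new features for the meta-model
X_d1, X_d2, y_d1, y_d2 = train_test_split(X_train, y_train, test_size=0.5, random_state=0)

base_models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=4, random_state=0)]
d2_feats = np.zeros((X_d2.shape[0], len(base_models)))
test_feats = np.zeros((X_test.shape[0], len(base_models)))
for j, model in enumerate(base_models):
    model.fit(X_d1, y_d1)
    d2_feats[:, j] = model.predict(X_d2)       # new features for the meta-model
    test_feats[:, j] = model.predict(X_test)   # same transform for the test set

# the meta-model learns to combine the base predictions
meta = LinearRegression().fit(d2_feats, y_d2)
print("blended R^2:", meta.score(test_feats, y_test))
```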
def blend(train, test, target):
    '''5-fold alternative, kept from the original but unused'''
    # n_folds = 5
    # skf = list(StratifiedKFold(y, n_folds=n_folds))
    '''split the training data into two halves, d1 and d2'''
    X_d1, X_d2, y_d1, y_d2 = train_test_split(train, target, test_size=0.5, random_state=914)
    # three derived features per base model; clfs is assumed to be a
    # module-level list of models defined elsewhere in the post
    train_ = np.zeros((X_d2.shape[0], len(clfs) * 3))
    test_ = np.zeros((test.shape[0], len(clfs) * 3))
    for j, clf in enumerate(clfs):
        '''train each base model in turn'''
        '''train on the first half (d1); its predictions on the second half (d2)
        become new features for d2'''
        X_d1fillna = X_d1.fillna(0)
        X_d2fillna = X_d2.fillna(0)
        X_predictfillna = test.fillna(0)
        clf.fit(X_d1fillna, y_d1)
        y_submission = clf.predict(X_d2fillna)
        y_test_submission = clf.predict(X_predictfillna)
        # feature 1: squared prediction
        train_[:, j * 3] = y_submission * y_submission
        '''for the test set, the models' predictions are used directly as new features'''
        test_[:, j * 3] = y_test_submission * y_test_submission
        # feature 2: min-max scaled prediction
        # (the original indexed these columns as j+1 and j+2, which overlaps
        # across models; j*3+1 and j*3+2 are used here instead)
        train_[:, j * 3 + 1] = (y_submission - y_submission.min()) / \
                               (y_submission.max() - y_submission.min())
        test_scaled = (y_test_submission - y_test_submission.min()) / \
                      (y_test_submission.max() - y_test_submission.min())
        test_[:, j * 3 + 1] = test_scaled
        # feature 3: log of the prediction, assuming strictly positive predictions
        # (the original took np.log of the min-max scaled test values, which
        # yields -inf at the minimum; the raw predictions are used here)
        train_[:, j * 3 + 2] = np.log(y_submission)
        test_[:, j * 3 + 2] = np.log(y_test_submission)
        # print("val auc Score: %f" % r2_score(y_predict, dataset_d2[:, j]))
        print('finished model', j)
    # np.zeros returns ndarrays, which have no to_csv; wrap in DataFrames first
    pd.DataFrame(train_).to_csv('./input/train_blending.csv', index=False)
    pd.DataFrame(test_).to_csv('./input/test_blending.csv', index=False)
Source: CSDN
Author: weixin_43559291
Link: https://blog.csdn.net/weixin_43559291/article/details/104031948