A Guide to Classifying the Iris Dataset -- A Study of Logistic (Log-Odds) Regression (2)

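In Classifying Iris with Logistic Regression (1), we looked at the features in the iris dataset and tried a simple classification with Logistic Regression on the iris dataset that ships with scikit-learn. This follow-up shows how to turn the data in a raw file into numpy data that machine learning algorithms can use. For beginners and experienced analysts alike, this is often the hardest part of working with real-world raw data. Even the cleverest cook cannot make a meal without rice: without data in a form that can be learned from, no amount of skill with machine learning algorithms will carry a data analysis project.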

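So this time we walk through a roughly complete data processing and analysis workflow on a balanced sample. In addition, this notebook builds the model with an interesting class in sklearn, Pipeline, which wraps and manages the workflow as a stream of steps (streaming workflows with pipelines).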

The scikit-learn documentation describes Pipeline as a pipeline of transforms with a final estimator.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Note: in plotly 4 and later, plotly.plotly lives in the separate chart_studio package.
import plotly.plotly as py
import plotly.graph_objs as go

from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected = True)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

1. Loading the iris dataset

iris_path = '/home/kesci/input/iris/iris.csv'
iris = pd.read_csv(iris_path)

iris.head()

Out[3]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

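1.1. From CSV file data to Numpy data

Step 1: Define the mapping function iris_type. In real data, labels are often strings rather than numbers that a classifier can learn from directly.
Step 2: For text labels, replace every value in the label column with the output of the mapping function.
Step 3: Convert the DataFrame into a numpy matrix.
Step 4: Split the data into a training set and a test set.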

# S1:
# Mapping function iris_type: maps a string label to a numeric label
# s: the species name
def iris_type(s):
    class_label = {'setosa':0, 'versicolor':1, 'virginica':2}
    return class_label[s]

# S2: pass the column at index 4 (Species) through iris_type while parsing
pd.read_csv(iris_path, converters = {4:iris_type}).head()

Out[5]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	5.1	3.5	1.4	0.2	0
1	4.9	3.0	1.4	0.2	0
2	4.7	3.2	1.3	0.2	0
3	4.6	3.1	1.5	0.2	0
4	5.0	3.6	1.4	0.2	0
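Steps 1 and 2 can also be written without a separate parsing pass. The sketch below is only an illustration under stated assumptions: it reads a three-row in-memory sample (the `csv_text` string is a stand-in for iris.csv, with the same column layout) and compares the notebook's `converters` approach with mapping the column by name after parsing.

```python
import io
import pandas as pd

# Assumption: a tiny in-memory sample with the same column layout as iris.csv.
csv_text = """Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

class_label = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

# Option A: convert while parsing, as in the notebook (column index 4 is Species).
df_a = pd.read_csv(io.StringIO(csv_text), converters={4: lambda s: class_label[s]})

# Option B: parse first, then map the label column by name.
df_b = pd.read_csv(io.StringIO(csv_text))
df_b['Species'] = df_b['Species'].map(class_label)

labels_a = df_a['Species'].tolist()
labels_b = df_b['Species'].tolist()
print(labels_a, labels_b)
```

Both options give the same numeric labels. Option B keeps the original strings available until the mapping is applied, which can make it easier to spot an unexpected species name.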

# S3: convert the DataFrame parsed above into a numpy array
data = np.array(pd.read_csv(iris_path, converters = {4:iris_type}))
data[:10,:]        # view the first 10 rows

Out[6]:

array([[5.1, 3.5, 1.4, 0.2, 0. ],
       [4.9, 3. , 1.4, 0.2, 0. ],
       [4.7, 3.2, 1.3, 0.2, 0. ],
       [4.6, 3.1, 1.5, 0.2, 0. ],
       [5. , 3.6, 1.4, 0.2, 0. ],
       [5.4, 3.9, 1.7, 0.4, 0. ],
       [4.6, 3.4, 1.4, 0.3, 0. ],
       [5. , 3.4, 1.5, 0.2, 0. ],
       [4.4, 2.9, 1.4, 0.2, 0. ],
       [4.9, 3.1, 1.5, 0.1, 0. ]])

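# Step 4: split the original dataset into a training set and a test set

# Use np.split to split by column (axis=1)
# (4,): the split position; the first 4 columns become x, everything from column index 4 on becomes y
x,y = np.split(data, (4,), axis = 1)

X = x[:,0:2] # first two feature columns (not used below; the model is trained on all four columns in x)
# Use train_test_split to divide the data 7:3 into a training set and a test set.
# random_state = 0 fixes the random seed, so the split is the same on every run;
# leaving random_state unset gives a different split each time.
x_train, x_test, y_train,y_test = train_test_split(x,y,test_size = 0.3,random_state = 0)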
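To make the column split concrete, here is a small sketch on a made-up three-row array (the values in `data` are an assumption for illustration, not the full dataset). It shows the shapes that `np.split` returns and why the label array comes out as a column vector rather than the 1-D array scikit-learn expects.

```python
import numpy as np

# Assumption: a 3-row stand-in for `data`, four feature columns plus one label column.
data = np.array([[5.1, 3.5, 1.4, 0.2, 0.0],
                 [7.0, 3.2, 4.7, 1.4, 1.0],
                 [6.3, 3.3, 6.0, 2.5, 2.0]])

# Split at column index 4: columns 0-3 go to x, column 4 onward goes to y.
x, y = np.split(data, (4,), axis=1)
print(x.shape, y.shape)      # (3, 4) (3, 1)

# y is a column vector; ravel() flattens it to the 1-D shape (n_samples,).
y_flat = y.ravel().astype(int)
print(y_flat.shape, y_flat)  # (3,) [0 1 2]
```

Passing `y_flat` instead of `y` to an estimator's `fit` is the change that the DataConversionWarning later in this notebook asks for.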

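2. Building and training the model

Pipeline(steps): use the pipeline mechanism that sklearn provides to wrap and manage all the steps as one stream.
- Standardization: StandardScaler()
- PCA dimensionality reduction, keeping 2 principal components
- Final stage: the logistic regression classifier

Note that in the cell that follows, the first two stages are commented out, so only the classifier is actually fitted.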

pipe_LR = Pipeline([
                    # ('sc', StandardScaler()),
                    # ('pca', PCA(n_components = 2)),
                    ('clf', LogisticRegression(random_state=1))
                    ])
# start training
pipe_LR.fit(x_train, y_train)

/opt/conda/lib/python3.5/site-packages/sklearn/utils/validation.py:578: DataConversionWarning:  A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

Out[8]:

Pipeline(memory=None,
     steps=[('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=1, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
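For comparison, the sketch below enables all three stages and checks that the pipeline gives the same predictions as chaining the steps by hand. It rests on two assumptions: it uses scikit-learn's bundled `load_iris` data instead of iris.csv, and it runs with whatever default solver the installed scikit-learn version uses, so its scores need not match the numbers in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for iris.csv.
# y is already 1-D here, so fit() raises no DataConversionWarning.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The pipeline with all three stages enabled.
pipe_full = Pipeline([
    ('sc', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', LogisticRegression(random_state=1)),
])
pipe_full.fit(X_train, y_train)

# The same steps chained by hand: each transformer is fitted on the training
# data only, then applied to both the training and the test data.
sc = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(sc.transform(X_train))
clf = LogisticRegression(random_state=1).fit(
    pca.transform(sc.transform(X_train)), y_train)
manual_pred = clf.predict(pca.transform(sc.transform(X_test)))

same = (pipe_full.predict(X_test) == manual_pred).all()
print(same)
```

The main benefit of the pipeline form is that `fit`, `predict` and `score` each become a single call, and the scaler and PCA are never fitted on test data by accident.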

3. Evaluating the classifier

3.1. Accuracy

print("Training accuracy: %0.3f" %pipe_LR.score(x_train, y_train))

Training accuracy: 0.943

print("Test accuracy: %0.3f" %pipe_LR.score(x_test, y_test))

Test accuracy: 0.889

y_hat = pipe_LR.predict(x_test)
accuracy = metrics.accuracy_score(y_test, y_hat)
print("Accuracy of the logistic regression classifier: %0.3f" % accuracy)

Accuracy of the logistic regression classifier: 0.889
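Accuracy is simply the fraction of predictions that match the true labels, so it can also be computed by hand. The sketch below uses made-up labels (an assumption for illustration, not this notebook's predictions) and shows one pitfall that applies here: the true labels are a column vector, and comparing a column vector with a 1-D array broadcasts to a matrix.

```python
import numpy as np

# Assumption: hypothetical labels, not the notebook's actual predictions.
y_true_col = np.array([[0], [1], [2], [1]])   # shape (4, 1), like y_test in this notebook
y_pred = np.array([0, 2, 2, 1])               # shape (4,), like the output of predict()

# Pitfall: (4, 1) == (4,) broadcasts to a (4, 4) matrix, so this mean is not the accuracy.
wrong = (y_true_col == y_pred).mean()

# Flatten the labels first, then compare element by element.
accuracy = (y_true_col.ravel() == y_pred).mean()
print(wrong, accuracy)   # 0.3125 0.75
```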

3.2. Classification report summary

Precision
Recall
F1 Score

target_names = ['setosa', 'versicolor', 'virginica']
print(metrics.classification_report(y_test, y_hat, target_names = target_names))

             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        16
 versicolor       1.00      0.72      0.84        18
  virginica       0.69      1.00      0.81        11

avg / total       0.92      0.89      0.89        45
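The three metrics in the report can be recomputed from a confusion matrix. The matrix in the sketch below is not an output of this notebook: it is a reconstruction inferred from the printed precision, recall and support values (5 of the 18 versicolor samples predicted as virginica, everything else correct), so treat it as an assumption.

```python
import numpy as np

# Assumption: confusion matrix inferred from the report above
# (rows: true class, columns: predicted class).
names = ['setosa', 'versicolor', 'virginica']
cm = np.array([[16,  0,  0],
               [ 0, 13,  5],
               [ 0,  0, 11]])

tp = np.diag(cm)                  # correctly classified samples per class
precision = tp / cm.sum(axis=0)   # correct / all samples predicted as that class
recall = tp / cm.sum(axis=1)      # correct / all samples that truly belong to that class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)          # number of true samples per class

for n, p, r, f, s in zip(names, precision, recall, f1, support):
    print("%-10s %.2f %.2f %.2f %d" % (n, p, r, f, s))

accuracy = tp.sum() / cm.sum()
print("accuracy: %.3f" % accuracy)   # 0.889
```

Read this way, the report says that every error is a versicolor sample labelled as virginica, which is why versicolor loses recall and virginica loses precision.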

Source: Classifying the Iris Dataset with Logistic Regression (2)
