framing
- Label: the true thing we want to predict: y
  - the y variable in basic linear regression
- Features: the input variables that describe the data: x1, x2, ..., xn
  - the x variables in basic linear regression
- Example: a particular instance of the data: x
  - Labeled example: <features, label>: (x, y)
    - used to train the model
  - Unlabeled example: <features, ?>: (x, ?)
    - used to make predictions on new data
- Model: maps examples to predicted labels: y'
  - predictions are defined by the model's internal parameters, which are learned
Good features should be concrete and quantifiable. Something like "attractive or not" cannot be quantified and is too subjective; ask whether it can be turned into other concrete features, such as specific aspects like a shoe's color or style.
- Bias (b): in some machine learning texts the bias is also written as w0.
- Loss function:
  - Mean squared error (MSE): MSE = (1/N) * Σ_{(x,y)∈D} (y − prediction(x))²
  - Besides MSE there are other loss functions; MSE is neither the only one nor the best choice for every situation.
  - Neural networks are non-convex, so which minimum training settles into depends heavily on the initial values.
- Mini-batch stochastic gradient descent (mini-batch SGD); see the sketch below.
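To make the loss and the update rule concrete, here is a minimal NumPy sketch of mini-batch SGD for one-feature linear regression. The synthetic data, learning rate, and batch size are illustrative assumptions, not values from these notes.

```python
import numpy as np

# Synthetic data for y = 3x + 2 plus noise (illustrative only).
rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0          # model parameters (weight and bias)
learning_rate = 0.1
batch_size = 20

def mse(pred, target):
    """Mean squared error: (1/N) * sum((y - prediction)^2)."""
    return np.mean((target - pred) ** 2)

for step in range(500):
    # Sample a random mini-batch instead of using the full dataset.
    idx = rng.choice(len(x), size=batch_size, replace=False)
    xb, yb = x[idx], y[idx]
    pred = w * xb + b
    # Gradients of MSE with respect to w and b on this mini-batch.
    grad_w = -2.0 * np.mean((yb - pred) * xb)
    grad_b = -2.0 * np.mean(yb - pred)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print('w=%.3f b=%.3f mse=%.4f' % (w, b, mse(w * x + b, y)))
```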
tensorflow
Structure:
TensorFlow consists of two parts:
- graph protocol buffers
- a runtime that executes the (distributed) graph
The former is analogous to a Java compiler, the latter to the JVM.
Start by learning the high-level API: tf.estimator.
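As a hedged illustration of the "compiler vs. JVM" analogy, the sketch below first describes a tiny graph and then hands it to the runtime via a Session; it assumes the TF 1.x graph/session API that the rest of these notes use.

```python
import tensorflow as tf

# Building the graph: nothing is computed yet, we only describe operations
# (this is the part analogous to compiling a program).
a = tf.constant(3.0, name='a')
b = tf.constant(4.0, name='b')
total = tf.add(a, b, name='total')

# Running the graph: the runtime (analogous to the JVM) executes it.
with tf.Session() as sess:
    print(sess.run(total))  # 7.0
```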
pandas
```python
# coding:utf-8
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
cities = pd.DataFrame({'City name': city_names, 'Population': population})

print()
print(cities.head())
print()
print(type(cities['City name']))
print()
print(cities['City name'])

cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
cities['Population density'] = cities['Population'] / cities['Area square miles']
cities['is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(
    lambda name: name.startswith('San'))
print()
print(cities)

cities_1 = cities.reindex([2, 0, 1])  # the index values stay with their rows; only the order changes
print()
print(cities_1.head())  # cities itself is unchanged; reindex returns a new DataFrame

# Normally, when a Series or DataFrame is created, the index follows the order of the source data.
# Once created, the index values never change - the index is stable,
# even if the order of the rows changes later.
cities_2 = cities.reindex(np.random.permutation(cities.index))
# pd.set_option('max_columns', 5)
print()
print(cities_2.head())

cities_3 = cities.reindex([2, 3, 4])  # reindex may introduce new index values, which are filled with NaN
print()
print(cities_3.head())
```

Output:

```
/Users/tu/PycharmProjects/myFirstPythonDir/venv/bin/python /Users/tu/PycharmProjects/myFirstPythonDir/mytest/numpyDemo/googlepandas.py

       City name  Population
0  San Francisco      852469
1       San Jose     1015785
2     Sacramento      485199

<class 'pandas.core.series.Series'>

0    San Francisco
1         San Jose
2       Sacramento
Name: City name, dtype: object

       City name  ...  is wide and has saint name
0  San Francisco  ...                       False
1       San Jose  ...                        True
2     Sacramento  ...                       False

[3 rows x 5 columns]

       City name  ...  is wide and has saint name
2     Sacramento  ...                       False
0  San Francisco  ...                       False
1       San Jose  ...                        True

[3 rows x 5 columns]

       City name  ...  is wide and has saint name
0  San Francisco  ...                       False
2     Sacramento  ...                       False
1       San Jose  ...                        True

[3 rows x 5 columns]

       City name  ...  is wide and has saint name
2     Sacramento  ...                       False
3            NaN  ...                         NaN
4            NaN  ...                         NaN

[3 rows x 5 columns]

Process finished with exit code 0
```
Train a linear regression model that predicts house prices using the TensorFlow Estimator API.
- Overfitting:
  The model has very low loss on the training data but high loss on the test data, because the model fits the training data in an overly complex way.
  For this reason, machine learning needs an Occam's razor principle: prefer the simpler model.
- Data assumptions in supervised learning:
  - samples are independent and identically distributed (i.i.d.)
  - the distribution does not change over time (it is stationary)
  - samples are always drawn from the same distribution
If you split the data into only a training set and a test set, train on the training set, evaluate on the test set, tune the hyperparameters based on that evaluation, retrain, and keep repeating this loop, the model will gradually overfit the test set, and the test set loses its value as a measure of fit.
So a further split is needed: training set, validation set, and test set.
Repeatedly consulting the validation set and the test set wears them out.
That is, the more rounds of tuning that rely on the validation set and, later, the test set, the less information they carry about whether the model will generalize to genuinely unseen data.
So eventually more data is needed to refresh the test and validation sets.
Debugging in machine learning:
Most of the time you are debugging the data, not the code.
- Splitting the dataset without shuffling the data first:
- Shuffling the data first, then splitting:
The training, validation, and test sets must come from roughly the same distribution; shuffling before splitting helps ensure this (a minimal sketch follows below).
How the error behaves on each of the three datasets (see the RMSE curves plotted by the code below).
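A hedged sketch of "shuffle first, then split", mirroring what the training code below does with np.random.permutation; it assumes the california_housing_train.csv file used in that code.

```python
import numpy as np
import pandas as pd

df = pd.read_csv('california_housing_train.csv')

# Shuffle the rows first so the splits come from roughly the same distribution.
df = df.reindex(np.random.permutation(df.index))

# Simple head/tail split into training and validation sets.
train_df = df.head(12000)
validation_df = df.tail(5000)
# A separate file (california_housing_test.csv) serves as the test set below.
```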
Link to the test dataset
The code is as follows:
```python
import math
from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import os
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
tf.logging.set_verbosity(tf.logging.ERROR)  # DEBUG INFO WARN ERROR FATAL
pd.options.display.max_rows = 10
pd.options.display.max_columns = 9
pd.options.display.float_format = '{:.1f}'.format

# Load the datasets
california_housing_dataframe = pd.read_csv("california_housing_train.csv", sep=',')
california_housing_test_dataframe = pd.read_csv("california_housing_test.csv", sep=',')

# Shuffle the data - an important step
california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))


def process_feature(california_housing_dataframe):
    selected_feature = california_housing_dataframe[
        ["longitude",
         "latitude",
         "housing_median_age",
         "total_rooms",
         "total_bedrooms",
         "population",
         "households",
         "median_income"]]
    processed_feature = selected_feature.copy()
    processed_feature['rooms_per_population'] = processed_feature['total_rooms'] / processed_feature['population']
    return processed_feature


def process_target(california_housing_dataframe):
    output_target = pd.DataFrame()
    output_target['median_house_value'] = california_housing_dataframe['median_house_value'] / 1000.0
    return output_target


# The data is split into a training set and a validation set
train_examples = process_feature(california_housing_dataframe.head(12000))
train_targets = process_target(california_housing_dataframe.head(12000))
# print("\nTraining set:")
# print(train_examples.describe())
# print(train_targets.describe())

validation_examples = process_feature(california_housing_dataframe.tail(5000))
validation_targets = process_target(california_housing_dataframe.tail(5000))
# print('\nValidation set:')
# print(validation_examples.describe())
# print(validation_targets.describe())
#
# print('\nNo test set yet')

# Inspect the data: plot latitude/longitude
# plt.figure(figsize=(13, 8))
#
# ax = plt.subplot(1, 2, 1)
# ax.set_title('Validation Data')
# ax.set_autoscaley_on(False)
# ax.set_ylim([32, 43])
# ax.set_autoscalex_on(False)
# ax.set_xlim([-126, -112])
# plt.scatter(validation_examples['longitude'], validation_examples['latitude'], cmap='coolwarm',
#             c=validation_targets['median_house_value'] / validation_targets['median_house_value'].max())
#
# ax = plt.subplot(1, 2, 2)
# ax.set_title('Train Data')
# ax.set_autoscaley_on(False)
# ax.set_ylim([32, 43])
# ax.set_autoscalex_on(False)
# ax.set_xlim(-126, -112)
# plt.scatter(train_examples['longitude'], train_examples['latitude'], cmap='coolwarm',
#             c=train_targets['median_house_value'] / train_targets['median_house_value'].max())

test_examples = process_feature(california_housing_test_dataframe)
test_targets = process_target(california_housing_test_dataframe)


# 4. Define the input function
def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Input function.

    :param features: input features
    :param targets: labels
    :param batch_size: size of each batch
    :param shuffle: whether to shuffle the data
    :param num_epochs: number of times to repeat the data
    :return: a batch of features and labels
    """
    features = {key: np.array(value) for key, value in dict(features).items()}
    ds = Dataset.from_tensor_slices((features, targets))  # 2 GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels


def construct_feature_columns(input_features):
    return set([tf.feature_column.numeric_column(my_feature) for my_feature in input_features])


def train_model(learning_rate, steps, batch_size, train_examples, train_targets,
                validation_examples, validation_targets, test_examples, test_targets, periods=10):
    steps_per_periods = steps / periods  # number of training steps per reporting period

    # Optimizer
    my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)  # gradient clipping

    # Model
    linear_regressor = tf.estimator.LinearRegressor(feature_columns=construct_feature_columns(train_examples),
                                                    optimizer=my_optimizer)

    # Input functions
    training_input_fn = lambda: my_input_fn(train_examples, train_targets['median_house_value'],
                                            batch_size=batch_size)
    prediction_training_input_fn = lambda: my_input_fn(train_examples, train_targets['median_house_value'],
                                                       num_epochs=1, shuffle=False)
    prediction_validation_input_fn = lambda: my_input_fn(validation_examples,
                                                         validation_targets['median_house_value'],
                                                         num_epochs=1, shuffle=False)
    prediction_test_input_fn = lambda: my_input_fn(test_examples, test_targets['median_house_value'],
                                                   num_epochs=1, shuffle=False)

    print('Training model ...')
    print('RMSE:')
    training_rmse = []
    validation_rmse = []
    test_rmse = []
    for period in range(0, periods):
        linear_regressor.train(input_fn=training_input_fn, steps=steps_per_periods)

        training_predictions = linear_regressor.predict(input_fn=prediction_training_input_fn)
        training_predictions = np.array([item['predictions'][0] for item in training_predictions])
        # each item looks like: {'predictions': array([0.015675], dtype=float32)}
        validation_predictions = linear_regressor.predict(input_fn=prediction_validation_input_fn)
        validation_predictions = np.array([item['predictions'][0] for item in validation_predictions])
        test_predictions = linear_regressor.predict(input_fn=prediction_test_input_fn)
        test_predictions = np.array([item['predictions'][0] for item in test_predictions])

        # Errors
        training_root_mean_squared_error = math.sqrt(metrics.mean_squared_error(training_predictions, train_targets))
        validation_root_mean_squared_error = math.sqrt(
            metrics.mean_squared_error(validation_predictions, validation_targets))
        test_root_mean_squared_error = math.sqrt(metrics.mean_squared_error(test_predictions, test_targets))
        print('period %02d : %.2f' % (period, training_root_mean_squared_error))
        training_rmse.append(training_root_mean_squared_error)
        validation_rmse.append(validation_root_mean_squared_error)
        test_rmse.append(test_root_mean_squared_error)

    print('Model training finished.')

    plt.figure()
    plt.ylabel('RMSE')
    plt.xlabel('Periods')
    plt.title('Root mean squared error vs. periods')
    plt.tight_layout()
    plt.plot(training_rmse, label='training')
    plt.plot(validation_rmse, label='validation')
    plt.plot(test_rmse, label='test')
    plt.legend()
    plt.show()

    return linear_regressor


train_model(learning_rate=0.00003,
            steps=5000,
            batch_size=5,
            train_examples=train_examples,
            train_targets=train_targets,
            validation_examples=validation_examples,
            validation_targets=validation_targets,
            test_examples=test_examples,
            test_targets=test_targets,
            periods=100)
```
Converting raw data into feature vectors is called feature engineering.
- Numeric features can be copied over directly.
- String features are one-hot encoded:
  - first build a vocabulary from the string values, plus an "other" category for values not in the vocabulary
  - then one-hot encode against that vocabulary
- Categorical data: boolean type
- Avoid features whose values hardly ever repeat, i.e. extremely sparse, near-unique values. For example, when making predictions about people, an ID number is a poor feature because no two ID numbers repeat.
- Feature meanings should be clear and obvious to anyone.
- Remove outliers from the actual data.
- Account for instability: feature definitions should ideally be stable over time.
  Even a small amount of bad data can ruin a large dataset.
- Benefits of scaling feature values:
  - speeds up gradient descent
  - avoids the "NaN trap"
  - keeps the model from wasting capacity on features with overly wide ranges
- Log scaling shrinks the long tail.
- Clipping limits the range of the data: the tail disappears and a small peak appears at the boundary.
Floating-point features can be binned into discrete features (a vector), either into equal-width buckets or by quantiles; a sketch follows below.
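A minimal pandas/NumPy sketch of the representations above: vocabulary-based one-hot encoding, linear scaling to [-1, 1], log scaling, clipping, and quantile bucketing. The toy column names and values are assumptions for illustration, not taken from the housing data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'street_type': ['paved', 'dirt', 'gravel', 'paved', 'cobblestone'],  # string feature
    'rooms': [2.0, 3.0, 150.0, 4.0, 5.0],                                # numeric feature with an outlier
})

# 1. One-hot encode a string feature against a fixed vocabulary (+ "other").
vocabulary = ['paved', 'dirt', 'gravel']
street = df['street_type'].where(df['street_type'].isin(vocabulary), other='other')
one_hot = pd.get_dummies(street)

# 2. Linear scaling to [-1, 1].
lo, hi = df['rooms'].min(), df['rooms'].max()
rooms_scaled = 2 * (df['rooms'] - lo) / (hi - lo) - 1

# 3. Log scaling (shrinks a long tail) and clipping (caps extreme values).
rooms_log = np.log1p(df['rooms'])
rooms_clipped = df['rooms'].clip(upper=10.0)

# 4. Quantile bucketing of a float feature, then one-hot encoding of the buckets.
buckets = pd.qcut(df['rooms'], q=2, labels=False, duplicates='drop')
bucket_one_hot = pd.get_dummies(buckets, prefix='rooms_bucket')

print(one_hot)
print(rooms_scaled)
print(bucket_one_hot)
```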
- Scrub unreliable samples:
  omitted values: a sample is missing the value for one of its features
  duplicate samples
  badly labeled samples
  bad feature values
  Samples like these should be filtered out of the dataset.
- Detect bad data using histograms, maximum and minimum values, the mean, the median, and the standard deviation.
- Inspect a list of the most common values of each discrete feature and check whether they match expectations.
Know what you expect the data to look like, check whether the data you have meets those expectations (or explain why it doesn't), and check that the training data is consistent with data from other sources; a quick-audit sketch follows below.
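A hedged sketch of such a quick audit, assuming the california_housing_train.csv file used elsewhere in these notes:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('california_housing_train.csv')

# Max/min, mean, median (50%), and standard deviation for every numeric column.
print(df.describe())

# Most common values of a (near-)discrete feature - do they match expectations?
print(df['housing_median_age'].value_counts().head(10))

# Histogram to spot outliers and odd spikes.
df['median_income'].hist(bins=50)
plt.show()
```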
A common target range for scaled feature values is [-1, 1].
The program selects features based on their correlations and also buckets some of them.
Median income and latitude were chosen as features.
Latitude was bucketed by mapping the floating-point values into integer buckets, which helped noticeably.
Increase the learning rate and train for more steps:
Cleaning the data and then bucketing it greatly reduced the error.
""" 创建一个集合:用更少的特征取得跟复杂特征效果一样好的成果 特征少,模型使用的资源就少,更加易于维护 """ import math from IPython import display from matplotlib import cm from matplotlib import gridspec from matplotlib import pyplot as plt import os import numpy as np import pandas as pd from sklearn import metrics import tensorflow as tf from tensorflow.python.data import Dataset os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' tf.logging.set_verbosity(tf.logging.ERROR) # DEBUG INFO WARN ERROR FATAL pd.options.display.max_rows = 10 pd.options.display.max_columns = 10 pd.options.display.float_format = '{:.1f}'.format # 加载数据集 california_housing_dataframe = pd.read_csv("california_housing_train.csv", sep=',') california_housing_test_dataframe = pd.read_csv("california_housing_test.csv", sep=',') # 随机数据,很重要的一步 california_housing_dataframe = california_housing_dataframe.reindex( np.random.permutation(california_housing_dataframe.index)) def process_feature(california_housing_dataframe): selected_feature = california_housing_dataframe[ ["longitude", "latitude", "housing_median_age", "total_rooms", "total_bedrooms", "population", "households", "median_income"]] processed_feature = selected_feature.copy() processed_feature['rooms_per_population'] = processed_feature['total_rooms'] / processed_feature['population'] return processed_feature def process_target(california_housing_dataframe): output_target = pd.DataFrame() output_target['median_house_value'] = california_housing_dataframe['median_house_value'] / 1000.0 return output_target # 数据被分为训练集、验证集 train_examples = process_feature(california_housing_dataframe.head(12000)) train_targets = process_target(california_housing_dataframe.head(12000)) print("\n训练集:") display.display(train_examples.describe()) print("\n训练集标签:") display.display(train_targets.describe()) validation_examples = process_feature(california_housing_dataframe.tail(5000)) validation_targets = process_target(california_housing_dataframe.tail(5000)) print('\n交叉验证集:') display.display(validation_examples.describe()) print("\n交叉验证集标签:") display.display(validation_targets.describe()) test_examples = process_feature(california_housing_test_dataframe) test_targets = process_target(california_housing_test_dataframe) print('\n测试集:') display.display(test_examples.describe()) print("\n测试集标签:") display.display(test_targets.describe()) # 检查数据,绘制经纬度图 # plt.figure(figsize=(13, 8)) # # ax = plt.subplot(1, 3, 1) # ax.set_title('Valication Data') # ax.set_autoscaley_on(False) # ax.set_ylim([32, 43]) # ax.set_autoscalex_on(False) # ax.set_xlim([-126, -112]) # plt.scatter(validation_examples['longitude'], validation_examples['latitude'], cmap='coolwarm', # c=validation_targets['median_house_value'] / validation_targets['median_house_value'].max()) # # ax = plt.subplot(1, 3, 2) # ax.set_title('Train Data') # ax.set_autoscaley_on(False) # ax.set_ylim([32, 43]) # ax.set_autoscalex_on(False) # ax.set_xlim(-126, -112) # plt.scatter(train_examples['longitude'], train_examples['latitude'], cmap='coolwarm', # c=train_targets['median_house_value'] / train_targets['median_house_value'].max()) # # ax = plt.subplot(1, 3, 3) # ax.set_title('Test Data') # ax.set_autoscaley_on(False) # ax.set_ylim([32, 43]) # ax.set_autoscalex_on(False) # ax.set_xlim(-126, -112) # plt.scatter(test_examples['longitude'], test_examples['latitude'], cmap='coolwarm', # c=test_targets['median_house_value'] / test_targets['median_house_value'].max()) # # plt.show() """ 构建良好的特征集 用相关矩阵,找出原始特征之间的相关性 要有与目标有相关性的特征,也要有独立的特征; """ correlation_dataframe = train_examples.copy() 
correlation_dataframe['target'] = train_targets.copy() correlation_metrix = correlation_dataframe.corr() print("\n相关矩阵:") display.display(correlation_metrix) ''' longitude latitude 负相关 -0.9 total_rooms total_bedrooms population households 正向关 0.9 1.0 median_income target 正向关 0.7;即与目标相关的特征为median_income 根据相关矩阵,合成特征,移除特征, ''' # 4.定义输入函数 def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None): """ 输入函数 :param features: 输入特征 :param targets: 数据标签 :param batch_size: 输出数据的大小 :param shuffle: 随机抽取数据 :param num_epochs:重复的次数 :return:数据和标签 """ features = {key: np.array(value) for key, value in dict(features).items()} ds = Dataset.from_tensor_slices((features, targets)) # 2GB限制 ds = ds.batch(batch_size).repeat(num_epochs) if shuffle: ds = ds.shuffle(buffer_size=10000) features, labels = ds.make_one_shot_iterator().get_next() return features, labels def construct_feature_columns(input_features): return set([tf.feature_column.numeric_column(my_feature) for my_feature in input_features]) def train_model(learning_rate, steps, batch_size, train_examples, train_targets, validation_examples, validation_targets, test_examples, test_targets, periods=10): steps_per_periods = steps / periods # 每次报告时所走的步长 # 最优化函数 my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate) my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0) # 梯度裁剪 # 模型 linear_regressor = tf.estimator.LinearRegressor(feature_columns=construct_feature_columns(train_examples), optimizer=my_optimizer) # 定义输入函数 training_input_fn = lambda: my_input_fn(train_examples, train_targets['median_house_value'], batch_size=batch_size) prediction_training_input_fn = lambda: my_input_fn(train_examples, train_targets['median_house_value'], num_epochs=1, shuffle=False) prediction_validation_input_fn = lambda: my_input_fn(validation_examples, validation_targets['median_house_value'], num_epochs=1, shuffle=False) prediction_test_input_fn = lambda: my_input_fn(test_examples, test_targets['median_house_value'], num_epochs=1, shuffle=False) print('Training model ...') print('RMSE:') training_rmse = [] validation_rmse = [] test_rmse = [] for period in range(0, periods): linear_regressor.train(input_fn=training_input_fn, steps=steps_per_periods) training_predictions = linear_regressor.predict(input_fn=prediction_training_input_fn) training_predictions = np.array([item['predictions'][0] for item in training_predictions]) # item是这样的:{'predictions': array([0.015675], dtype=float32)} validation_predictions = linear_regressor.predict(input_fn=prediction_validation_input_fn) validation_predictions = np.array([item['predictions'][0] for item in validation_predictions]) test_predictions = linear_regressor.predict(input_fn=prediction_test_input_fn) test_predictions = np.array([item['predictions'][0] for item in test_predictions]) # 误差 training_root_mean_squared_error = math.sqrt(metrics.mean_squared_error(training_predictions, train_targets)) validation_root_mean_squared_error = math.sqrt( metrics.mean_squared_error(validation_predictions, validation_targets)) test_root_mean_squared_error = math.sqrt(metrics.mean_squared_error(test_predictions, test_targets)) print('period %02d : %.2f' % (period, training_root_mean_squared_error)) training_rmse.append(training_root_mean_squared_error) validation_rmse.append(validation_root_mean_squared_error) test_rmse.append(test_root_mean_squared_error) print('Model training finished.') plt.figure() plt.ylabel('RMSE') plt.xlabel('Periods') plt.title('Root mean squared error vs. 
periods') plt.tight_layout() plt.plot(training_rmse, label='training') plt.plot(validation_rmse, label='validation') plt.plot(test_rmse, label='test') plt.legend() plt.show() return linear_regressor ''' longitude latitude 负相关 -0.9 total_rooms total_bedrooms population households 正向关 0.9 ~ 1.0 median_income target 正向关 0.7;即与目标相关的特征为median_income housing_median_age 与 total_rooms total_bedrooms population households 负相关 -0.3 ~ -0.4 根据相关矩阵,合成特征,移除特征, ''' # minimal_features = ["latitude", "median_income"] # # assert minimal_features, "至少必须有一个特征" # # minimal_features_train_examples = train_examples[minimal_features] # minimal_features_validation_examples = validation_examples[minimal_features] # minimal_features_test_examples = test_examples[minimal_features] # plt.scatter(train_examples["latitude"], train_targets['median_house_value']) # plt.show() # 分桶 def select_and_transform_features(source_df): LATITUDE_RANGES = zip(range(32, 42), range(33, 43)) selected_examples = pd.DataFrame() selected_examples['median_income'] = source_df['median_income'].copy() for r in LATITUDE_RANGES: selected_examples['latitude_%d_%d' % r] = source_df['latitude'].apply( lambda l: 1 if r[0] <= l < r[1] else 0) return selected_examples selected_train_examples = select_and_transform_features(train_examples) selected_validation_examples = select_and_transform_features(validation_examples) selected_test_examples = select_and_transform_features(test_examples) # 减少特征后,学习率降低,运算负担减轻 train_model(learning_rate=0.1, steps=2000, batch_size=5, train_examples=selected_train_examples, train_targets=train_targets, validation_examples=selected_validation_examples, validation_targets=validation_targets, test_examples=selected_test_examples, test_targets=test_targets, periods=20)
Feature crosses combined with large datasets are an effective strategy for learning complex models.
Neural networks are another such strategy.
A feature cross here can be understood as one of the higher-order terms of polynomial regression, as opposed to the purely linear terms of linear regression.
A cross of one-hot encoded features acts as a logical conjunction (AND).
Feature crosses essentially let the model express richer, more precise information.
Feature crosses allow a linear regression model to fit nonlinear data.
FTRL optimization algorithm: `my_optimizer = tf.train.FtrlOptimizer(learning_rate=learning_rate)`
One-hot encode discrete features (strings, enums, integers).
Continuous features can first be bucketed and then one-hot encoded; see the sketch below.
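As a hedged sketch of crossing bucketized features with the TF 1.x feature_column API, using the housing column names from the code above; the bucket boundaries and hash_bucket_size below are illustrative assumptions, not values from the exercise.

```python
import tensorflow as tf

# Raw numeric columns (same names as in the housing data above).
latitude = tf.feature_column.numeric_column('latitude')
longitude = tf.feature_column.numeric_column('longitude')

# Bucketize the continuous features (boundaries chosen for illustration).
lat_buckets = tf.feature_column.bucketized_column(latitude, boundaries=list(range(32, 43)))
lon_buckets = tf.feature_column.bucketized_column(longitude, boundaries=list(range(-125, -113)))

# Cross the two bucketized columns: each cell of the lat/lon grid becomes its own feature,
# which is what lets a linear model fit a nonlinear relationship.
lat_x_lon = tf.feature_column.crossed_column([lat_buckets, lon_buckets], hash_bucket_size=1000)

feature_columns = [lat_buckets, lon_buckets, lat_x_lon]

# Train with the FTRL optimizer, as noted above.
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=tf.train.FtrlOptimizer(learning_rate=0.1))
```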