Time Series Analysis of China's Fiscal Revenue (Part 2)

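This is my first blog post, and it goes to my time series course project. Why is the first post numbered Part 2? Because time series analysis can be done either with traditional models, or by playing down the time attribute and constructing other features through feature engineering. In the project I was responsible for the second approach, and since I have just finished writing the paper, I am starting with this one. This post does not derive any of the models; it only contains the code and the overall workflow. Derivations and implementations of the individual models will come in later posts.
The formatting of this first post may still look a bit odd... I haven't practiced markdown much either, emmm, but never mind, at least I've started writing, hhh.
If the code is poor or buggy, or if you have other ideas, please leave a comment. Thanks!
The data can be found at http://data.stats.gov.cn/easyquery.htm?cn=A01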

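The traditional models, i.e. the familiar AR, MA, ARMA, ARIMA, GARCH and so on, are showing both their strengths and their weaknesses now that data mining algorithms are everywhere. They have been developed and refined over a long time into an approach that handles time series well and is highly interpretable. With a single univariate series they can fit and forecast well. With multivariate series one can build VAR or VMA models, or test for cointegration, fit an ECM to capture short-run fluctuations, and finally run causality tests to check that the whole model is reasonable. All of this concerns linear time series, i.e. series in which X_t can be written as a linear combination of current and past errors. For nonlinear time series there is another family of models, such as TAR, SETAR and Markov regime-switching models. Together, these linear and nonlinear models have let us model time series reasonably well and use the analysis to support decisions.
However, we can also consider methods outside the traditional framework. Traditional models impose a strict requirement that the data be stationary, and even weak stationarity is demanding. We often need s-th order differencing, h-step differencing, or a log or Box-Cox transform to make the data stationary. But differencing and transformation lose some of the information in the original data: the loss may be small, but the full data are never used. Admittedly, the Cramér decomposition theorem tells us that any time series can be decomposed into a deterministic trend given by a polynomial and a stationary zero-mean error component: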

x_{t} = μ_{t} + ϵ_{t}

μ_{t} = \sum_{j = 0}^{d} β_{j} t^{j}

ϵ_{t} = ψ(B) a_{t}

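However, both the polynomial deterministic trend and the zero-mean error are hard to fit: polynomial fitting easily overfits or underfits, and this decomposition is not convenient for traditional models to work with. In addition, traditional models require missing values to be imputed first, which undermines the reliability of the data. Traditional models also lack flexibility: they only model the temporal dependence in the data, which makes it hard for the user to bring in background knowledge about the problem or other useful assumptions.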
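To make the over/underfitting point concrete, here is a minimal sketch of fitting the polynomial trend μ_t by least squares. The data are synthetic and the helper name `polynomial_trend` is made up for illustration; nothing here comes from the project itself.

```python
import numpy as np

def polynomial_trend(x, degree):
    """Least-squares fit of mu_t = sum_j beta_j * t**j; returns (trend, residual)."""
    # Rescale time to [0, 1] so that high-degree fits stay well conditioned
    t = np.linspace(0.0, 1.0, len(x))
    coefs = np.polyfit(t, x, deg=degree)   # coefficients, highest power first
    trend = np.polyval(coefs, t)
    return trend, x - trend

# Synthetic series (assumption): quadratic trend plus Gaussian noise
rng = np.random.default_rng(0)
t = np.arange(120, dtype=float)
x = 5.0 + 0.3 * t + 0.01 * t**2 + rng.normal(scale=2.0, size=t.size)

for d in (1, 2, 8):
    trend, resid = polynomial_trend(x, d)
    print(f"degree={d}: residual std = {resid.std():.3f}")
```

The in-sample residual can only shrink as the degree grows, so the residual size alone cannot tell you which degree is right: too low a degree leaves trend in the "error" term, too high a degree starts absorbing noise.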
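We therefore turn to some newer models and data mining algorithms, fitting the time series through suitable feature engineering, to explore how feasible these other methods are for time series analysis.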

Prophet

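Prophet is an open-source project from Facebook (its name is rendered as "先知" in Chinese). It aims to let people who have never studied time series carry out time series analysis. Prophet does not assume a stationary series. It decomposes the series into three components plus an error term: a growth trend (non-periodic changes), seasonality (periodic changes), and holiday effects, with the error term assumed to be Gaussian noise.
Prophet has open-source R and Python implementations; searching for prophet on GitHub leads to its documentation and installation instructions. Prophet requires the data to have two columns, 'ds' and 'y', and 'ds' must be in a date format, otherwise building the model raises an error. The Prophet model is set up below. Looking at the data, fiscal revenue rises abnormally around New Year's Day and National Day every year, so we treat these two days as holiday effects and pass them to the Prophet model.
The R code is as follows: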

library(prophet)
library(dplyr)

# Load the data; the first two columns are ds (date) and y (fiscal revenue)
df <- read.csv('revenue.csv')
df$ds <- as.Date(df$ds, '%Y/%m/%d')

# New Year's Day: the effect covers the day itself and the day after
new_year <- tibble(
  holiday = 'new_year',
  ds = seq.Date(from = as.Date('2010-01-01', format = '%Y-%m-%d'),
                to = as.Date('2016-01-01', format = '%Y-%m-%d'), by = 'year'),
  lower_window = 0,
  upper_window = 1
)

# National Day
national <- tibble(
  holiday = 'national',
  ds = seq.Date(from = as.Date('2010-10-01', format = '%Y-%m-%d'),
                to = as.Date('2018-10-01', format = '%Y-%m-%d'), by = 'year'),
  lower_window = 0,
  upper_window = 1
)
holidays <- bind_rows(new_year, national)

# Fit the model with the holiday table and forecast
m <- prophet(df[, 1:2], holidays = holidays, holidays.prior.scale = 1)
future <- make_future_dataframe(m, periods = 365)  # periods are counted in days by default
forecast <- predict(m, future)
yhat <- forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper', 'holidays')]
plot(m, forecast)
prophet_plot_components(m, forecast)
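Since the rest of the post is in Python, here is a rough Python counterpart of the holiday table. It is only a sketch: the helper names `yearly_holiday` and `fit_prophet` are made up, and the 12-month monthly forecast horizon is my assumption for monthly data, not what the R code uses. The model-fitting function is defined but not called, so the snippet runs without Prophet installed.

```python
import pandas as pd

def yearly_holiday(name, month, day, first_year, last_year):
    """One row per year for a fixed-date holiday, in the layout Prophet expects."""
    dates = pd.to_datetime(
        [f"{y}-{month:02d}-{day:02d}" for y in range(first_year, last_year + 1)]
    )
    return pd.DataFrame({
        "holiday": name,
        "ds": dates,
        "lower_window": 0,   # effect starts on the day itself
        "upper_window": 1,   # and extends one day after
    })

# Same date ranges as in the R code
holidays = pd.concat(
    [yearly_holiday("new_year", 1, 1, 2010, 2016),
     yearly_holiday("national", 10, 1, 2010, 2018)],
    ignore_index=True,
)
print(holidays.head())

def fit_prophet(df, holidays):
    """Fit Prophet on a frame with 'ds' and 'y' columns (requires the prophet package)."""
    from prophet import Prophet  # older releases used: from fbprophet import Prophet
    m = Prophet(holidays=holidays, holidays_prior_scale=1)
    m.fit(df[["ds", "y"]])
    # Assumption: monthly data, so forecast 12 month-start dates
    future = m.make_future_dataframe(periods=12, freq="MS")
    return m, m.predict(future)
```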

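All the code below is Python.
First we need to understand how the fiscal revenue data are distributed. In the figure below, the left panel shows the distribution of the original data, which is clearly right-skewed. We therefore apply a log transform; the right panel shows the distribution after the log transform, which is bimodal.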

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# `file` is the path to the data CSV (not given in this post)
data = pd.read_csv(file, encoding='gbk')
data['ds'] = pd.to_datetime(data['ds'])

plt.figure(figsize=(15, 8))
# Left: raw fiscal revenue
plt.subplot(121)
plt.hist(bins=10, x=data['y'])
plt.title('y')
# Right: log-transformed fiscal revenue
plt.subplot(122)
plt.hist(bins=45, x=np.log(data['y']))
plt.title('logy')
plt.show()
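Besides eyeballing histograms, the skew can be checked with a number. The sketch below computes the sample skewness before and after a log transform on made-up lognormal data (not the revenue data); the helper `skewness` is just for illustration.

```python
import numpy as np

def skewness(a):
    """Sample skewness: the mean of the cubed z-scores."""
    a = np.asarray(a, dtype=float)
    z = (a - a.mean()) / a.std()
    return float(np.mean(z ** 3))

# Made-up right-skewed data (assumption): lognormal draws
rng = np.random.default_rng(1)
y = rng.lognormal(mean=8.0, sigma=0.8, size=500)

print("skewness of y:     ", round(skewness(y), 3))
print("skewness of log(y):", round(skewness(np.log(y)), 3))
```

A clearly positive value indicates a right-skewed distribution, and a value near zero indicates rough symmetry. Note that skewness says nothing about bimodality, so the histogram is still worth looking at.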

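Next we look at the effect of month on fiscal revenue by grouping the data by month and drawing box plots. Apart from the outliers in January, the distributions of the months are roughly the same, so we can take the effect of month on fiscal revenue to be small and skip constructing month features.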

# Month of each observation, read from the date column
# (the original filled this with a running 1..12 counter, which assumes the data start in January)
data['month'] = data['ds'].dt.month
data.boxplot(by='month', column='y')
plt.xlabel('month')
plt.ylabel('y')
plt.show()
data = data.drop('month', axis=1)

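Observing the distribution of the original data gives us some ideas for feature engineering. First, we apply a log transform to reduce heteroscedasticity. Since the data show clear seasonality with period d=12, we then take the 12-step (seasonal) difference of the log-transformed data. Time series are usually autocorrelated, so past values help a great deal in predicting future ones; we therefore inspect the ACF plot to choose the number of lags used to construct features. We apply these steps to each of the two feature columns and examine the results.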
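For readers who want to see what the ACF plot is computing, here is a minimal numpy sketch of the sample autocorrelation and the usual approximate 95% band of ±1.96/√n. The AR(1) series is synthetic and the function names are my own; the post's actual plots were not produced this way as far as I can tell from the text.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

def significant_lags(x, max_lag):
    """Lags whose autocorrelation falls outside the approximate 95% band."""
    band = 1.96 / np.sqrt(len(x))
    acf = sample_acf(x, max_lag)
    lags = [k for k in range(1, max_lag + 1) if abs(acf[k]) > band]
    return lags, band

# Synthetic AR(1) series (assumption): x_t = 0.7 * x_{t-1} + e_t
rng = np.random.default_rng(2)
e = rng.normal(size=300)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.7 * x[t - 1] + e[t]

lags, band = significant_lags(x, 12)
print("band: +/-%.3f, significant lags: %s" % (band, lags))
```

In practice `plot_acf` from statsmodels draws the same quantity together with its confidence band.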

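For both feature columns, after the log transform and 12-step differencing, the series show autocorrelation up to lag 6; beyond that the autocorrelations fall roughly within the error bands and can be treated as zero. We therefore build a dataset with lags 1 through 6. Because of the 12-step differencing and the 6 lags, the final dataset runs from 1999-07-01 to 2018-04-01.
The data preparation code is as follows: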

# X is the lagged feature table with the target in column k, and ts appears to be
# the log-transformed revenue series (np.exp(ts) is plotted as the true data below);
# the code that builds them is not shown in this post
num = 214  # number of training rows; the remaining rows form the test set
k = 13     # number of feature columns; column k holds the target
x_train = X.iloc[:num, :k]
y_train = X.iloc[:num, k]
x_test = X.iloc[num:, :k]
y_test = X.iloc[num:, k]
y = X.iloc[:, k]
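The post does not show how X is built, so here is a minimal sketch of one plausible construction from the description above: log transform, 12-step difference, then lags 1 to 6 of each column. Everything in it is an assumption: the synthetic data, the start date 1998-01-01 (inferred from the 18 rows lost to differencing and lagging), and the helper `lagged_diffs`. It yields 12 feature columns, whereas the post uses k = 13; how the 13th feature is obtained is not stated.

```python
import numpy as np
import pandas as pd

def lagged_diffs(log_series, name, season=12, n_lags=6):
    """Lags 1..n_lags of the seasonally differenced log series."""
    d = log_series.diff(season)
    return pd.DataFrame({f"{name}_t-{i}": d.shift(i) for i in range(1, n_lags + 1)})

# Synthetic monthly data (assumption): y = fiscal revenue, m = total imports and exports
idx = pd.date_range("1998-01-01", periods=244, freq="MS")
rng = np.random.default_rng(3)
log_y = pd.Series(np.log(np.linspace(1000, 20000, 244) * rng.uniform(0.9, 1.1, 244)), index=idx)
log_m = pd.Series(np.log(np.linspace(2000, 30000, 244) * rng.uniform(0.9, 1.1, 244)), index=idx)

# Features first, target (the differenced log revenue at time t) in the last column
target = log_y.diff(12).rename("target")
X = pd.concat([lagged_diffs(log_y, "y"), lagged_diffs(log_m, "m"), target], axis=1).dropna()

print(X.index[0].date(), X.index[-1].date(), X.shape)
```

With 244 monthly observations, dropping the first 18 rows leaves 226, which is consistent with the 214 training rows plus 12 test rows used above.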

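For regression problems, data mining offers many kinds of algorithms, such as multiple linear regression and support vector regression. Because of the nonlinearity, we use tree-based regression algorithms, which build multiple regression equations across the nodes of a tree and train them on the dataset to obtain the model. Two kinds of tree regression algorithms are discussed below.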

Regression Tree

from sklearn.tree import DecisionTreeRegressor
dtc = DecisionTreeRegressor()
dtc.fit(x_train, y_train)

Next we define a function to make it easy to plot the results.

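def y_plot(Y0, Y, ts):
    # Y0: fitted values on the training set, Y: 12-step predictions,
    # ts: log-scale series, so np.exp(ts) gives the true values
    plt.figure(figsize=(15, 8))

    # Left: the whole series with fitted and predicted values
    plt.subplot(121)
    plt.title('All data plot', fontsize=15)
    plt.plot(ts.index[18:-12], Y0, color='green', label='fitted data')
    plt.plot(ts.index[-12:], Y, color='red', label='predict data')
    plt.plot(np.exp(ts), label='true data')
    plt.legend()

    # Right: the last 12 months only
    plt.subplot(122)
    plt.title('12 steps predict plot', fontsize=15)
    plt.plot(ts.index[-12:], Y, color='red', label='predict data')
    plt.plot(ts.index[-12:], np.exp(ts)[-12:], label='true data')
    plt.legend()
    plt.show()  # was sns.plt.show(); seaborn no longer exposes plt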

The results are as follows:

Y0 = dtc.predict(x_train)
Y = dtc.predict(x_test)
# Undo the 12-step differencing and the log transform
Y0 = np.exp(Y0 + ts[6:-24].values)
Y = np.exp(Y + ts[-24:-12].values)
y_plot(Y0, Y, ts)
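The restoration step works because the model predicts log y_t minus log y_{t-12}, so adding back the log value from 12 steps earlier and exponentiating returns y_t. A minimal round-trip check on made-up numbers (not the revenue data):

```python
import numpy as np

season = 12
rng = np.random.default_rng(4)
y = np.exp(rng.normal(loc=8.0, scale=0.3, size=48))  # made-up positive series
log_y = np.log(y)

# Forward transform: log, then 12-step difference
# d[i] = log y[i + 12] - log y[i]
d = log_y[season:] - log_y[:-season]

# Inverse transform: add the log value 12 steps earlier, then exponentiate
restored = np.exp(d + log_y[:-season])

print("max abs error:", np.max(np.abs(restored - y[season:])))
```

In the post's code the slices `ts[6:-24]` and `ts[-24:-12]` play the role of `log_y[:-season]`: they are the log values 12 months before each training and test target, respectively.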

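The fit on the training set is very good, and the performance on the test set is not bad either, although outliers are fitted poorly. The gap is large, though: the error table at the end shows a training RMSE of 20 against a testing RMSE of 1753, which suggests the unconstrained tree is overfitting.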

Gradient Boosting Decision Tree

GridSearchCV is used for tuning below. Tuning varies from person to person and is pretty much down to luck (lol).

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Step 1: search over the number of trees
p1 = {
    'n_estimators': list(range(1, 50))
}
s1 = GridSearchCV(GradientBoostingRegressor(learning_rate=0.05), param_grid=p1)
s1.fit(x_train, y_train)
s1.best_params_  ## 'n_estimators': 27

# Step 2: search over the learning rate and the split/leaf sizes
p2 = {
    'learning_rate': [0.05, 0.1, 0.2],
    'min_samples_split': list(range(2, 20, 3)),
    'min_samples_leaf': list(range(2, 20, 3))
}
s2 = GridSearchCV(GradientBoostingRegressor(n_estimators=27, learning_rate=0.05), param_grid=p2)
s2.fit(x_train, y_train)
s2.best_params_  ## 'learning_rate': 0.05, 'min_samples_leaf': 11, 'min_samples_split': 14

# Final model: more trees and a smaller learning rate than the grid-search result
gbdt = GradientBoostingRegressor(n_estimators=500, learning_rate=0.01,
                                 min_samples_leaf=11, min_samples_split=14)
gbdt.fit(x_train, y_train)
Y0 = gbdt.predict(x_train)
Y = gbdt.predict(x_test)
# Undo the 12-step differencing and the log transform
Y0 = np.exp(Y0 + ts[6:-24].values)
Y = np.exp(Y + ts[-24:-12].values)

y_plot(Y0, Y, ts)

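Fitting the data with a single tree easily leads to overfitting or underfitting, and it does not give a very reliable picture of feature importance. Building on bagging and boosting, ensembles of trees lead to random forests (bagging applied to regression trees) and XGBoost (a regularized implementation of gradient boosted trees).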

Random Forest

from sklearn.ensemble import RandomForestRegressor
# sklearn.grid_search has been removed; GridSearchCV now lives in sklearn.model_selection
from sklearn.model_selection import GridSearchCV

# Step 1: search over the number of trees
param1 = {
    'n_estimators': list(range(1, 50))
}
search1 = GridSearchCV(estimator=RandomForestRegressor(bootstrap=True), param_grid=param1)
search1.fit(x_train, y_train)
search1.best_params_  ## 'n_estimators': 27

param2 = {
    'max_depth': list(range(3, 14, 2)),
    'min_samples_split': list(range(2, 10, 2))
}
search2 = GridSearchCV(estimator=RandomForestRegressor(n_estimators=35), param_grid=param2)
search2.fit(x_train, y_train)
search2.best_params_  ## 'max_depth': 3, 'min_samples_split': 8

param3 = {
    'min_samples_split': list(range(2, 20, 3)),
    'min_samples_leaf': list(range(2, 20, 3))
}
search3 = GridSearchCV(estimator=RandomForestRegressor(n_estimators=35, max_depth=3), param_grid=param3)
search3.fit(x_train, y_train)
search3.best_params_  ## 'min_samples_leaf': 17, 'min_samples_split': 11

param4 = {
    'max_features': list(range(1, 7))
}
search4 = GridSearchCV(estimator=RandomForestRegressor(n_estimators=35, max_depth=3,
                                                       min_samples_split=17, min_samples_leaf=2),
                       param_grid=param4)
search4.fit(x_train, y_train)
search4.best_params_  ## 'max_features': 6

rf = RandomForestRegressor(bootstrap=False, max_depth=3, n_estimators=500, min_samples_leaf=17,
                           min_samples_split=2, max_features=6)
rf.fit(x_train, y_train)
Y0 = rf.predict(x_train)
Y = rf.predict(x_test)
## Undo the differencing and the log transform
Y0 = np.exp(Y0 + ts[6:-24].values)
Y = np.exp(Y + ts[-24:-12].values)

y_plot(Y0, Y, ts)

XGBoost

I'm skipping XGBoost tuning here.. it's really too hard to tune...

import operator
import xgboost as xgb
from xgboost.sklearn import XGBRegressor

def xgbTraining(x_train, y_train, x_test):
    # Relies on the global series ts to undo the differencing
    dtrain = xgb.DMatrix(x_train, y_train)
    dtest1 = xgb.DMatrix(x_train)
    dtest2 = xgb.DMatrix(x_test)
    param = {}
    # param['n_estimators'] = 1000  # has no effect in xgb.train; the round count is passed below
    param['eta'] = 0.05
    param['max_depth'] = 5
    param['min_child_weight'] = 6   # was misspelled as 'mmin_child_weight'
    param['subsample'] = 0.8
    param['colsample_bytree'] = 0.8
    #param['reg_alpha'] = 0.1
    param['verbosity'] = 0          # replaces the deprecated 'silent' flag

    alg = xgb.train(param, dtrain, 5000)  # 5000 boosting rounds
    Y0 = alg.predict(dtest1)
    Y = alg.predict(dtest2)
    ## Undo the differencing and the log transform
    Y0 = np.exp(Y0 + ts[6:-24].values)
    Y = np.exp(Y + ts[-24:-12].values)

    # Feature importance: split counts, normalized to sum to 1
    importance = alg.get_fscore()
    importance = sorted(importance.items(), key=operator.itemgetter(1))
    df = pd.DataFrame(importance, columns=['feature', 'fscore'])
    df['fscore'] = df['fscore'] / df['fscore'].sum()
    return Y0, Y, df

Y0, Y, df = xgbTraining(x_train, y_train, x_test)
y_plot(Y0, Y, ts)

The feature importance plot is produced as follows:

plt.figure(figsize=(8, 6))
plt.bar(df['feature'], df['fscore'])
plt.title('XGBoost feature importance')
plt.show()

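I have no idea why the most important feature turns out to be m_t-6, i.e. the lag-6 value of the 12-step-differenced total imports and exports...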

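Beyond regression algorithms, neural networks are among the better-performing nonlinear models for predicting numeric variables. Here we build models with a BP (backpropagation) neural network and an LSTM (long short-term memory) network and compare their performance.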

BP Neural Network

import warnings
import time
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import newaxis
# keras.layers.core and keras.layers.recurrent are old module paths;
# the layers can be imported from keras.layers directly
from keras.layers import Dense, Activation, Dropout, LSTM
from keras.models import Sequential

warnings.filterwarnings("ignore")

def build_model(layers, k):
    # layers: widths of the hidden layers, k: number of input features
    model = Sequential()
    model.add(Dense(units=layers[0], input_dim=k))
    for i in range(1, len(layers)):
        model.add(Dense(units=layers[i]))
        model.add(Dropout(0.2))
        model.add(Activation('relu'))

    model.add(Dense(units=1))  # single output: the differenced log revenue

    model.compile(loss="mse", optimizer="adam")
    print('Compiling...')
    return model

epochs = 1000
layers = [100, 200, 400, 800, 1000, 800, 400, 200, 100, 50]
model = build_model(layers, k)
model.fit(x_train, y_train, batch_size=64, epochs=epochs)

Y0 = model.predict(x_train)
Y = model.predict(x_test)
## Undo the differencing and the log transform
Y0 = np.exp(Y0[:, 0] + ts[6:-24].values)
Y = np.exp(Y[:, 0] + ts[-24:-12].values)

y_plot(Y0, Y, ts)

LSTM

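Training the LSTM needs no extra features; you only need to fix the look_back order. The data must be normalized first, though. Since I still have some doubts about how to use LSTM here, I'm not posting the code for now, only the results; LSTM will be discussed in a later post.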
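The LSTM code itself is not shown, but the two preparation steps mentioned here can be sketched in a few lines of numpy. This is not the author's implementation: the helper names, the toy series and the choice of min-max scaling are my assumptions. Keras LSTM layers expect input of shape (samples, timesteps, features), which is what the windowing produces.

```python
import numpy as np

def min_max_scale(x):
    """Scale to [0, 1]; also return the bounds needed to invert the scaling."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo), lo, hi

def make_windows(x, look_back):
    """Each sample holds look_back consecutive values; the target is the next value."""
    X = np.array([x[i:i + look_back] for i in range(len(x) - look_back)])
    y = x[look_back:]
    return X[..., np.newaxis], y  # X has shape (samples, look_back, 1)

series = np.arange(10, dtype=float) * 3 + 5  # toy series (assumption)
scaled, lo, hi = min_max_scale(series)
X, y = make_windows(scaled, look_back=3)
print(X.shape, y.shape)

# Predictions made on the scaled data are mapped back like this
restored = scaled * (hi - lo) + lo
```

In a real train/test setup the scaling bounds would normally be computed on the training part only, so that no information from the test period leaks into training.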

The results aren't great anyway, hhh.

Calling the following code after each model prints that model's errors:

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

# All errors are computed on the original scale, after undoing the differencing and the log
train_rmse = np.sqrt(mean_squared_error(Y0, np.exp(ts[18:num+18])))
test_bias = mean_absolute_error(Y, np.exp(y_test + ts[-24:-12].values))
test_rmse = np.sqrt(mean_squared_error(Y, np.exp(y_test + ts[-24:-12].values)))
print('Training rmse:%d \nTesting mean bias:%d \nTesting rmse:%d ' % (train_rmse, test_bias, test_rmse))
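For reference, the two metrics are easy to write out by hand. The sketch below uses made-up numbers and my own helper names; it shows that the "mean bias" above is the mean absolute error, and that RMSE weights large errors more heavily.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

# Made-up values: three small errors and one large one
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 300.0, 440.0])

print("MAE :", mae(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
```

RMSE is never smaller than MAE, and the gap widens when a few errors dominate, so a testing RMSE far above the testing MAE hints at a handful of badly predicted months.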

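The final errors are listed in the table below (Testing Mean Bias is the value returned by mean_absolute_error):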

Model	Training RMSE	Testing Mean Bias	Testing RMSE
Regression Tree	20	1228	1753
GBDT	329	674	863
Random Forest	437	665	855
XGBoost	3	841	997
NN	85	825	990
LSTM	608	1568	2124

Finally, run 10-fold cross-validation on each of the tree-based models:

# sklearn.cross_validation has been removed; these now live in sklearn.model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error

m = make_scorer(mean_squared_error)  # plain MSE, so lower is better
kf = KFold(n_splits=10)              # was KFold(n=X.shape[0], n_folds=10) in the old API

dtc_cv = cross_val_score(DecisionTreeRegressor(), X.iloc[:, :k], X.iloc[:, k], cv=kf, scoring=m).mean()
rf_cv = cross_val_score(RandomForestRegressor(), X.iloc[:, :k], X.iloc[:, k], cv=kf, scoring=m).mean()
gbdt_cv = cross_val_score(GradientBoostingRegressor(), X.iloc[:, :k], X.iloc[:, k], cv=kf, scoring=m).mean()
xgb_cv = cross_val_score(XGBRegressor(), X.iloc[:, :k], X.iloc[:, k], cv=kf, scoring=m).mean()
print('''DTC cv mean score:%.4f \nRF cv mean score:%.4f
       \nGBDT cv mean score:%.4f \nXGBoost cv mean score:%.4f'''
      % (dtc_cv, rf_cv, gbdt_cv, xgb_cv))
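One caveat worth noting: with plain K-fold on time-ordered rows, most folds train on observations that come later than the ones being tested. A common alternative is an expanding-window split, where each test block only sees earlier data. The sketch below is a hand-rolled illustration with a made-up function name, not part of the original project; scikit-learn ships `TimeSeriesSplit` in `sklearn.model_selection` for the same purpose.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs in which training rows always precede test rows."""
    test_size = n_samples // (n_splits + 1)
    for i in range(n_splits):
        test_start = n_samples - (n_splits - i) * test_size
        train_idx = np.arange(0, test_start)
        test_idx = np.arange(test_start, test_start + test_size)
        yield train_idx, test_idx

# Toy example: 12 time-ordered rows, 3 splits
splits = list(expanding_window_splits(12, 3))
for train_idx, test_idx in splits:
    print("train: 0..%d  test: %s" % (train_idx[-1], test_idx.tolist()))
```

Each pair of index arrays can be used with `.iloc` on the feature table to fit on the training rows and score on the test rows.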