python 时间序列预测 —— SARIMA

我只是一个虾纸丫 提交于 2020-03-07 18:30:37

SARIMA(p,d,q)(P,D,Q,s) 季节性自回归移动平均模型,结构参数有七个

AR(p) 自回归模型,即用自己回归自己。

基本假设是,当前序列值取决于其带有时滞的先前值。模型中的最大滞后记为 p

要确定初始 p,需要查看PACF图并找到最大的显着滞后,之后其他大多数滞后变得无关紧要。

MA(q) 移动平均模型,是对时间序列的误差进行建模,并假设当前误差取决于带有滞后的误差。可以在ACF图上找到初始值。

结合以上两种方法:AR(p)+MA(q)=ARMA(p,q)AR(p)+MA(q)=ARMA(p,q),就是自回归移动平均模型

剩下的参数:

I(d)

  • order of integration. This is simply the number of nonseasonal differences needed to make the series stationary. In our case, it’s just 1 because we used first differences.
    Adding this letter to the four gives us the ARIMA
    model which can handle non-stationary data with the help of nonseasonal differences. Great, one more letter to go!

S(s)

  • this is responsible for seasonality and equals the season period length of the series
    With this, we have three parameters: (P,D,Q)

P

  • order of autoregression for the seasonal component of the model, which can be derived from PACF. But you need to look at the number of significant lags, which are the multiples of the season period length. For example, if the period equals 24 and we see the 24-th and 48-th lags are significant in the PACF, that means the initial P
    should be 2.

Q

  • similar logic using the ACF plot instead.

D

  • order of seasonal integration. This can be equal to 1 or 0, depending on whether seasonal differeces were applied or not.

好了,接下来我们用 SARIMA 来建模 😏

导入包

import warnings                                  # do not disturbe mode
warnings.filterwarnings('ignore')

# Load packages
import numpy as np                               # vectors and matrices
import pandas as pd                              # tables and data manipulations
import matplotlib.pyplot as plt                  # plots
import seaborn as sns                            # more plots

from dateutil.relativedelta import relativedelta # working with dates with style
from scipy.optimize import minimize              # for function minimization

import statsmodels.formula.api as smf            # statistics and econometrics
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs

from itertools import product                    # some useful functions
from tqdm import tqdm_notebook

# Importing everything from forecasting quality metrics
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import median_absolute_error, mean_squared_error, mean_squared_log_error
# MAPE
def mean_absolute_percentage_error(y_true, y_pred): 
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    
def tsplot(y, lags=None, figsize=(12, 7), style='bmh'):
    """
        Plot time series, its ACF and PACF, calculate Dickey–Fuller test
        
        y - timeseries
        lags - how many lags to include in ACF, PACF calculation
    """
    
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
        
    with plt.style.context(style):    
        fig = plt.figure(figsize=figsize)
        layout = (2, 2)
        ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
        acf_ax = plt.subplot2grid(layout, (1, 0))
        pacf_ax = plt.subplot2grid(layout, (1, 1))
        
        y.plot(ax=ts_ax)
        p_value = sm.tsa.stattools.adfuller(y)[1]
        ts_ax.set_title('Time Series Analysis Plots\n Dickey-Fuller: p={0:.5f}'.format(p_value))
        smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
        smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
        plt.tight_layout()

读取数据

ads = pd.read_csv('ads.csv', index_col=['Time'], parse_dates=['Time'])

在这里插入图片描述

plt.figure(figsize=(18, 6))
plt.plot(ads.Ads)
plt.title('Ads watched (hourly data)')
plt.grid(True)
plt.show()

在这里插入图片描述

tsplot(ads.Ads, lags=60)

在这里插入图片描述

# The seasonal difference
ads_diff = ads.Ads - ads.Ads.shift(24)
tsplot(ads_diff[24:], lags=60)

在这里插入图片描述

ads_diff = ads_diff - ads_diff.shift(1)
tsplot(ads_diff[24+1:], lags=60)

在这里插入图片描述

SARIMA参数

  • p 是 4,因为PACF上 的最后一个显著滞后,此后,其他大多数都不显著。
  • d 是1,因为做了一阶差分
  • q 根据 ACF 应该是 4 左右
  • P 可能是2,因为 24 和 48 的滞后在PACF上有些重要
  • D 是 1,因为我们做了季节差分
  • Q 可能为1,ACF的第24个滞后很明显,而第48个的滞后则不是
# setting initial values and some bounds for them
ps = range(2, 5)
d=1 
qs = range(2, 5)
Ps = range(0, 2)
D=1 
Qs = range(0, 2)
s = 24 # season length is still 24

# creating list with all the possible combinations of parameters
parameters = product(ps, qs, Ps, Qs)
parameters_list = list(parameters)
len(parameters_list)  # 36

根据 AIC 调参

def optimizeSARIMA(parameters_list, d, D, s):
    """Return dataframe with parameters and corresponding AIC
        
        parameters_list - list with (p, q, P, Q) tuples
        d - integration order in ARIMA model
        D - seasonal integration order 
        s - length of season
    """
    
    results = []
    best_aic = float("inf")

    for param in tqdm_notebook(parameters_list):
        # we need try-except because on some combinations model fails to converge
        try:
            model=sm.tsa.statespace.SARIMAX(ads.Ads, order=(param[0], d, param[1]), 
                                            seasonal_order=(param[2], D, param[3], s)).fit(disp=-1)
        except:
            continue
        aic = model.aic
        # saving best model, AIC and parameters
        if aic < best_aic:
            best_model = model
            best_aic = aic
            best_param = param
        results.append([param, model.aic])

    result_table = pd.DataFrame(results)
    result_table.columns = ['parameters', 'aic']
    # sorting in ascending order, the lower AIC is - the better
    result_table = result_table.sort_values(by='aic', ascending=True).reset_index(drop=True)
    
    return result_table
warnings.filterwarnings("ignore") 
result_table = optimizeSARIMA(parameters_list, d, D, s)
'''
	 parameters	    aic
0	(2, 3, 1, 1)	3888.642174
1	(3, 2, 1, 1)	3888.763568
2	(4, 2, 1, 1)	3890.279740
3	(3, 3, 1, 1)	3890.513196
4	(2, 4, 1, 1)	3892.302849
5	(4, 3, 1, 1)	3892.322855
6	(3, 4, 1, 1)	3893.762846
7	(4, 4, 1, 1)	3894.327967
8	(2, 2, 1, 1)	3894.798147
9	(2, 3, 0, 1)	3897.170902
10	(3, 2, 0, 1)	3897.815032
11	(4, 2, 0, 1)	3899.073591
12	(3, 3, 0, 1)	3899.165271
13	(3, 4, 0, 1)	3900.500309
14	(2, 4, 0, 1)	3900.502494
15	(4, 3, 0, 1)	3901.255700
16	(4, 4, 0, 1)	3902.650501
17	(2, 2, 0, 1)	3903.905714
18	(2, 3, 1, 0)	3909.281188
19	(3, 2, 1, 0)	3909.502838
20	(4, 2, 1, 0)	3910.927759
21	(3, 3, 1, 0)	3911.192654
22	(3, 4, 1, 0)	3911.344351
23	(2, 4, 1, 0)	3911.809710
24	(4, 4, 1, 0)	3913.084053
25	(4, 3, 1, 0)	3913.409057
26	(2, 2, 1, 0)	3914.786853
27	(3, 2, 0, 0)	3925.433806
28	(2, 2, 0, 0)	3925.786566
29	(2, 3, 0, 0)	3925.879649
30	(2, 4, 0, 0)	3926.311190
31	(3, 3, 0, 0)	3927.427240
32	(4, 2, 0, 0)	3927.427417
33	(3, 4, 0, 0)	3928.376406
34	(4, 3, 0, 0)	3929.059361
35	(4, 4, 0, 0)	3930.078725
'''
# set the parameters that give the lowest AIC
p, q, P, Q = result_table.parameters[0]

best_model=sm.tsa.statespace.SARIMAX(ads.Ads, order=(p, d, q), 
                                        seasonal_order=(P, D, Q, s)).fit(disp=-1)
print(best_model.summary())
'''
                                 Statespace Model Results                                 
==========================================================================================
Dep. Variable:                                Ads   No. Observations:                  216
Model:             SARIMAX(2, 1, 3)x(1, 1, 1, 24)   Log Likelihood               -1936.321
Date:                            Sat, 07 Mar 2020   AIC                           3888.642
Time:                                    15:00:14   BIC                           3914.660
Sample:                                09-13-2017   HQIC                          3899.181
                                     - 09-21-2017                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.7913      0.270      2.928      0.003       0.262       1.321
ar.L2         -0.5503      0.306     -1.799      0.072      -1.150       0.049
ma.L1         -0.7316      0.262     -2.793      0.005      -1.245      -0.218
ma.L2          0.5651      0.282      2.005      0.045       0.013       1.118
ma.L3         -0.1811      0.092     -1.964      0.049      -0.362      -0.000
ar.S.L24       0.3312      0.076      4.351      0.000       0.182       0.480
ma.S.L24      -0.7635      0.104     -7.361      0.000      -0.967      -0.560
sigma2      4.574e+07   5.61e-09   8.15e+15      0.000    4.57e+07    4.57e+07
===================================================================================
Ljung-Box (Q):                       43.70   Jarque-Bera (JB):                10.56
Prob(Q):                              0.32   Prob(JB):                         0.01
Heteroskedasticity (H):               0.65   Skew:                            -0.28
Prob(H) (two-sided):                  0.09   Kurtosis:                         4.00
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 1.08e+32. Standard errors may be unstable.
'''
tsplot(best_model.resid[24+1:], lags=60)

在这里插入图片描述

def plotSARIMA(series, model, n_steps):
    """Plots model vs predicted values
        
        series - dataset with timeseries
        model - fitted SARIMA model
        n_steps - number of steps to predict in the future    
    """
    
    # adding model values
    data = series.copy()
    data.columns = ['actual']
    data['sarima_model'] = model.fittedvalues
    # making a shift on s+d steps, because these values were unobserved by the model
    # due to the differentiating
    data['sarima_model'][:s+d] = np.NaN
    
    # forecasting on n_steps forward 
    forecast = model.predict(start = data.shape[0], end = data.shape[0]+n_steps)
    forecast = data.sarima_model.append(forecast)
    # calculate error, again having shifted on s+d steps from the beginning
    error = mean_absolute_percentage_error(data['actual'][s+d:], data['sarima_model'][s+d:])

    plt.figure(figsize=(15, 7))
    plt.title("Mean Absolute Percentage Error: {0:.2f}%".format(error))
    plt.plot(forecast, color='r', label="model")
    plt.axvspan(data.index[-1], forecast.index[-1], alpha=0.5, color='lightgrey')
    plt.plot(data.actual, label="actual")
    plt.legend()
    plt.grid(True)
plotSARIMA(ads, best_model, 50)

在这里插入图片描述

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!