Question
I currently have these data points of date vs. cumulative sum. I want to predict the cumulative sum for future dates using Python. What prediction method should I use?
My date series is in this format: ['2020-01-20', '2020-01-24', '2020-01-26', '2020-01-27', '2020-01-30', '2020-01-31'] dtype='datetime64[ns]'
I tried a spline, but it seems a spline can't handle a date-time series.
I tried Exponential Smoothing for time series forecasting, but the result is incorrect. I don't understand what predict(3) means and why it returns the predicted sum for dates I already have. I copied this code from an example. Here's my code for exponential smoothing:
fit1 = ExponentialSmoothing(date_cumsum_df).fit(smoothing_level=0.3, optimized=False)
fcast1 = fit1.predict(3)
fcast1

2020-01-27       1.810000
2020-01-30       2.467000
2020-01-31       3.826900
2020-02-01       5.978830
2020-02-02       7.785181
2020-02-04       9.949627
2020-02-05      11.764739
2020-02-06      14.535317
2020-02-09      17.374722
2020-02-10      20.262305
2020-02-16      22.583614
2020-02-18      24.808530
2020-02-19      29.065971
2020-02-20      39.846180
2020-02-21      58.792326
2020-02-22     102.054628
2020-02-23     201.038240
2020-02-24     321.026768
2020-02-25     474.318737
2020-02-26     624.523116
2020-02-27     815.166181
2020-02-28    1100.116327
2020-02-29    1470.881429
2020-03-01    1974.317000
2020-03-02    2645.321900
2020-03-03    3295.025330
2020-03-04    3904.617731
Which method is best suited for predicting sum values that seem to be increasing exponentially? Also, I'm pretty new to data science with Python, so go easy on me. Thanks.
Answer 1:
Exponential smoothing only works on a time series with no missing time steps, i.e. with observations at a regular frequency. Below I forecast your data five days into the future with the three methods you mentioned:
- Exponential Fit (your guess "seems to be exponentially increasing")
- Spline interpolation
- Exponential Smoothing
Note: I extracted your data from your plot with a data-thief tool and saved the dates to the variable dates and the data values to the variable values.
import pandas as pd
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from scipy.optimize import curve_fit
from scipy.interpolate import splrep, splev
df = pd.DataFrame()
# mdates.date2num allows functions like curve_fit and spline to digest time series data
df['dates'] = mdates.date2num(dates)
df['values'] = values
# Exponential fit function
def exponential_func(x, a, b, c, d):
    return a*np.exp(b*(x-c))+d
# Spline interpolation
def spline_interp(x, y, x_new):
    tck = splrep(x, y)
    return splev(x_new, tck)
# define forecast timerange (forecasting 5 days into future)
dates_forecast = np.linspace(df['dates'].min(), df['dates'].max() + 5, 100)
dd = mdates.num2date(dates_forecast)
# Doing exponential fit
popt, pcov = curve_fit(exponential_func, df['dates'], df['values'],
                       p0=(1, 1e-2, df['dates'][0], 1))
# Doing spline interpolation
yy = spline_interp(df['dates'], df['values'], dates_forecast)
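The mdates.date2num conversion used above is what lets numeric routines work with dates: it maps each date to a floating-point number of days. A minimal sketch of that behaviour, using the six dates from the question (the variable names are only for illustration). The absolute numbers depend on Matplotlib's epoch setting, so only the differences between them are inspected here:

```python
import numpy as np
import matplotlib.dates as mdates

# The six dates listed in the question
dates = np.array(['2020-01-20', '2020-01-24', '2020-01-26',
                  '2020-01-27', '2020-01-30', '2020-01-31'],
                 dtype='datetime64[ns]')

# Convert to floats counting days since Matplotlib's epoch
nums = mdates.date2num(dates)

# Gaps between consecutive observations, in days
day_gaps = np.diff(nums)
print(day_gaps)

# num2date converts the floats back to (timezone-aware) datetime objects
round_trip = mdates.num2date(nums)
print(round_trip[0].date())
```

Because the result is an ordinary float array, it can be passed directly as the x argument to curve_fit or splrep.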
So far this is straightforward (apart from the mdates.date2num function). Since your series has missing dates, you have to run a spline interpolation on your actual data to fill the missing time steps with interpolated values:
# Interpolating data for exponential smoothing (no missing data in time series allowed)
df_interp = pd.DataFrame()
df_interp['dates'] = np.arange(dates[0], dates[-1] + 1, dtype='datetime64[D]')
df_interp['values'] = spline_interp(df['dates'], df['values'],
                                    mdates.date2num(df_interp['dates']))
series_interp = pd.Series(df_interp['values'].values,
                          pd.date_range(start='2020-01-19', end='2020-03-04', freq='D'))
# Now the exponential smoothing works fine, provide the `trend` argument given your data
# has a clear (kind of exponential) trend
fit1 = ExponentialSmoothing(series_interp, trend='mul').fit(optimized=True)
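As a simpler alternative to the spline-based gap filling above (this is not the method used in this answer), pandas can make the missing days visible and fill them by time-weighted linear interpolation. A minimal sketch, assuming the six dates from the question and purely hypothetical values:

```python
import pandas as pd

# Dates from the question; the values are hypothetical placeholders
dates = pd.to_datetime(['2020-01-20', '2020-01-24', '2020-01-26',
                        '2020-01-27', '2020-01-30', '2020-01-31'])
values = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0]
series = pd.Series(values, index=dates)

# Reindex to a regular daily frequency; days without data become NaN
daily = series.asfreq('D')
n_missing = int(daily.isna().sum())
print(f"{n_missing} of {len(daily)} days have no observation")

# Fill the gaps linearly, weighted by the time between observations
filled = daily.interpolate(method='time')
print(filled)
```

Linear interpolation will underestimate values between points on a strongly curved series, which is one reason a spline may be preferable for data like yours.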
You can plot the three methods and compare their predictions for the upcoming five days:
# Plot data
plt.plot(mdates.num2date(df['dates']), df['values'], 'o')
# Plot exponential function fit
plt.plot(dd, exponential_func(dates_forecast, *popt))
# Plot interpolated values
plt.plot(dd, yy)
# Plot Exponential smoothing prediction using function `forecast`
plt.plot(np.concatenate([series_interp.index.values, fit1.forecast(5).index.values]),
         np.concatenate([series_interp.values, fit1.forecast(5).values]))
Comparing all three methods shows that you were right to choose exponential smoothing: its forecast for the next five days looks far more plausible than those of the other two methods.
Regarding your other question:
"I don't understand what predict(3) means and why it returns the predicted sum for dates I already have."
ExponentialSmoothing.fit() returns a statsmodels.tsa.holtwinters.HoltWintersResults object, which has two functions you can use for prediction/forecasting of values: predict and forecast.
predict takes a start and an end observation of your data and applies the ExponentialSmoothing model to the corresponding date values. To predict values in the future, you have to specify an end parameter that lies in the future:
>> fit1.predict(start=np.datetime64('2020-03-01'), end=np.datetime64('2020-03-09'))
2020-03-01 4240.649526
2020-03-02 5631.207307
2020-03-03 5508.614325
2020-03-04 5898.717779
2020-03-05 6249.810230
2020-03-06 6767.659081
2020-03-07 7328.416024
2020-03-08 7935.636353
2020-03-09 8593.169945
Freq: D, dtype: float64
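In your example, predict(3) (which equals predict(start=3)) returns the model's values for your existing dates, starting at the observation with zero-based index 3 (the fourth date, 2020-01-27) and ending at your last date, so it does no forecasting at all.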
forecast() does only forecasting. You simply pass the number of observations you want to forecast into the future:
>> fit1.forecast(5)
2020-03-05 6249.810230
2020-03-06 6767.659081
2020-03-07 7328.416024
2020-03-08 7935.636353
2020-03-09 8593.169945
Freq: D, dtype: float64
Since both functions are based on the same ExponentialSmoothing.fit model, their values are equal for equal dates.
Source: https://stackoverflow.com/questions/60556547/exponentialsmoothing-what-prediction-method-to-use-for-this-date-plot