I need to be able to create a python function for forecasting based on linear regression model with confidence bands on time-series data:
The function needs to take an a
Scikit is a great module for python
The X and Y vars must be separated into two arrays where if you were to plot them (X,Y) the index from one would match with the other array.
So I guess in your time series data you would separate the X and Y values as follows:
null = None
time_series = [{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]}]
# assuming the time series is the X axis value
input_X_vars = []
input_Y_vars = []
for pair in time_series[0]["datapoints"]:
if (pair[0] != None and pair[1] != None):
input_X_vars.append(pair[1])
input_Y_vars.append(pair[0])
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
regr = linear_model.LinearRegression()
regr.fit(input_X_vars, input_Y_vars)
test_X_vars = [1435688681, 1435688685]
results = regr.predict(test_X_vars)
forecast_append = {"target": "Lower", "datapoints": results}
time_series.append(forecast_append)
I set null as None as the 'null' keyword is parsed as simply a variable in python.
Note: this is not a detailed canonical answer, but references to available libraries that apply to your domain (statistical models).
For python, you could use:
scipy.stats.linregress
)sklearn
: here is the documentation.There are some good articles:
statsmodels
and sklearn
.In the text of your question, you clearly state that you would like upper and lower bounds on your regression output, as well as the output prediction. You also mention using Holt-Winters algorithms for forecasting in particular.
The packages suggested by other answerers are useful, but you might note
that sklearn
LinearRegression does not give you error bounds "out of
the box", statsmodels does not provide Holt-Winters right now.
Therefore, I suggest try using this implementation of Holt-Winters.
Unfortunately its license is unclear, so I can't reproduce it here in
full. Now, I'm not sure whether you actually want Holt-Winters
(seasonal) prediction, or Holt's linear exponential smoothing
algorithm. I'm guessing the latter given the title of the post. Thus,
you can use the linear()
function of the linked library. The
technique is described in detail here for interested readers.
In the interests of not providing a link only answer - I'll describe the main features here. A function is defined that takes the data i.e.
def linear(x, fc, alpha = None, beta = None):
x
is the data to be fit, fc
is the number of timesteps that you want
to forecast, alpha and beta take their usual Holt-Winters meanings:
roughly a parameter to control the amount of smoothing to the "level"
and to the "trend" respectively. If alpha
or beta
are not
specified, they are estimated using scipy.optimize.fmin_l_bfgs_b
to minimise the RMSE.
The function simply applies the Holt algorithm by looping through the existing data points and then returns the forecast as follows:
return Y[-fc:], alpha, beta, rmse
where Y[-fc:]
are the forecast points, alpha
and beta
are the
values actually used and rmse
is the root mean squared error.
Unfortunately, as you can see, there are no upper or lower confidence
intervals. By the way - we should probably refer to them as prediction
intervals.
Holt's algorithm and Holt-Winters algorithm are exponential smoothing techniques and finding confidence intervals for predictions generated from them is a tricky subject. They have been referred to as "rule of thumb" methods and, in the case of the Holt-Winters multiplicative algorithm, without "underlying statistical model". However, the final footnote to this page asserts that:
It is possible to calculate confidence intervals around long-term forecasts produced by exponential smoothing models, by considering them as special cases of ARIMA models. (Beware: not all software calculates confidence intervals for these models correctly.) The width of the confidence intervals depends on (i) the RMS error of the model, (ii) the type of smoothing (simple or linear); (iii) the value(s) of the smoothing constant(s); and (iv) the number of periods ahead you are forecasting. In general, the intervals spread out faster as α gets larger in the SES model and they spread out much faster when linear rather than simple smoothing is used.
We see here that an ARIMA(0,2,2) model is equivalent to a Holt linear model with additive errors
You indicate in comments that you "can easily do this in R". I
guess you may be used to the holt()
function provided by the
forecast
package in R
and therefore expecting such intervals. In
which case - you can adapt the python library to give them to you on the
same basis.
Looking at the R holt code, we can see that it returns an object
based on forecast(ets(...)
. Under the hood - this eventually calls to
this function class1, which returns a mean mu
and variance
var
(as well as cj
which I have to confess I do not understand).
The variance is used to calculate the upper and lower bounds here.
To do something similar in Python - we would need to produce something
similar to the class1
R function that estimates the variance of each
prediction. This function takes the residuals found in model fitting and
multiplies them by a factor at each time step to get the variance at
that timestep. In the particular case of the linear Holt's algorithm, the factor is the cumulative sum of alpha + k*beta
where k
is the number of timesteps' prediction. Once you have that
variance at each prediction point, treat the errors as normally
distributed and get the X% value from the normal distribution.
Here's an idea how to do this in Python (using the code I linked as your linear function)
#Copy, import or reimplement the RMSE and linear function from
#https://gist.github.com/andrequeiroz/5888967
#factor in case there are not 1 timestep per day - in your case
#assuming the timesteps are UTC epoch - I think they're 5 min
# spaced i.e. 288 per day
timesteps_per_day = 288
# Note - big assumption here - your known data will be regular in time
# i.e. timesteps_per_day observations per day. From the timestamps this seems valid.
# if you can't guarantee that - you'll need to interpolate the data
def holt_predict(data, timestamps, forecast_days, pred_error_level = 0.95):
forecast_timesteps = forecast_days*timesteps_per_day
middle_predictions, alpha, beta, rmse = linear(data,int(forecast_timesteps))
cum_error = [beta+alpha]
for k in range(1,forecast_timesteps):
cum_error.append(cum_error[k-1] + k*beta + alpha)
cum_error = np.array(cum_error)
#Use some numpy multiplication to get the intervals
var = cum_error * rmse**2
# find the correct ppf on the normal distribution (two-sided)
p = abs(scipy.stats.norm.ppf((1-pred_error_level)/2))
interval = np.sqrt(var) * p
upper = middle_predictions + interval
lower = middle_predictions - interval
fcast_timestamps = [timestamps[-1] + i * 86400 / timesteps_per_day for i in range(forecast_timesteps)]
ret_value = []
ret_value.append({'target':'Forecast','datapoints': zip(middle_predictions, fcast_timestamps)})
ret_value.append({'target':'Upper','datapoints':zip(upper,fcast_timestamps)})
ret_value.append({'target':'Lower','datapoints':zip(lower,fcast_timestamps)})
return ret_value
if __name__=='__main__':
import numpy as np
import scipy.stats
from math import sqrt
null = None
data_in = [{"target": "average", "datapoints": [[null, 1435688679],
[34.870499801635745, 1435688694], [null, 1435688709], [null,
1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769],
[null, 1435688784], [null, 1435688799], [null, 1435688814], [null,
1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874],
[null, 1435688889], [null, 1435688904], [null, 1435688919], [null,
1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979],
[38.180000209808348, 1435688994], [null, 1435689009], [null,
1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069],
[null, 1435689084], [null, 1435689099], [null, 1435689114], [null,
1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174],
[null, 1435689189], [null, 1435689204], [null, 1435689219], [null,
1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279],
[30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324],
[null, 1435689339], [null, 1435689354], [null, 1435689369], [null,
1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429],
[null, 1435689444], [null, 1435689459], [null, 1435689474], [null,
1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534],
[null, 1435689549], [null, 1435689564]]}]
#translate the data. There may be better ways if you're
#prepared to use pandas / input data is proper json
time_series = data_in[0]["datapoints"]
epoch_in = []
Y_observed = []
for (y,x) in time_series:
if y and x:
epoch_in.append(x)
Y_observed.append(y)
#Pass in the number of days to forecast
fcast_days = 30
res = holt_predict(Y_observed,epoch_in,fcast_days)
data_out = data_in + res
#data_out now holds the data as you wanted it.
#Optional plot of results
import matplotlib.pyplot as plt
plt.plot(epoch_in,Y_observed)
m,tstamps = zip(*res[0]['datapoints'])
u,tstamps = zip(*res[1]['datapoints'])
l,tstamps = zip(*res[2]['datapoints'])
plt.plot(tstamps,u, label='upper')
plt.plot(tstamps,l, label='lower')
plt.plot(tstamps,m, label='mean')
plt.show()
N.B. The output I've given adds points as tuple
type into your object. If you really need list
, then replace zip(upper,fcast_timestamps)
with map(list,zip(upper,fcast_timestamps))
where the code adds upper
, lower
and Forecast
dicts to the result.
This code is for the particular case of the Holt's linear algorithm - it is not a generic way to calculate correct prediction intervals.
Your sample input data seems to have a lot of null
and only 3 genuine
data points. This is highly unlikely to be a good basis for doing
timeseries prediction - especially as they all seem to be with 15 minutes and you're trying to forecast up to 3 months!. Indeed - if you feed that data into the R
holt()
, it will say:
You've got to be joking. I need more data!
i'm assuming you have a larger dataset to test on. I tried the code above on the stock market opening prices for 2015 and it seemed to give reasonable results (see below).
You may think the prediction intervals look a little wide. This blog from the author of the R forecast module implies that is intentional, though :
"confidence intervals for the mean are much narrower than prediction intervals"
I've made a linear regression using gradient descent in Python, perhaps that can help you. After the data are initialized, the loss function is created, which is ordinary least squares (OLS). Then the gradient function is created, and this is used in an iterative procedure to find the optimal parameters. These can then be used for forecasting. Here's a working example about the prediction of the grade obtained, given the number of books and the number of hours in class.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import *
from random import random, seed
from matplotlib import cm
# Initialize datapoints [# Books, hours in class], on the basis of which we want to make a prediction:
#x = np.array([[2104,3],[1600,3],[2400,3],[1416,2],[300,4],[2001,3]])
x = np.array([[0, 9],
[1, 15],
[0, 10],
[2, 16],
[4, 10],
[4, 20],
[1, 11],
[4, 20],
[3, 15],
[0, 15],
[2, 8],
[1, 13],
[4, 18],
[1, 10],
[0, 8],
[1, 10],
[3, 16],
[0, 11],
[1, 19],
[4, 12],
[4, 11],
[0, 19],
[2, 15],
[3, 15],
[1, 20],
[0, 6],
[3, 15],
[3, 19],
[2, 14],
[2, 13],
[3, 17],
[2, 20],
[2, 11],
[3, 20],
[4, 20],
[4, 20],
[3, 9],
[1, 8],
[2, 16],
[0, 10]])
# Initialize y, the grade obtained
y = np.array([45,
57,
45,
51,
65,
88,
44,
87,
89,
59,
66,
65,
56,
47,
66,
41,
56,
37,
45,
58,
47,
64,
97,
55,
51,
61,
69,
79,
71,
62,
87,
54,
43,
92,
83,
94,
60,
56,
88,
62])
#Initialize starting point for theta, the parameter we want to optimize:
#theta = np.array([90,0.1,-8])
theta = np.array([1,1,1])
# Initialize h_theta(x):
def h(x, theta):
return theta[0] + theta[1]*x[:,0] + theta[2]*x[:,1]
# Initialize the loss function (ordinary least squares), J(theta):
def lossf(theta):
loss = np.sum((h(x,theta)-y)*(h(x,theta)-y))
return loss
'''
for i in range(x.shape[0]):
lossIncrement = (h(x[[i]],theta) - y[i])*(h(x[[i]],theta) - y[i])
loss = loss + lossIncrement
return 0.5*loss
'''
# Define the gradient of the loss function for any parameters theta, use vector calculations here instead of for loops:
def gradlossf(theta):
# dJ/dTheta0:
d1 = np.sum(h(x,theta)-y)
d2 = np.sum((h(x,theta)-y)*x[:,0])
d3 = np.sum((h(x,theta)-y)*x[:,1])
return np.array([d1,d2,d3])
epsilon = 1e-2
max_iter = 1000
norm = np.linalg.norm(gradlossf(theta))
alpha0 = 1
iter = 0
beta = 1e-4
tolerance = 1e-6
tau = 0.5
# Start iterative procedure:
while iter < max_iter and norm > epsilon:
print theta
alpha = alpha0
alpha_found = False
while alpha_found is False:
xpoint = theta - alpha*gradlossf(theta)
val1 = lossf(xpoint)
val2 = lossf(theta)
if val1<val2+tolerance:
alpha_found = True
print 'We choose alpha =',alpha
else:
alpha = tau*alpha
iter = iter + 1
theta = xpoint
print 'New value of loss function is: ', val1
# THE CORRECT VALUES: [37.379, 4.037, 1.283] (from SPSS) Residual: 7306.230
# Plot datapoints
fig = plt.figure()
plt.hold(True)
x1_surf=np.arange(0, 5, 0.1) # generate a mesh
x2_surf=np.arange(5, 25, 0.1)
x1_surf, x2_surf = np.meshgrid(x1_surf, x2_surf)
z_surf = theta[0] + theta[1]*x1_surf + theta[2]*x2_surf # ex. function, which depends on x and y
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x1_surf, x2_surf, z_surf, cmap=cm.hot); # plot a 3d surface plot
ax.scatter(x[:,0],x[:,1],y)
ax.set_xlabel('# Books')
ax.set_ylabel('Hours in class')
ax.set_zlabel('Grade')
plt.show()