how do you create a linear regression forecast on time series data in python

时光说笑 · asked 2021-01-31 03:37

I need to be able to create a python function for forecasting based on linear regression model with confidence bands on time-series data:

The function needs to take an argument specifying how far out to forecast.

4 Answers
  • 2021-01-31 03:53

    scikit-learn is a great Python module for this.

    The X and Y variables must be separated into two index-aligned arrays, so that each X value lines up with the Y value it would be plotted against.

    So for your time series data you would separate the X and Y values as follows:

    null = None
    time_series = [{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]}]
    
    # assuming the timestamps are the X axis values
    
    input_X_vars = []
    input_Y_vars = []
    
    for pair in time_series[0]["datapoints"]:
        if pair[0] is not None and pair[1] is not None:
            input_X_vars.append(pair[1])
            input_Y_vars.append(pair[0])
    
    import numpy as np
    from sklearn import linear_model
    
    regr = linear_model.LinearRegression()
    # scikit-learn expects X as a 2D array of shape (n_samples, n_features)
    X = np.array(input_X_vars).reshape(-1, 1)
    regr.fit(X, input_Y_vars)
    
    test_X_vars = [1435688681, 1435688685]
    
    results = regr.predict(np.array(test_X_vars).reshape(-1, 1))
    
    # pair each prediction with its timestamp, matching the input format
    forecast_append = {"target": "Lower",
                       "datapoints": [[float(v), t] for v, t in zip(results, test_X_vars)]}
    
    time_series.append(forecast_append)
    

    I assigned null = None because null is not a keyword in Python; without that assignment, the null literals in the pasted data would be undefined names.

  • 2021-01-31 04:01

    Note: this is not a detailed canonical answer, just a set of pointers to available libraries that apply to your domain (statistical models).

    For Python, you could use:

    • scipy: scipy.stats.linregress
      • e.g. http://glowingpython.blogspot.com/2012/03/linear-regression-with-numpy.html (uses scipy.stats.linregress); see also the minimal sketch after this list.
    • statsmodels: here are some examples.
    • sklearn: here is the documentation.
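
    For instance, a minimal scipy.stats.linregress sketch (my illustration, not part of the original answer; it uses the three non-null points from the question's sample, and the variable names are made up for the example):

    import numpy as np
    from scipy import stats
    
    # the three non-null (value, timestamp) points from the question's data
    timestamps = np.array([1435688694, 1435688994, 1435689294], dtype=float)
    values = np.array([34.870499801635745, 38.180000209808348, 30.79849989414215])
    
    # slope and intercept define the fitted trend line;
    # rvalue, pvalue and stderr describe the quality of the fit
    fit = stats.linregress(timestamps, values)
    
    # extrapolate one 5-minute step past the last observation
    next_ts = timestamps[-1] + 300
    forecast = fit.intercept + fit.slope * next_ts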

    There are some good articles:

    • http://davidcoallier.com/blog/linear-regression-from-r-to-python/, in particular it has a section about forecasting.
    • http://connor-johnson.com/2014/02/18/linear-regression-with-python/, another article, shows both using statsmodels and sklearn.
  • 2021-01-31 04:04

    In the text of your question, you clearly state that you would like upper and lower bounds on your regression output, as well as the output prediction. You also mention using Holt-Winters algorithms for forecasting in particular.

    The packages suggested by the other answers are useful, but you might note that sklearn's LinearRegression does not give you error bounds "out of the box", and statsmodels did not provide Holt-Winters at the time of writing (though see the aside below on what statsmodels can do for a plain linear fit).
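
    As an aside, and not part of the original answer: for a plain linear regression, statsmodels' OLS can produce prediction intervals directly. A minimal sketch, assuming a reasonably recent statsmodels:

    import numpy as np
    import statsmodels.api as sm
    
    t = np.array([1435688694, 1435688994, 1435689294], dtype=float)
    y = np.array([34.8705, 38.1800, 30.7985])
    
    X = sm.add_constant(t - t[0])  # centre time so the numbers stay small
    fit = sm.OLS(y, X).fit()
    
    # predict 5 and 10 minutes past the last point, with 95% intervals
    future = sm.add_constant(np.array([900.0, 1200.0]), has_constant='add')
    frame = fit.get_prediction(future).summary_frame(alpha=0.05)
    # frame has mean, obs_ci_lower and obs_ci_upper columns
    # (with only 3 points the intervals are enormous; illustrative only)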

    Therefore, I suggest trying this implementation of Holt-Winters. Unfortunately its license is unclear, so I can't reproduce it here in full. Now, I'm not sure whether you actually want Holt-Winters (seasonal) prediction or Holt's linear exponential smoothing algorithm; I'm guessing the latter, given the title of the post. In that case you can use the linear() function of the linked library. The technique is described in detail here for interested readers.

    In the interests of not providing a link only answer - I'll describe the main features here. A function is defined that takes the data i.e.

     def linear(x, fc, alpha = None, beta = None):
    

    x is the data to be fitted, fc is the number of timesteps to forecast, and alpha and beta take their usual Holt-Winters meanings: roughly, parameters that control the amount of smoothing applied to the "level" and to the "trend" respectively. If alpha or beta is not specified, it is estimated using scipy.optimize.fmin_l_bfgs_b to minimise the RMSE.

    The function simply applies the Holt algorithm by looping through the existing data points and then returns the forecast as follows:

     return Y[-fc:], alpha, beta, rmse
    

    where Y[-fc:] are the forecast points, alpha and beta are the values actually used, and rmse is the root mean squared error. Unfortunately, as you can see, there are no upper or lower confidence intervals. (By the way, we should probably refer to them as prediction intervals.)
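
    For reference, here is a minimal, independent sketch of Holt's linear method with the same return signature. This is my own illustration, not the linked gist; alpha and beta are fixed for brevity, whereas the linked code estimates them with scipy.optimize.fmin_l_bfgs_b when they are not supplied:

    import numpy as np
    
    def holt_linear(x, fc, alpha=0.5, beta=0.5):
        x = np.asarray(x, dtype=float)
        level, trend = x[0], x[1] - x[0]   # a common initialisation
        one_step = [level + trend]          # one-step-ahead fitted values
        for y in x[1:]:
            last_level = level
            level = alpha * y + (1 - alpha) * (level + trend)
            trend = beta * (level - last_level) + (1 - beta) * trend
            one_step.append(level + trend)
        rmse = np.sqrt(np.mean((np.array(one_step[:-1]) - x[1:]) ** 2))
        # h-step-ahead forecast: last level plus h times the last trend
        forecast = [level + h * trend for h in range(1, fc + 1)]
        return forecast, alpha, beta, rmse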

    Prediction intervals maths

    Holt's algorithm and the Holt-Winters algorithm are exponential smoothing techniques, and finding confidence intervals for predictions generated from them is a tricky subject. They have been referred to as "rule of thumb" methods and, in the case of the Holt-Winters multiplicative algorithm, as having no "underlying statistical model". However, the final footnote to this page asserts that:

    It is possible to calculate confidence intervals around long-term forecasts produced by exponential smoothing models, by considering them as special cases of ARIMA models. (Beware: not all software calculates confidence intervals for these models correctly.) The width of the confidence intervals depends on (i) the RMS error of the model, (ii) the type of smoothing (simple or linear); (iii) the value(s) of the smoothing constant(s); and (iv) the number of periods ahead you are forecasting. In general, the intervals spread out faster as α gets larger in the SES model and they spread out much faster when linear rather than simple smoothing is used.

    We see here that an ARIMA(0,2,2) model is equivalent to a Holt linear model with additive errors.
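
    (A hedged aside, not in the original answer: that equivalence gives another route to prediction intervals, assuming a recent statsmodels. Note that since this answer was written, statsmodels has also gained Holt's method directly, as statsmodels.tsa.holtwinters.Holt.)

    from statsmodels.tsa.arima.model import ARIMA
    
    def arima_holt_intervals(y, fc, level=0.95):
        # ARIMA(0,2,2) is the equivalent of Holt's linear method with
        # additive errors, so its prediction intervals apply
        fit = ARIMA(y, order=(0, 2, 2)).fit()
        pred = fit.get_forecast(steps=fc)
        return pred.predicted_mean, pred.conf_int(alpha=1 - level)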

    Prediction intervals code (i.e. how to proceed)

    You indicate in the comments that you "can easily do this in R". I guess you may be used to the holt() function provided by the forecast package in R and therefore expect such intervals. In that case, you can adapt the Python library to give them to you on the same basis.

    Looking at the R holt code, we can see that it returns an object based on forecast(ets(...)). Under the hood, this eventually calls this function, class1, which returns a mean mu and variance var (as well as cj, which I have to confess I do not understand). The variance is used to calculate the upper and lower bounds here.

    To do something similar in Python, we would need to produce something like the class1 R function that estimates the variance of each prediction. This function takes the residuals found in model fitting and multiplies them by a factor at each time step to get the variance at that timestep. In the particular case of the linear Holt's algorithm, the factor is the cumulative sum of alpha + k*beta, where k is the number of timesteps forecast. Once you have the variance at each prediction point, treat the errors as normally distributed and take the X% value from the normal distribution.

    Here's an idea of how to do this in Python, using the code I linked as your linear function:

    #Copy, import or reimplement the RMSE and linear function from
    #https://gist.github.com/andrequeiroz/5888967
    
    import numpy as np
    import scipy.stats
    
    # factor in case there is not 1 timestep per day - in your case,
    # assuming the timestamps are UTC epoch seconds, the genuine
    # observations look 5 minutes apart, i.e. 288 per day
    timesteps_per_day = 288
    
    # Note - big assumption here - your known data must be regular in time,
    # i.e. timesteps_per_day observations per day. From the timestamps this seems valid.
    # If you can't guarantee that, you'll need to interpolate the data.
    def holt_predict(data, timestamps, forecast_days, pred_error_level=0.95):
        forecast_timesteps = forecast_days*timesteps_per_day
        middle_predictions, alpha, beta, rmse = linear(data, int(forecast_timesteps))
        middle_predictions = np.asarray(middle_predictions)
    
        # cumulative variance factor for Holt's linear method:
        # at step k, the running sum of alpha + k*beta
        cum_error = [beta + alpha]
        for k in range(1, forecast_timesteps):
            cum_error.append(cum_error[k-1] + k*beta + alpha)
        cum_error = np.array(cum_error)
    
        # variance at each step, then the two-sided normal quantile
        var = cum_error * rmse**2
        p = abs(scipy.stats.norm.ppf((1 - pred_error_level)/2))
        interval = np.sqrt(var) * p
        upper = middle_predictions + interval
        lower = middle_predictions - interval
        # stamp each forecast point one timestep beyond the last observation
        fcast_timestamps = [timestamps[-1] + (i + 1)*86400//timesteps_per_day
                            for i in range(forecast_timesteps)]
    
        ret_value = []
        ret_value.append({'target': 'Forecast', 'datapoints': list(zip(middle_predictions, fcast_timestamps))})
        ret_value.append({'target': 'Upper', 'datapoints': list(zip(upper, fcast_timestamps))})
        ret_value.append({'target': 'Lower', 'datapoints': list(zip(lower, fcast_timestamps))})
        return ret_value
    
    if __name__ == '__main__':
    
        null = None
        data_in = [{"target": "average", "datapoints": [[null, 1435688679],
        [34.870499801635745, 1435688694], [null, 1435688709], [null,
        1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769],
        [null, 1435688784], [null, 1435688799], [null, 1435688814], [null,
        1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874],
        [null, 1435688889], [null, 1435688904], [null, 1435688919], [null,
        1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979],
        [38.180000209808348, 1435688994], [null, 1435689009], [null,
        1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069],
        [null, 1435689084], [null, 1435689099], [null, 1435689114], [null,
        1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174],
        [null, 1435689189], [null, 1435689204], [null, 1435689219], [null,
        1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279],
        [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324],
        [null, 1435689339], [null, 1435689354], [null, 1435689369], [null,
        1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429],
        [null, 1435689444], [null, 1435689459], [null, 1435689474], [null,
        1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534],
        [null, 1435689549], [null, 1435689564]]}]
    
        # translate the data. There may be better ways if you're
        # prepared to use pandas / if the input data is proper JSON
        time_series = data_in[0]["datapoints"]
    
        epoch_in = []
        Y_observed = []
    
        for (y, x) in time_series:
            # compare against None explicitly: a genuine 0.0 reading is falsy
            if y is not None and x is not None:
                epoch_in.append(x)
                Y_observed.append(y)
    
        # pass in the number of days to forecast
        fcast_days = 30
        res = holt_predict(Y_observed, epoch_in, fcast_days)
        data_out = data_in + res
        # data_out now holds the data as you wanted it
    
        # optional plot of results
        import matplotlib.pyplot as plt
        plt.plot(epoch_in, Y_observed)
        m, tstamps = zip(*res[0]['datapoints'])
        u, tstamps = zip(*res[1]['datapoints'])
        l, tstamps = zip(*res[2]['datapoints'])
        plt.plot(tstamps, u, label='upper')
        plt.plot(tstamps, l, label='lower')
        plt.plot(tstamps, m, label='mean')
        plt.legend()
        plt.show()
    

    N.B. The output above stores each series' points as tuples (from list(zip(...))). If you really need [value, timestamp] lists, replace list(zip(upper, fcast_timestamps)) with [list(p) for p in zip(upper, fcast_timestamps)] where the code builds the Upper, Lower and Forecast dicts.

    This code is for the particular case of Holt's linear algorithm; it is not a generic way to calculate correct prediction intervals.

    Important note

    Your sample input data contains a lot of nulls and only 3 genuine data points. That is highly unlikely to be a good basis for time-series prediction, especially as the points all fall within 15 minutes and you're trying to forecast up to 3 months ahead! Indeed, if you feed that data into R's holt(), it will say:

    You've got to be joking. I need more data!

    I'm assuming you have a larger dataset to test on. I tried the code above on stock market opening prices for 2015 and it seemed to give reasonable results (see below).

    (Plot: the code above applied to 2015 stock market opening prices.)

    You may think the prediction intervals look a little wide. This blog post from the author of the R forecast module implies that is intentional, though:

    "confidence intervals for the mean are much narrower than prediction intervals"

  • 2021-01-31 04:13

    I've made a linear regression using gradient descent in Python; perhaps it can help you. After the data are initialized, the loss function is defined: ordinary least squares (OLS). Then the gradient function is defined and used in an iterative procedure to find the optimal parameters, which can then be used for forecasting. Here's a working example that predicts the grade obtained, given the number of books read and the number of hours in class.

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection
    from matplotlib import cm
    
    # Initialize datapoints [# books, hours in class], on the basis of which we want to make a prediction:
    x = np.array([[0, 9], [1, 15], [0, 10], [2, 16], [4, 10], [4, 20], [1, 11], [4, 20],
                  [3, 15], [0, 15], [2, 8], [1, 13], [4, 18], [1, 10], [0, 8], [1, 10],
                  [3, 16], [0, 11], [1, 19], [4, 12], [4, 11], [0, 19], [2, 15], [3, 15],
                  [1, 20], [0, 6], [3, 15], [3, 19], [2, 14], [2, 13], [3, 17], [2, 20],
                  [2, 11], [3, 20], [4, 20], [4, 20], [3, 9], [1, 8], [2, 16], [0, 10]])
    
    # Initialize y, the grade obtained:
    y = np.array([45, 57, 45, 51, 65, 88, 44, 87, 89, 59, 66, 65, 56, 47, 66, 41, 56, 37,
                  45, 58, 47, 64, 97, 55, 51, 61, 69, 79, 71, 62, 87, 54, 43, 92, 83, 94,
                  60, 56, 88, 62])
    
    # Initialize the starting point for theta, the parameters we want to optimize:
    theta = np.array([1.0, 1.0, 1.0])
    
    # The model h_theta(x) = theta0 + theta1*books + theta2*hours:
    def h(x, theta):
        return theta[0] + theta[1]*x[:,0] + theta[2]*x[:,1]
    
    # The loss function (ordinary least squares), J(theta):
    def lossf(theta):
        residuals = h(x, theta) - y
        return np.sum(residuals*residuals)
    
    # The gradient of the loss function, computed with vector operations
    # instead of for loops. Strictly this is the gradient of J/2; the
    # constant factor only rescales the step and the line search absorbs it.
    def gradlossf(theta):
        r = h(x, theta) - y
        d1 = np.sum(r)           # d(J/2)/dtheta0
        d2 = np.sum(r*x[:,0])    # d(J/2)/dtheta1
        d3 = np.sum(r*x[:,1])    # d(J/2)/dtheta2
        return np.array([d1, d2, d3])
    
    epsilon = 1e-2     # stop once the gradient norm drops below this
    max_iter = 1000
    alpha0 = 1.0       # initial step size for the backtracking line search
    tau = 0.5          # factor by which the step size shrinks
    tolerance = 1e-6
    n_iter = 0
    norm = np.linalg.norm(gradlossf(theta))
    
    # Start the iterative procedure (gradient descent with a backtracking line search):
    
    while n_iter < max_iter and norm > epsilon:
        print(theta)
    
        alpha = alpha0
        alpha_found = False
    
        # shrink alpha until the step actually decreases the loss
        while not alpha_found:
            xpoint = theta - alpha*gradlossf(theta)
            val1 = lossf(xpoint)
            val2 = lossf(theta)
    
            if val1 < val2 + tolerance:
                alpha_found = True
                print('We choose alpha =', alpha)
            else:
                alpha = tau*alpha
    
        n_iter = n_iter + 1
        theta = xpoint
        norm = np.linalg.norm(gradlossf(theta))  # re-check convergence with the new theta
        print('New value of loss function is: ', val1)
    
    
    # THE CORRECT VALUES: [37.379, 4.037, 1.283] (from SPSS) Residual: 7306.230
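    
    # Sanity check (an added sketch, not part of the original answer):
    # compare the gradient-descent theta against numpy's closed-form
    # least-squares solution, which should match the SPSS values above.
    X_design = np.column_stack([np.ones(len(y)), x])   # columns: 1, books, hours
    theta_ols, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    print('Gradient descent theta:', theta)
    print('Closed-form OLS theta: ', theta_ols)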
    
    # Plot the datapoints and the fitted plane
    fig = plt.figure()
    x1_surf = np.arange(0, 5, 0.1)      # generate a mesh over the input ranges
    x2_surf = np.arange(5, 25, 0.1)
    x1_surf, x2_surf = np.meshgrid(x1_surf, x2_surf)
    z_surf = theta[0] + theta[1]*x1_surf + theta[2]*x2_surf   # the fitted plane
    
    ax = fig.add_subplot(111, projection='3d')
    ax.plot_surface(x1_surf, x2_surf, z_surf, cmap=cm.hot)   # 3D surface plot
    ax.scatter(x[:,0], x[:,1], y)
    
    ax.set_xlabel('# Books')
    ax.set_ylabel('Hours in class')
    ax.set_zlabel('Grade')
    
    plt.show()
    