Deprecated rolling window option in OLS from Pandas to Statsmodels

前端 未结 4 2013
轻奢々
轻奢々 2020-12-07 17:16

as the title suggests, where has the rolling function option in the ols command in Pandas migrated to in statsmodels? I can\'t seem to find it. Pandas tells me doom is in th

相关标签:
4条回答
  • 2020-12-07 17:32

    For rolling trend in one column, one can just use:

    import numpy as np
    def calc_trend(window:int = 30):
        df['trend'] = df.rolling(window = window)['column_name'].apply(lambda x: np.polyfit(np.array(range(0,window)), x, 1)[0], raw=True)
    

    However, in my case I wasted to find a trend with respect to date, where date was in another column. I had to create the functionality manually, but it is easy. First, convert from TimeDate to int64 representing days from t_0:

    xdays = (df['Date'].values.astype('int64') - df['Date'][0].value) / (1e9*86400)
    

    Then:

    def calc_trend(window:int=30):
        for t in range(len(df)):
            if t < window//2:
                continue
            i0 = t  - window//2 # Start window
            i1 = i0 + window    # End window
            xvec = xdays[i0:i1]
            yvec = df['column_name'][i0:i1].values
            df.loc[t,('trend')] = np.polyfit(xvec, yvec, 1)[0]
    
    0 讨论(0)
  • 2020-12-07 17:37

    I created an ols module designed to mimic pandas' deprecated MovingOLS; it is here.

    It has three core classes:

    • OLS : static (single-window) ordinary least-squares regression. The output are NumPy arrays
    • RollingOLS : rolling (multi-window) ordinary least-squares regression. The output are higher-dimension NumPy arrays.
    • PandasRollingOLS : wraps the results of RollingOLS in pandas Series & DataFrames. Designed to mimic the look of the deprecated pandas module.

    Note that the module is part of a package (which I'm currently in the process of uploading to PyPi) and it requires one inter-package import.

    The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS takes advantage of broadcasting extensively also. Attributes largely mimic statsmodels' OLS RegressionResultsWrapper.

    An example:

    import urllib.parse
    import pandas as pd
    from pyfinance.ols import PandasRollingOLS
    
    # You can also do this with pandas-datareader; here's the hard way
    url = "https://fred.stlouisfed.org/graph/fredgraph.csv"
    
    syms = {
        "TWEXBMTH" : "usd", 
        "T10Y2YM" : "term_spread", 
        "GOLDAMGBD228NLBM" : "gold",
    }
    
    params = {
        "fq": "Monthly,Monthly,Monthly",
        "id": ",".join(syms.keys()),
        "cosd": "2000-01-01",
        "coed": "2019-02-01",
    }
    
    data = pd.read_csv(
        url + "?" + urllib.parse.urlencode(params, safe=","),
        na_values={"."},
        parse_dates=["DATE"],
        index_col=0
    ).pct_change().dropna().rename(columns=syms)
    print(data.head())
    #                  usd  term_spread      gold
    # DATE                                       
    # 2000-02-01  0.012580    -1.409091  0.057152
    # 2000-03-01 -0.000113     2.000000 -0.047034
    # 2000-04-01  0.005634     0.518519 -0.023520
    # 2000-05-01  0.022017    -0.097561 -0.016675
    # 2000-06-01 -0.010116     0.027027  0.036599
    
    y = data.usd
    x = data.drop('usd', axis=1)
    
    window = 12  # months
    model = PandasRollingOLS(y=y, x=x, window=window)
    
    print(model.beta.head())  # Coefficients excluding the intercept
    #             term_spread      gold
    # DATE                             
    # 2001-01-01     0.000033 -0.054261
    # 2001-02-01     0.000277 -0.188556
    # 2001-03-01     0.002432 -0.294865
    # 2001-04-01     0.002796 -0.334880
    # 2001-05-01     0.002448 -0.241902
    
    print(model.fstat.head())
    # DATE
    # 2001-01-01    0.136991
    # 2001-02-01    1.233794
    # 2001-03-01    3.053000
    # 2001-04-01    3.997486
    # 2001-05-01    3.855118
    # Name: fstat, dtype: float64
    
    print(model.rsq.head())  # R-squared
    # DATE
    # 2001-01-01    0.029543
    # 2001-02-01    0.215179
    # 2001-03-01    0.404210
    # 2001-04-01    0.470432
    # 2001-05-01    0.461408
    # Name: rsq, dtype: float64
    
    0 讨论(0)
  • 2020-12-07 17:42

    Adding for completeness a speedier numpy-only solution which limits calculations only to the regression coefficients and the final estimate

    Numpy rolling regression function

    import numpy as np
    
    def rolling_regression(y, x, window=60):
        """ 
        y and x must be pandas.Series
        """
    # === Clean-up ============================================================
        x = x.dropna()
        y = y.dropna()
    # === Trim acc to shortest ================================================
        if x.index.size > y.index.size:
            x = x[y.index]
        else:
            y = y[x.index]
    # === Verify enough space =================================================
        if x.index.size < window:
            return None
        else:
        # === Add a constant if needed ========================================
            X = x.to_frame()
            X['c'] = 1
        # === Loop... this can be improved ====================================
            estimate_data = []
            for i in range(window, x.index.size+1):
                X_slice = X.values[i-window:i,:] # always index in np as opposed to pandas, much faster
                y_slice = y.values[i-window:i]
                coeff = np.dot(np.dot(np.linalg.inv(np.dot(X_slice.T, X_slice)), X_slice.T), y_slice)
                estimate_data.append(coeff[0] * x.values[window-1] + coeff[1])
        # === Assemble ========================================================
            estimate = pandas.Series(data=estimate_data, index=x.index[window-1:]) 
            return estimate             
    

    Notes

    In some specific case uses, which only require the final estimate of the regression, x.rolling(window=60).apply(my_ols) appears to be somewhat slow

    As a reminder, the coefficients for a regression can be calculated as a matrix product, as you can read on wikipedia's least squares page. This approach via numpy's matrix multiplication can speed up the process somewhat vs using the ols in statsmodels. This product is expressed in the line starting as coeff = ...

    0 讨论(0)
  • 2020-12-07 17:45

    Rolling beta with sklearn

    import pandas as pd
    from sklearn import linear_model
    
    def rolling_beta(X, y, idx, window=255):
    
        assert len(X)==len(y)
    
        out_dates = []
        out_beta = []
    
        model_ols = linear_model.LinearRegression()
    
        for iStart in range(0, len(X)-window):        
            iEnd = iStart+window
    
            model_ols.fit(X[iStart:iEnd], y[iStart:iEnd])
    
            #store output
            out_dates.append(idx[iEnd])
            out_beta.append(model_ols.coef_[0][0])
    
        return pd.DataFrame({'beta':out_beta}, index=out_dates)
    
    
    df_beta = rolling_beta(df_rtn_stocks['NDX'].values.reshape(-1, 1), df_rtn_stocks['CRM'].values.reshape(-1, 1), df_rtn_stocks.index.values, 255)
    
    0 讨论(0)
提交回复
热议问题