Efficient Python Pandas Stock Beta Calculation on Many Dataframes

太阳男子 2020-12-07 13:29

I have many (4000+) CSVs of stock data (Date, Open, High, Low, Close) which I import into individual Pandas dataframes to perform analysis. I am new to python and want to c

6 answers
  •  囚心锁ツ
    2020-12-07 14:13

    Using a generator to improve memory efficiency

    Simulated data

    import numpy as np
    import pandas as pd

    m, n = 480, 10000
    # 'M' is the month-end alias; pandas 2.2+ spells it 'ME'
    dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
    stocks = pd.Index(['s{:04d}'.format(i) for i in range(n)])
    df = pd.DataFrame(np.random.rand(m, n), dates, stocks)
    market = pd.Series(np.random.rand(m), dates, name='Market')
    df = pd.concat([df, market], axis=1)
    

    Beta Calculation

    def beta(df, market=None):
        # If the market values are not passed,
        # I'll assume they are located in a column
        # named 'Market'.  If not, this will fail.
        if market is None:
            market = df['Market']
            df = df.drop('Market', axis=1)
        X = market.values.reshape(-1, 1)
        X = np.concatenate([np.ones_like(X), X], axis=1)
        b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values)
        return pd.Series(b[1], df.columns, name=df.index[-1])
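
As a quick sanity check, the slope row returned by `beta` can be compared against a plain per-column fit. This is only a sketch: the toy data, the seed, and the names `frame`, `vectorized`, and `one_by_one` are made up for illustration, and `np.polyfit` stands in as an independent reference rather than being part of the answer's method.

```python
import numpy as np
import pandas as pd

def beta(df, market=None):
    # Same normal-equation solve as the answer's beta(): regress every
    # column on [1, market] at once and keep the slope row.
    if market is None:
        market = df['Market']
        df = df.drop('Market', axis=1)
    X = market.values.reshape(-1, 1)
    X = np.concatenate([np.ones_like(X), X], axis=1)
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values)
    return pd.Series(b[1], df.columns, name=df.index[-1])

# Hypothetical toy data: 24 periods, 3 stocks, fixed seed for repeatability.
rng = np.random.default_rng(0)
market = pd.Series(rng.normal(size=24), name='Market')
stocks = pd.DataFrame(rng.normal(size=(24, 3)), columns=['a', 'b', 'c'])
frame = pd.concat([stocks, market], axis=1)

# All slopes in one matrix solve.
vectorized = beta(frame)

# Reference: fit each stock separately with a degree-1 polynomial;
# np.polyfit returns [slope, intercept] for degree 1.
one_by_one = pd.Series(
    {c: np.polyfit(market.values, stocks[c].values, 1)[0] for c in stocks.columns}
)

print(np.allclose(vectorized.values, one_by_one.values))
```

The two should agree up to floating-point error, since both are least-squares slopes. The speedup comes from solving for every column in a single matrix product instead of looping.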
    

    roll function
    This returns a generator, so only one window is held in memory at a time instead of all of them at once

    def roll(df, w):
        for i in range(df.shape[0] - w + 1):
            yield pd.DataFrame(df.values[i:i+w, :], df.index[i:i+w], df.columns)
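
To see what the generator hands back, here is a small sketch on a made-up 5-row frame (the name `toy` and its values are purely illustrative). With a window of 3 it should produce three overlapping windows, each built only when requested.

```python
import numpy as np
import pandas as pd

def roll(df, w):
    # Yield one w-row window at a time instead of building them all up front.
    for i in range(df.shape[0] - w + 1):
        yield pd.DataFrame(df.values[i:i+w, :], df.index[i:i+w], df.columns)

# Hypothetical toy frame: 5 rows, 2 columns.
toy = pd.DataFrame(np.arange(10.0).reshape(5, 2), columns=['x', 'y'])

windows = roll(toy, 3)       # nothing is computed yet
first = next(windows)        # first window covers rows 0..2
remaining = list(windows)    # the other two windows: rows 1..3 and 2..4

print(first.index.tolist())  # [0, 1, 2]
print(len(remaining))        # 2
```

Note that a window longer than the frame yields nothing at all, so `pd.concat` over an empty result will raise. It is worth checking that the frame has at least `w` rows after `pct_change().dropna()`.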
    

    Putting it all together

    betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T
    

    Validation

    OP beta calc

    def calc_beta(df):
        np_array = df.values
        m = np_array[:,0] # market returns are column zero from numpy array
        s = np_array[:,1] # stock returns are column one from numpy array
        covariance = np.cov(s,m) # Calculate covariance between stock and market
        beta = covariance[0,1]/covariance[1,1]
        return beta
    

    Experiment setup

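    # 13 price rows leave 12 returns after pct_change().dropna(),
    # which is exactly one 12-period window; with 12 rows roll() yields nothing.
    m, n = 13, 2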
    dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
    
    cols = ['Open', 'High', 'Low', 'Close']
    dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(m, 4), dates, cols) for i in range(n)}
    
    market = pd.Series(np.random.rand(m), dates, name='Market')
    
    # axis must be passed by keyword in pandas 2.0+
    df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(axis=1)
    
    betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T
    
    # DataFrame.iteritems() was removed in pandas 2.0; items() is the equivalent
    for c, col in betas.items():
        dfs[c]['Beta'] = col
    
    dfs['s0000'].head(20)
    

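    # calc_beta expects returns (see its comments), so pass returns rather than prices
    calc_beta(df.pct_change().dropna()[['Market', 's0000']])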
    
    0.0020118230147777435
    

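    NOTE:
    The calculations are the same: both compute cov(stock, market) / var(market), which is the slope of the least-squares regression of stock returns on market returns, so they agree when applied to the same window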
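
For anyone who wants to reproduce the comparison, here is a seeded sketch that checks every rolling window rather than a single one. The price data, the seed, the daily frequency, and the names `prices`, `returns`, and `check` are assumptions made for the example; `beta`, `roll`, and `calc_beta` are copied from the code above.

```python
import numpy as np
import pandas as pd

def beta(df, market=None):
    # Vectorized OLS slope of every column on the 'Market' column.
    if market is None:
        market = df['Market']
        df = df.drop('Market', axis=1)
    X = market.values.reshape(-1, 1)
    X = np.concatenate([np.ones_like(X), X], axis=1)
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values)
    return pd.Series(b[1], df.columns, name=df.index[-1])

def roll(df, w):
    # Generator of consecutive w-row windows.
    for i in range(df.shape[0] - w + 1):
        yield pd.DataFrame(df.values[i:i+w, :], df.index[i:i+w], df.columns)

def calc_beta(df):
    # OP's version: column 0 is the market, column 1 is the stock.
    np_array = df.values
    m = np_array[:, 0]
    s = np_array[:, 1]
    covariance = np.cov(s, m)
    return covariance[0, 1] / covariance[1, 1]

# Hypothetical prices: 30 days, a market column and two stocks.
rng = np.random.default_rng(42)
dates = pd.date_range('2000-01-01', periods=30, freq='D', name='Date')
prices = pd.DataFrame(rng.uniform(50, 100, size=(30, 3)),
                      dates, ['Market', 's0000', 's0001'])
returns = prices.pct_change().dropna()

w = 12
betas = pd.concat([beta(sdf) for sdf in roll(returns, w)], axis=1).T

# Recompute s0000's beta window by window with the OP's function.
check = pd.Series(
    [calc_beta(win[['Market', 's0000']]) for win in roll(returns, w)],
    index=betas.index,
)

print(np.allclose(betas['s0000'].values, check.values))
```

Both paths run on the same return windows, so any difference should be at the level of floating-point rounding.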
