Fama Macbeth Regression in Python (Pandas or Statsmodels)

南笙 2020-12-09 06:32

Econometric Background

Fama Macbeth regression refers to a procedure to run regression for panel data (where there are N different individua

3 Answers
  • 2020-12-09 06:36

    An update to reflect the library situation for Fama-MacBeth as of Fall 2018. The fama_macbeth function has been removed from pandas for a while now. So what are your options?

    1. If you're using Python 3, then you can use the Fama-MacBeth method in LinearModels: https://github.com/bashtage/linearmodels/blob/master/linearmodels/panel/model.py

    2. If you're using Python 2 or just don't want to use LinearModels, then probably your best option is to roll your own.

    For example, suppose you have the Fama-French industry portfolios in a panel like the following (you've also computed some variables like past beta or past returns to use as your x-variables):

    In [1]: import pandas as pd
            import numpy as np
            import statsmodels.formula.api as smf
    
    In [4]: df = pd.read_csv('industry.csv',parse_dates=['caldt'])
            df.query("caldt == '1995-07-01'")
    
    Out[5]: 
          industry      caldt    ret    beta  r12to2  r36to13
    18432     Aero 1995-07-01   6.26  0.9696  0.2755   0.3466
    18433    Agric 1995-07-01   3.37  1.0412  0.1260   0.0581
    18434    Autos 1995-07-01   2.42  1.0274  0.0293   0.2902
    18435    Banks 1995-07-01   4.82  1.4985  0.1659   0.2951
    

    Fama-MacBeth primarily involves estimating the same cross-sectional regression month by month, so you can implement it using a groupby. Create a function that takes a dataframe (supplied by the groupby) and a patsy formula, fits the model, and returns the parameter estimates. Here is a barebones version (this is what the original questioner tried a few years ago; I'm not sure why it didn't work, although it's possible that back then the statsmodels results attribute params didn't return a pandas Series, so the result had to be converted to a Series explicitly. It works fine with the current version of pandas, 0.23.4):

    def ols_coef(x,formula):
        return smf.ols(formula,data=x).fit().params
    
    In [9]: gamma = (df.groupby('caldt')
                    .apply(ols_coef,'ret ~ 1 + beta + r12to2 + r36to13'))
            gamma.head()
    
    Out[10]: 
                Intercept      beta     r12to2   r36to13
    caldt                                               
    1963-07-01  -1.497012 -0.765721   4.379128 -1.918083
    1963-08-01  11.144169 -6.506291   5.961584 -2.598048
    1963-09-01  -2.330966 -0.741550  10.508617 -4.377293
    1963-10-01   0.441941  1.127567   5.478114 -2.057173
    1963-11-01   3.380485 -4.792643   3.660940 -1.210426
    

    Then just compute the mean, standard error on the mean, and a t-test (or whatever statistics you want). Something like the following:

    def fm_summary(p):
        s = p.describe().T
        s['std_error'] = s['std']/np.sqrt(s['count'])
        s['tstat'] = s['mean']/s['std_error']
        return s[['mean','std_error','tstat']]
    
    In [12]: fm_summary(gamma)
    Out[12]: 
                   mean  std_error     tstat
    Intercept  0.754904   0.177291  4.258000
    beta      -0.012176   0.202629 -0.060092
    r12to2     1.794548   0.356069  5.039896
    r36to13    0.237873   0.186680  1.274230
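
    As a sanity check on the summary step, here is a minimal sketch on simulated coefficients (the gamma values below are random numbers invented for illustration, not the industry-portfolio estimates). It shows that describe() reports the sample standard deviation (ddof=1), so std_error is the usual standard error of the mean and tstat tests whether the average coefficient is zero:

```python
import numpy as np
import pandas as pd

# Simulated period-by-period coefficients (illustrative only): 120 "months",
# an intercept centered on 0.5 and a slope centered on 0.
rng = np.random.default_rng(0)
gamma = pd.DataFrame(
    rng.normal(loc=[0.5, 0.0], scale=1.0, size=(120, 2)),
    columns=["Intercept", "beta"],
)

def fm_summary(p):
    """Mean, standard error of the mean, and t-stat for each coefficient."""
    s = p.describe().T
    s["std_error"] = s["std"] / np.sqrt(s["count"])
    s["tstat"] = s["mean"] / s["std_error"]
    return s[["mean", "std_error", "tstat"]]

summary = fm_summary(gamma)

# The same numbers computed by hand, to make the formula explicit.
n_periods = len(gamma)
manual_se = gamma.std(ddof=1) / np.sqrt(n_periods)
manual_t = gamma.mean() / manual_se
print(summary)
```

    The hand-computed manual_se and manual_t should agree with the std_error and tstat columns of the summary.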
    

    Improving Speed

    Using statsmodels for the regressions has significant overhead (particularly given you only need the estimated coefficients). If you want better efficiency, then you could switch from statsmodels to numpy.linalg.lstsq. Write a new function that does the ols estimation ... something like the following (notice I'm not doing anything like checking the rank of these matrices ...):

    def ols_np(data,yvar,xvar):
        gamma,_,_,_ = np.linalg.lstsq(data[xvar],data[yvar],rcond=None)
        return pd.Series(gamma)
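
    To show how ols_np plugs into the same groupby, here is a minimal sketch on a simulated panel (the column names caldt, ret, and beta mirror the example above, but the data and the const column are made up for illustration). Note that lstsq does not add an intercept, so a column of ones has to be supplied explicitly:

```python
import numpy as np
import pandas as pd

def ols_np(data, yvar, xvar):
    # Least-squares coefficients only; no rank checks or standard errors.
    gamma, _, _, _ = np.linalg.lstsq(
        data[xvar].to_numpy(), data[yvar].to_numpy(), rcond=None
    )
    # Label the coefficients with the regressor names.
    return pd.Series(gamma, index=xvar)

# Simulated panel: 60 periods x 50 entities, true model ret = 1 + 2*beta + noise.
rng = np.random.default_rng(0)
n_periods, n_entities = 60, 50
months = pd.date_range("2000-01-01", periods=n_periods, freq="MS")
df = pd.DataFrame({
    "caldt": months.repeat(n_entities),
    "beta": rng.uniform(0.0, 1.0, n_periods * n_entities),
})
df["ret"] = 1.0 + 2.0 * df["beta"] + rng.normal(0.0, 0.1, len(df))
df["const"] = 1.0  # explicit intercept column

# One cross-sectional regression per period.
gamma = df.groupby("caldt")[["ret", "const", "beta"]].apply(
    ols_np, "ret", ["const", "beta"]
)
print(gamma.mean())
```

    The resulting gamma frame has one row per period and can be passed to fm_summary exactly as before.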
    

    And if you're still using an older version of pandas (one that still includes the fama_macbeth function), the following will work.

    Here is an example of using the fama_macbeth function in pandas:

    >>> df
    
                    y    x
    date       id
    2012-01-01 1   0.1  0.4
               2   0.3  0.6
               3   0.4  0.2
               4   0.0  1.2
    2012-02-01 1   0.2  0.7
               2   0.4  0.5
               3   0.2  0.1
               4   0.1  0.0
    2012-03-01 1   0.4  0.8
               2   0.6  0.1
               3   0.7  0.6
               4   0.4 -0.1
    

    Notice the structure. The fama_macbeth function expects the y-var and x-vars to have a multi-index with date as the first level and the stock/firm/entity id as the second level of the index:

    >>> fm  = pd.fama_macbeth(y=df['y'],x=df[['x']])
    >>> fm
    
    
    ----------------------Summary of Fama-MacBeth Analysis-------------------------
    
    Formula: Y ~ x + intercept
    # betas :   3
    
    ----------------------Summary of Estimated Coefficients------------------------
         Variable          Beta       Std Err        t-stat       CI 2.5%      CI 97.5%
              (x)       -0.0227        0.1276         -0.18       -0.2728        0.2273
      (intercept)        0.3531        0.0842          4.19        0.1881        0.5181
    
    --------------------------------End of Summary---------------------------------
    

    Note that just printing fm calls fm.summary:

    >>> fm.summary
    
    ----------------------Summary of Fama-MacBeth Analysis-------------------------
    
    Formula: Y ~ x + intercept
    # betas :   3
    
    ----------------------Summary of Estimated Coefficients------------------------
         Variable          Beta       Std Err        t-stat       CI 2.5%      CI 97.5%
              (x)       -0.0227        0.1276         -0.18       -0.2728        0.2273
      (intercept)        0.3531        0.0842          4.19        0.1881        0.5181
    
    --------------------------------End of Summary---------------------------------
    

    Also, note that the fama_macbeth function automatically adds an intercept (unlike the statsmodels routines). The x-var also has to be a dataframe, so if you pass just one column you need to pass it as df[['x']].

    If you don't want an intercept you have to do:

    >>> fm  = pd.fama_macbeth(y=df['y'],x=df[['x']],intercept=False)
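
    Since pd.fama_macbeth no longer exists in current pandas, the summary above can be reproduced by hand. The following is a minimal sketch, not the removed function's actual code: it rebuilds the sample frame, runs one NumPy least-squares fit per date via groupby(level='date'), and averages the coefficients. The helper name cross_section is made up for illustration. Using a population standard deviation (ddof=0) happens to match the Std Err column printed above; that is inferred from the numbers, not from documentation:

```python
import numpy as np
import pandas as pd

# The sample panel from above, indexed by (date, id).
dates = pd.to_datetime(["2012-01-01", "2012-02-01", "2012-03-01"])
index = pd.MultiIndex.from_product([dates, [1, 2, 3, 4]], names=["date", "id"])
df = pd.DataFrame(
    {
        "y": [0.1, 0.3, 0.4, 0.0, 0.2, 0.4, 0.2, 0.1, 0.4, 0.6, 0.7, 0.4],
        "x": [0.4, 0.6, 0.2, 1.2, 0.7, 0.5, 0.1, 0.0, 0.8, 0.1, 0.6, -0.1],
    },
    index=index,
)

def cross_section(g):
    # Regress y on x plus an intercept for a single date.
    X = np.column_stack([g["x"].to_numpy(), np.ones(len(g))])
    coef, _, _, _ = np.linalg.lstsq(X, g["y"].to_numpy(), rcond=None)
    return pd.Series(coef, index=["x", "intercept"])

# One row of coefficients per date.
betas = df.groupby(level="date").apply(cross_section)

summary = pd.DataFrame({"mean": betas.mean()})
# ddof=0 is an assumption chosen to match the printed output above.
summary["std_error"] = betas.std(ddof=0) / np.sqrt(len(betas))
summary["tstat"] = summary["mean"] / summary["std_error"]
print(summary.round(4))
```

    With ddof=1 instead, the means stay the same but the standard errors come out larger, which is what the fm_summary function earlier in this answer would report.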
    
  • 2020-12-09 06:36

    A quick-and-dirty fix that lets you keep using the approach you already had.

    It worked for me.

    def fmreg(data,formula):
        return smf.ols(formula,data=data).fit().params[:]
    
    res = df.groupby('date').apply(fmreg,'ret~var1+var2+var3')
    
  • 2020-12-09 06:58

    EDIT: New Library

    An updated library exists which can be installed via the following command:

    pip install finance-byu
    

    Documentation here: https://fin-library.readthedocs.io/en/latest/

    The new library includes Fama Macbeth regression implementations and a Regtable class that can be helpful for reporting results.

    This page in the documentation outlines the Fama Macbeth functions: https://fin-library.readthedocs.io/en/latest/fama_macbeth.html

    There is an implementation which is very similar to Karl D.'s implementation above using numpy's linear algebra functions, an implementation that uses joblib for parallelization to improve performance when there are a large number of time periods in the data, and an implementation using numba that shaves off an order of magnitude on small data sets.

    Here is an example with a small simulated data set as in the documentation:

    >>> from finance_byu.fama_macbeth import fama_macbeth, fama_macbeth_parallel, fm_summary, fama_macbeth_numba
    >>> import pandas as pd
    >>> import time
    >>> import numpy as np
    >>> 
    >>> n_jobs = 5
    >>> n_firms = 1.0e2
    >>> n_periods = 1.0e2
    >>> 
    >>> def firm(fid):
    ...     f = np.random.random((int(n_periods),4))
    ...     f = pd.DataFrame(f)
    ...     f['period'] = f.index
    ...     f['firmid'] = fid
    ...     return f
    >>> df = [firm(i) for i in range(int(n_firms))]
    >>> df = pd.concat(df).rename(columns={0:'ret',1:'exmkt',2:'smb',3:'hml'})
    >>> df.head()
    
            ret     exmkt       smb       hml  period  firmid
    0  0.766593  0.002390  0.496230  0.992345       0       0
    1  0.346250  0.509880  0.083644  0.732374       1       0
    2  0.787731  0.204211  0.705075  0.313182       2       0
    3  0.904969  0.338722  0.437298  0.669285       3       0
    4  0.121908  0.827623  0.319610  0.455530       4       0
    
    >>> result = fama_macbeth(df,'period','ret',['exmkt','smb','hml'],intercept=True)
    >>> result.head()
    
            intercept     exmkt       smb       hml
    period                                         
    0        0.655784 -0.160938 -0.109336  0.028015
    1        0.455177  0.033941  0.085344  0.013814
    2        0.410705 -0.084130  0.218568  0.016897
    3        0.410537  0.010719  0.208912  0.001029
    4        0.439061  0.046104 -0.084381  0.199775
    
    >>> fm_summary(result)
    
                   mean  std_error      tstat
    intercept  0.506834   0.008793  57.643021
    exmkt      0.004750   0.009828   0.483269
    smb       -0.012702   0.010842  -1.171530
    hml        0.004276   0.010530   0.406119
    
    >>> %timeit fama_macbeth(df,'period','ret',['exmkt','smb','hml'],intercept=True)
    123 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    >>> %timeit fama_macbeth_parallel(df,'period','ret',['exmkt','smb','hml'],intercept=True,n_jobs=n_jobs,memmap=False)  
    146 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    >>> %timeit fama_macbeth_numba(df,'period','ret',['exmkt','smb','hml'],intercept=True)
    5.04 ms ± 5.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Note: Turning off the memmap makes for fair comparison without generating new data at each run. With the memmap, the parallel implementation would simply pull cached results.
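
    Much of the speedup in approaches like these comes from avoiding per-period Python and pandas overhead. As a rough illustration of that idea (this is not finance_byu's implementation, and the function name fm_batched is made up here), a balanced panel with the same number of firms in every period can be reshaped into a 3-D array, and all cross-sectional regressions can then be solved in one batched NumPy call via the normal equations:

```python
import numpy as np

def fm_batched(y, X):
    """Per-period OLS coefficients for a balanced panel.

    y has shape (T, N) and X has shape (T, N, K); returns shape (T, K).
    Assumes every period's X'X is invertible (no rank checks).
    """
    XtX = np.einsum("tnk,tnl->tkl", X, X)   # (T, K, K)
    Xty = np.einsum("tnk,tn->tk", X, y)     # (T, K)
    # Solve all T systems at once; the trailing axis makes b a stack of columns.
    return np.linalg.solve(XtX, Xty[..., None])[..., 0]

# Simulated balanced panel: 100 periods, 100 firms, intercept + 3 factors.
rng = np.random.default_rng(0)
T, N = 100, 100
factors = rng.random((T, N, 3))
X = np.concatenate([np.ones((T, N, 1)), factors], axis=2)
y = rng.random((T, N))

gamma = fm_batched(y, X)

# Reference: one lstsq call per period, for comparison.
gamma_loop = np.stack(
    [np.linalg.lstsq(X[t], y[t], rcond=None)[0] for t in range(T)]
)
print(np.abs(gamma - gamma_loop).max())
```

    An unbalanced panel would not fit this array layout and would need padding or a per-period loop instead.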

    Here are a couple of simple examples of the Regtable class, also using simulated data:

    >>> from finance_byu.regtables import Regtable
    >>> import pandas as pd
    >>> import statsmodels.formula.api as smf
    >>> import numpy as np
    >>> 
    >>> 
    >>> nobs = 1000
    >>> df = pd.DataFrame(np.random.random((nobs,3))).rename(columns={0:'age',1:'bmi',2:'hincome'})
    >>> df['age'] = df['age']*100
    >>> df['bmi'] = df['bmi']*30
    >>> df['hincome'] = df['hincome']*100000
    >>> df['hincome'] = pd.qcut(df['hincome'],16,labels=False)
    >>> df['rich'] = df['hincome'] > 13
    >>> df['gender'] = np.random.choice(['M','F'],nobs)
    >>> df['race'] = np.random.choice(['W','B','H','O'],nobs)
    >>> 
    >>> regformulas =  ['bmi ~ age',
    ...                 'bmi ~ np.log(age)',
    ...                 'bmi ~ C(gender) + np.log(age)',
    ...                 'bmi ~ C(gender) + C(race) + np.log(age)',
    ...                 'bmi ~ C(gender) + rich + C(gender)*rich + C(race) + np.log(age)',
    ...                 'bmi ~ -1 + np.log(age)',
    ...                 'bmi ~ -1 + C(race) + np.log(age)']
    >>> reg = [smf.ols(f,df).fit() for f in regformulas]
    >>> tbl = Regtable(reg)
    >>> tbl.render()
    

    Produces a rendered regression table (the image is not preserved here).

    >>> df2 = pd.DataFrame(np.random.random((nobs,10)))
    >>> df2.columns = ['t0_vw','t4_vw','et_vw','t0_ew','t4_ew','et_ew','mktrf','smb','hml','umd']
    >>> regformulas2 = ['t0_vw ~ mktrf',
    ...                't0_vw ~ mktrf + smb + hml',
    ...                't0_vw ~ mktrf + smb + hml + umd',
    ...                't4_vw ~ mktrf',
    ...                't4_vw ~ mktrf + smb + hml',
    ...                't4_vw ~ mktrf + smb + hml + umd',
    ...                'et_vw ~ mktrf',
    ...                'et_vw ~ mktrf + smb + hml',
    ...                'et_vw ~ mktrf + smb + hml + umd',
    ...                't0_ew ~ mktrf',
    ...                't0_ew ~ mktrf + smb + hml',
    ...                't0_ew ~ mktrf + smb + hml + umd',
    ...                't4_ew ~ mktrf',
    ...                't4_ew ~ mktrf + smb + hml',
    ...                't4_ew ~ mktrf + smb + hml + umd',
    ...                'et_ew ~ mktrf',
    ...                'et_ew ~ mktrf + smb + hml',
    ...                'et_ew ~ mktrf + smb + hml + umd'
    ...                ]
    >>> regnames = ['Small VW','','',
    ...             'Large VW','','',
    ...             'Spread VW','','',
    ...             'Small EW','','',
    ...             'Large EW','','',
    ...             'Spread EW','',''
    ...             ]
    >>> reg2 = [smf.ols(f,df2).fit() for f in regformulas2]
    >>> 
    >>> tbl2 = Regtable(reg2,orientation='horizontal',regnames=regnames,sig='coeff',intercept_name='alpha',nobs=False,rsq=False,stat='se')
    >>> tbl2.render()
    

    Produces a rendered regression table (the image is not preserved here).

    The documentation for the Regtable class is here: https://byu-finance-library-finance-byu.readthedocs.io/en/latest/regtables.html

    These tables can be exported to LaTeX for easy incorporation into writing:

    tbl.to_latex()
    