Fixed effect in Pandas or Statsmodels

一个人的身影 2020-12-01 05:48

Is there an existing function in Pandas or Statsmodels to estimate fixed effects (one-way or two-way)?

There used to be a function in Statsmodels, but it seems to have been discontinued.

2 answers
  • 2020-12-01 06:14

    As noted in the comments, PanelOLS has been removed from Pandas as of version 0.20.0. So you really have three options:

    1. If you use Python 3 you can use linearmodels as specified in the more recent answer: https://stackoverflow.com/a/44836199/3435183

    2. Specify the fixed effects as dummy variables in your statsmodels specification, e.g. using pd.get_dummies. This may not be feasible if the number of fixed effects is large.

    3. Do some groupby-based demeaning and then use statsmodels (this works even if you're estimating lots of fixed effects). Here is a barebones version of what you could do for one-way fixed effects:

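      import statsmodels.api as sm
      import patsy
      
      def areg(formula, data=None, absorb=None, cluster=None):
          """One-way fixed effects OLS: absorb group means, cluster the standard errors."""
          # Build the outcome and design matrices from the formula
          y, X = patsy.dmatrices(formula, data, return_type='dataframe')
      
          # Demean y within each absorbed group, then add back the grand mean
          ybar = y.mean()
          y = y - y.groupby(data[absorb]).transform('mean') + ybar
      
          # Same transform for the regressors (the intercept column stays at 1)
          Xbar = X.mean()
          X = X - X.groupby(data[absorb]).transform('mean') + Xbar
      
          reg = sm.OLS(y, X)
          # Account for df loss from FE transform
          reg.df_resid -= (data[absorb].nunique() - 1)
      
          # Cluster-robust covariance, grouped by the `cluster` column
          return reg.fit(cov_type='cluster', cov_kwds={'groups': data[cluster].values})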
      

    For example, suppose you have a panel of stock data: returns and other stock data for all stocks, every month, over a number of months. You want to regress returns on lagged returns with calendar-month fixed effects (where the calendar month variable is called caldt), and you also want to cluster the standard errors by calendar month. You can estimate such a fixed effects model with the following:

    reg0 = areg('ret~retlag',data=df,absorb='caldt',cluster='caldt')
    

    And here is what you can do if you are using an older version of Pandas (before 0.20.0):

    An example with time fixed effects using pandas' PanelOLS, which lives in the plm module. Note the import of PanelOLS:

    >>> from pandas.stats.plm import PanelOLS
    >>> df
    
                    y    x
    date       id
    2012-01-01 1   0.1  0.2
               2   0.3  0.5
               3   0.4  0.8
               4   0.0  0.2
    2012-02-01 1   0.2  0.7 
               2   0.4  0.5
               3   0.2  0.3
               4   0.1  0.1
    2012-03-01 1   0.6  0.9
               2   0.7  0.5
               3   0.9  0.6
               4   0.4  0.5
    

    Note that the dataframe must have a MultiIndex set; PanelOLS determines the time and entity effects based on the index:

    >>> reg  = PanelOLS(y=df['y'],x=df[['x']],time_effects=True)
    >>> reg
    
    -------------------------Summary of Regression Analysis-------------------------
    
    Formula: Y ~ <x>
    
    Number of Observations:         12
    Number of Degrees of Freedom:   4
    
    R-squared:         0.2729
    Adj R-squared:     0.0002
    
    Rmse:              0.1588
    
    F-stat (1, 8):     1.0007, p-value:     0.3464
    
    Degrees of Freedom: model 3, resid 8
    
    -----------------------Summary of Estimated Coefficients------------------------
          Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
    --------------------------------------------------------------------------------
                 x     0.3694     0.2132       1.73     0.1214    -0.0485     0.7872
    ---------------------------------End of Summary--------------------------------- 
    

    Docstring:

    PanelOLS(self, y, x, weights = None, intercept = True, nw_lags = None,
    entity_effects = False, time_effects = False, x_effects = None,
    cluster = None, dropped_dummies = None, verbose = False,
    nw_overlap = False)
    
    Implements panel OLS.
    
    See ols function docs
    

    This is another function (like fama_macbeth) where I believe the plan is to move this functionality to statsmodels.
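
On a current pandas, where pandas.stats.plm is no longer available, the coefficient on x in the summary above can be reproduced by hand. This is a minimal sketch that demeans y and x within each date (the time effects) and computes the single-regressor slope directly; it re-creates only the point estimate, not the standard errors or the rest of the summary.

```python
import pandas as pd

# Same 12 observations as the panel printed above.
dates = pd.to_datetime(["2012-01-01", "2012-02-01", "2012-03-01"])
index = pd.MultiIndex.from_product([dates, [1, 2, 3, 4]], names=["date", "id"])
df = pd.DataFrame(
    {
        "y": [0.1, 0.3, 0.4, 0.0, 0.2, 0.4, 0.2, 0.1, 0.6, 0.7, 0.9, 0.4],
        "x": [0.2, 0.5, 0.8, 0.2, 0.7, 0.5, 0.3, 0.1, 0.9, 0.5, 0.6, 0.5],
    },
    index=index,
)

# Time fixed effects: subtract each date's mean from y and x.
demeaned = df - df.groupby(level="date").transform("mean")

# With a single regressor, the within slope is sum(x*y) / sum(x*x).
beta = float((demeaned["x"] * demeaned["y"]).sum() / (demeaned["x"] ** 2).sum())
print(round(beta, 4))  # 0.3694, matching the coefficient on x above
```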

  • 2020-12-01 06:14

    There is a package called linearmodels (https://pypi.org/project/linearmodels/) that has a fairly complete fixed effects and random effects implementation including clustered standard errors. It does not use high-dimensional OLS to eliminate effects and so can be used with large data sets.

    import numpy as np
    import pandas as pd
    
    # Outer is entity, inner is time
    entity = list(map(chr,range(65,91)))
    time = list(pd.date_range('1-1-2014',freq='A', periods=4))
    index = pd.MultiIndex.from_product([entity, time])
    df = pd.DataFrame(np.random.randn(26*4, 2),index=index, columns=['y','x'])
    
    from linearmodels.panel import PanelOLS
    mod = PanelOLS(df.y, df.x, entity_effects=True)
    res = mod.fit(cov_type='clustered', cluster_entity=True)
    print(res)
    

    This produces the following output:

                              PanelOLS Estimation Summary                           
    ================================================================================
    Dep. Variable:                      y   R-squared:                        0.0029
    Estimator:                   PanelOLS   R-squared (Between):             -0.0109
    No. Observations:                 104   R-squared (Within):               0.0029
    Date:                Thu, Jun 29 2017   R-squared (Overall):             -0.0007
    Time:                        23:52:28   Log-likelihood                   -125.69
    Cov. Estimator:             Clustered                                           
                                            F-statistic:                      0.2256
    Entities:                          26   P-value                           0.6362
    Avg Obs:                       4.0000   Distribution:                    F(1,77)
    Min Obs:                       4.0000                                           
    Max Obs:                       4.0000   F-statistic (robust):             0.1784
                                            P-value                           0.6739
    Time periods:                       4   Distribution:                    F(1,77)
    Avg Obs:                       26.000                                           
    Min Obs:                       26.000                                           
    Max Obs:                       26.000                                           
    
                                 Parameter Estimates                              
    ==============================================================================
                Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
    ------------------------------------------------------------------------------
    x              0.0573     0.1356     0.4224     0.6739     -0.2127      0.3273
    ==============================================================================
    
    F-test for Poolability: 1.0903
    P-value: 0.3739
    Distribution: F(25,77)
    
    Included effects: Entity
    

    It also has a formula interface similar to the one in statsmodels:

    mod = PanelOLS.from_formula('y ~ x + EntityEffects', df)
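
Since the question also asks about two-way effects, here is a rough sketch of how entity and time effects can both be removed by demeaning instead of by building dummy columns. It is not linearmodels' actual implementation: the helper two_way_demean and the synthetic data are assumptions for illustration, and the simple double-demeaning formula is only valid for a balanced panel.

```python
import numpy as np
import pandas as pd

# Balanced synthetic panel (hypothetical data): 26 entities x 4 periods.
rng = np.random.default_rng(1)
entity = [chr(c) for c in range(65, 91)]
time = list(range(2014, 2018))
index = pd.MultiIndex.from_product([entity, time], names=["entity", "time"])
df = pd.DataFrame(rng.normal(size=(len(index), 2)), index=index, columns=["y", "x"])

def two_way_demean(s):
    # Balanced panels only: s_it - entity mean - time mean + grand mean.
    return (s
            - s.groupby(level="entity").transform("mean")
            - s.groupby(level="time").transform("mean")
            + s.mean())

y_w, x_w = two_way_demean(df["y"]), two_way_demean(df["x"])
beta_within = float((x_w * y_w).sum() / (x_w ** 2).sum())

# Cross-check against explicit entity and time dummies.
labels = df.index.to_frame(index=False)
D = pd.get_dummies(labels, columns=["entity", "time"], drop_first=True, dtype=float)
X = np.column_stack([np.ones(len(df)), df["x"].to_numpy(), D.to_numpy()])
coef, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)
beta_lsdv = float(coef[1])
print(beta_within, beta_lsdv)  # equal up to rounding error
```

The dummy matrix D has one column per entity and per period (minus one each), so it grows with the panel, while the demeaned version only ever works with the original columns.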
    