Ignoring missing values in multiple OLS regression with statsmodels

小鲜肉 2021-02-13 03:35

I'm trying to run a multiple OLS regression using statsmodels and a pandas dataframe. There are missing values in different columns for different rows, and I keep getting the e

2 answers
  • 2021-02-13 04:23

    The answer from jseabold works very well, but it may not be enough if you want to do some computation on the predicted and true values, e.g. if you want to use the function mean_squared_error. In that case, it may be better to get rid of the NaN values for good:

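    import pandas as pd
    import statsmodels.formula.api as smf  # lowercase ols lives in the formula API

    df = pd.read_csv('cl_030314.csv')
    df_cleaned = df.dropna()  # drop every row that has a NaN in any column
    results = smf.ols(formula="da ~ cfo + rm_proxy + cpi + year", data=df_cleaned).fit()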
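
To make the point about predicted versus true values concrete, here is a minimal sketch. The CSV from the question is not available, so the dataframe below is synthetic and only borrows the column names; the NaN positions and coefficients are made up for illustration. The error is computed with plain NumPy, on the assumption that sklearn's mean_squared_error would be given the same two series.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the question's CSV (assumption: the real data is not available).
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "cfo": rng.normal(size=n),
    "rm_proxy": rng.normal(size=n),
    "cpi": rng.normal(size=n),
    "year": rng.integers(2000, 2010, size=n).astype(float),
})
df["da"] = 0.5 * df["cfo"] - 0.2 * df["rm_proxy"] + rng.normal(scale=0.1, size=n)

# Missing values in different columns for different rows.
df.loc[3, "cfo"] = np.nan
df.loc[7, "cpi"] = np.nan
df.loc[11, "da"] = np.nan

# Remove incomplete rows once, so the true values and the predictions line up.
df_cleaned = df.dropna()
results = smf.ols("da ~ cfo + rm_proxy + cpi + year", data=df_cleaned).fit()

y_true = df_cleaned["da"]
y_pred = results.fittedvalues  # one prediction per row of df_cleaned

# Mean squared error over the rows that were actually used in the fit.
mse = float(np.mean((y_true - y_pred) ** 2))
print(len(df_cleaned), mse)
```

If the file has columns that the formula does not use, `df.dropna(subset=[...])` with the model's column names keeps rows that are only incomplete in those unused columns.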
    
  • 2021-02-13 04:31

    You answered your own question. Just pass

    missing = 'drop'
    

    to ols

    import statsmodels.formula.api as smf
    ...
    results = smf.ols(formula = "da ~ cfo + rm_proxy + cpi + year", 
                     data=df, missing='drop').fit()
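
As a quick sanity check, a sketch like the following can confirm that rows with missing values are being dropped. The data is synthetic and the column names `y`, `x1`, `x2` are placeholders, not the ones from the question.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with placeholder column names (assumption, for illustration only).
rng = np.random.default_rng(1)
n = 40
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df["x1"] - df["x2"] + rng.normal(scale=0.1, size=n)

# Three rows, each incomplete in a different column.
df.loc[0, "x1"] = np.nan
df.loc[5, "x2"] = np.nan
df.loc[9, "y"] = np.nan

# Let statsmodels drop the incomplete rows itself.
res_drop = smf.ols("y ~ x1 + x2", data=df, missing="drop").fit()

# Reference fit on a dataframe cleaned by hand.
res_clean = smf.ols("y ~ x1 + x2", data=df.dropna()).fit()

n_used = int(res_drop.nobs)  # number of rows that entered the regression
same_fit = bool(np.allclose(res_drop.params, res_clean.params))
print(n_used, same_fit)
```

Comparing `nobs` against the number of complete rows is an easy way to see how many observations the model actually used.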
    

    If this doesn't work, then it's a bug; please report it with an MWE on GitHub.

    FYI, note the import above. Not everything is available in the formula.api namespace, so you should keep it separate from statsmodels.api. Or just use

    import statsmodels.api as sm
    sm.formula.ols(...)
    