Using categorical variables in statsmodels OLS class

日久生厌 2021-01-21 12:57

I want to use the statsmodels OLS class to create a multiple regression model. Consider the following dataset:

import statsmodels.api as sm
import pand         


        
2 Answers
  •  说谎 (original poster)
    2021-01-21 13:45

    I had this problem as well. I have many columns that need to be treated as categorical, which makes creating dummy variables by hand quite tedious. Converting the columns to strings did not work for me either.

    For anyone looking for a solution that does not require one-hot encoding the data manually, the R-style formula interface provides a nice way of doing this:

    import statsmodels.formula.api as smf
    import pandas as pd
    import numpy as np
    
    # Use a name other than `dict` so the built-in type is not shadowed
    data = {'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
            'debt_ratio': np.random.randn(5), 'cash_flow': np.random.randn(5) + 90}
    
    df = pd.DataFrame.from_dict(data)
    
    # x and y are not needed by the formula interface; shown only for reference
    x = df[['debt_ratio', 'industry']]
    y = df['cash_flow']
    
    # NB. Unlike sm.OLS, the formula interface includes an intercept term automatically
    smf.ols(formula="cash_flow ~ debt_ratio + C(industry)", data=df).fit()
    

    Reference: https://www.statsmodels.org/stable/example_formulas.html#categorical-variables
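To see what C(industry) actually produces, one can inspect the fitted parameter names. The sketch below reuses invented toy data (not the data from the question) and assumes the patsy-based formula handling used by current stable statsmodels releases. It also shows one way to pick the reference level with Treatment(reference=...).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: 3 industries with 10 rows each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "industry": np.repeat(["mining", "finance", "hospitality"], 10),
    "debt_ratio": rng.normal(size=30),
})
df["cash_flow"] = 90 + 2.0 * df["debt_ratio"] + rng.normal(size=30)

# Default treatment coding: one level is absorbed into the intercept and the
# remaining levels each get their own coefficient.
fit_default = smf.ols("cash_flow ~ debt_ratio + C(industry)", data=df).fit()
print(fit_default.params.index.tolist())

# Choose 'mining' as the baseline level explicitly.
fit_mining = smf.ols(
    "cash_flow ~ debt_ratio + C(industry, Treatment(reference='mining'))",
    data=df,
).fit()
print(fit_mining.params.index.tolist())
```

Changing the reference level only changes how the coefficients are expressed; the fitted values of the two models are the same.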
