I want to use the statsmodels OLS class to create a multiple regression model. Consider the following dataset:
import statsmodels.api as sm
import pand
I had this problem as well. I have lots of columns that need to be treated as categorical, which makes creating dummy variables for all of them quite annoying. And converting them to strings doesn't work for me.
For anyone looking for a solution without one-hot encoding the data, the R-style formula interface provides a nice way of doing this:
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
# Named `data` rather than `dict` to avoid shadowing the built-in
data = {'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
        'debt_ratio': np.random.randn(5), 'cash_flow': np.random.randn(5) + 90}
df = pd.DataFrame.from_dict(data)
x = df[['debt_ratio', 'industry']]
y = df['cash_flow']
# NB: unlike sm.OLS, the formula interface includes an intercept term automatically
smf.ols(formula="cash_flow ~ debt_ratio + C(industry)", data=df).fit()
Reference: https://www.statsmodels.org/stable/example_formulas.html#categorical-variables
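To see what `C(industry)` actually does, it can help to look at the names of the fitted parameters. The following is only a sketch: the seeded random data and the four rows per industry are my own assumptions (so that the model has more observations than parameters), not part of the original example. With the default treatment coding, one level is used as the baseline and each remaining level gets its own coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative data (assumed): 5 industries, 4 observations each
rng = np.random.default_rng(0)
industries = ['mining', 'transportation', 'hospitality', 'finance', 'entertainment']
df = pd.DataFrame({
    'industry': np.repeat(industries, 4),
    'debt_ratio': rng.standard_normal(20),
    'cash_flow': rng.standard_normal(20) + 90,
})

# C(...) marks the column as categorical; the formula API builds the dummies itself
result = smf.ols(formula="cash_flow ~ debt_ratio + C(industry)", data=df).fit()

# One intercept, one coefficient per non-baseline industry, one for debt_ratio
param_names = list(result.params.index)
print(param_names)
```

The baseline level (by default the first in sorted order, here `entertainment`) does not appear among the parameter names because it is absorbed into the intercept. `result.summary()` prints the full regression table if you want standard errors and p-values as well.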