python statsmodels linear regression

早过忘川 submitted on 2019-12-24 06:23:14


I am trying to build a linear regression model from pre-project data, then use it to calculate modeled values for the post-project period so that I can compare pre- and post-project data. Can anyone tell me what the best practice is? Otherwise I may be off in the weeds somewhere.

For starters:

import statsmodels.api as sm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

ng = pd.read_csv('C:/Users/ngDataBaseline.csv',  thousands=',', index_col='Date', parse_dates=True)

This will output:

            HDD  Therm
2011-10-01  386    498
2011-11-01  663   1810
2011-12-01  972   4263
2012-01-01 1131   5981
2012-02-01  977   6951

To fit the model with statsmodels, I am using:

import statsmodels.formula.api as smf

formula = 'Therm ~ HDD'
model = smf.ols(formula, data=ng)
results = model.fit()

inter = results.params['Intercept']
slope = results.params['HDD']
inter, slope


(-532.6244255918659, 6.331883644532255)

So now I think I can import the post-project data and apply the fitted line, Y = mX + b, with m = slope and b = inter, to calculate the modeled data:

ng_postproject = pd.read_csv('C:/Users/ng_postproject.csv',  thousands=',', index_col='Date', parse_dates=True)


And this will output:

            HDD  Therm
2014-10-01  291    663
2014-11-01  545   1413
2014-12-01 1069   6754
2015-01-01 1134   7782
2015-02-01 1415  10285

This is what I am using to calculate the modeled Therm usage from the post-project HDD values:

ng_postproject['Therm_modeled'] = ng_postproject['HDD'].apply(lambda x: x * slope + inter)


2014-10-01    1309.953715
2014-11-01    2918.252161
2014-12-01    6236.159190
2015-01-01    6647.731627
2015-02-01    8426.990931
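The same modeled values can probably be obtained without pulling the slope and intercept out by hand, since a fitted formula model can be asked to predict directly from a new DataFrame. The sketch below assumes only the five rows shown for each period (the real CSV files are presumably longer, so the coefficients here will not match the ones quoted above) and checks that `results.predict` agrees with the manual Y = mX + b calculation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Assumption: only the five baseline rows shown in the question are used here.
# The full baseline file likely has more rows, so the fitted coefficients differ.
ng = pd.DataFrame(
    {"HDD": [386, 663, 972, 1131, 977],
     "Therm": [498, 1810, 4263, 5981, 6951]},
    index=pd.to_datetime(["2011-10-01", "2011-11-01", "2011-12-01",
                          "2012-01-01", "2012-02-01"]),
)
ng_postproject = pd.DataFrame(
    {"HDD": [291, 545, 1069, 1134, 1415],
     "Therm": [663, 1413, 6754, 7782, 10285]},
    index=pd.to_datetime(["2014-10-01", "2014-11-01", "2014-12-01",
                          "2015-01-01", "2015-02-01"]),
)

# Fit the baseline model on pre-project data.
results = smf.ols("Therm ~ HDD", data=ng).fit()

# predict() applies the formula to the new frame; only the HDD column is used.
ng_postproject["Therm_modeled"] = results.predict(ng_postproject)

# The same numbers computed by hand as slope * HDD + intercept.
manual = (ng_postproject["HDD"] * results.params["HDD"]
          + results.params["Intercept"])

print(ng_postproject)
```

Using `predict` keeps the calculation tied to the fitted model, so the column stays correct if the formula later gains more terms.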

If I am not too far off in the weeds, I should now be able to add a column header to the modeled values and compare the modeled (baseline) usage against the actual post-project usage. It would also be really nice if I could add a confidence interval. Thanks for any response.
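One possible way to get intervals is the `get_prediction` method of the fitted results, whose `summary_frame` reports both a confidence interval for the mean and a wider prediction interval for individual observations. This is a sketch under assumptions: it reuses only the five rows shown for each period, and the column names `PI_lower`, `PI_upper`, and `Difference` are made up for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumption: the five rows shown in the question stand in for the full files.
ng = pd.DataFrame({"HDD": [386, 663, 972, 1131, 977],
                   "Therm": [498, 1810, 4263, 5981, 6951]})
ng_postproject = pd.DataFrame(
    {"HDD": [291, 545, 1069, 1134, 1415],
     "Therm": [663, 1413, 6754, 7782, 10285]},
    index=pd.to_datetime(["2014-10-01", "2014-11-01", "2014-12-01",
                          "2015-01-01", "2015-02-01"]),
)

results = smf.ols("Therm ~ HDD", data=ng).fit()

# alpha=0.05 requests 95% intervals.
pred = results.get_prediction(ng_postproject[["HDD"]])
frame = pred.summary_frame(alpha=0.05)

# "mean" is the modeled value; obs_ci_* is the prediction interval for a
# single new observation, mean_ci_* the narrower interval for the mean.
ng_postproject["Therm_modeled"] = frame["mean"].to_numpy()
ng_postproject["PI_lower"] = frame["obs_ci_lower"].to_numpy()
ng_postproject["PI_upper"] = frame["obs_ci_upper"].to_numpy()

# Hypothetical comparison column: actual minus modeled usage.
ng_postproject["Difference"] = (ng_postproject["Therm"]
                                - ng_postproject["Therm_modeled"])

print(ng_postproject)
```

For judging whether a single post-project month differs from the baseline, the prediction interval (`obs_ci_*`) is usually the more relevant of the two, because it includes the scatter of individual months around the fitted line.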

