Question
I'm doing logistic regression using pandas 0.11.0 (data handling) and statsmodels 0.4.3 to do the actual regression, on Mac OS X Lion.
I'm going to be running ~2,900 different logistic regression models and need the results output to csv file and formatted in a particular way.
Currently, I'm only aware of doing print result.summary(), which prints the results (as follows) to the shell:
                           Logit Regression Results
==============================================================================
Dep. Variable:            death_death   No. Observations:                 9752
Model:                          Logit   Df Residuals:                     9747
Method:                           MLE   Df Model:                            4
Date:                Wed, 22 May 2013   Pseudo R-squ.:                -0.02672
Time:                        22:15:05   Log-Likelihood:                -5806.9
converged:                       True   LL-Null:                       -5655.8
                                        LLR p-value:                     1.000
===============================================================================
                 coef    std err          z      P>|z|       [95.0% Conf. Int.]
-------------------------------------------------------------------------------
age_age5064   -0.1999      0.055     -3.619      0.000         -0.308    -0.092
age_age6574   -0.2553      0.053     -4.847      0.000         -0.359    -0.152
sex_female    -0.2515      0.044     -5.765      0.000         -0.337    -0.166
stage_early   -0.1838      0.041     -4.528      0.000         -0.263    -0.104
access        -0.0102      0.001    -16.381      0.000         -0.011    -0.009
===============================================================================
I will also need the odds ratio, which is computed by print np.exp(result.params), and is printed in the shell as such:
age_age5064 0.818842
age_age6574 0.774648
sex_female 0.777667
stage_early 0.832098
access 0.989859
dtype: float64
What I need is for each of these to be written to a csv file in the form of a very long row like the following (I am not sure, at this point, whether I will need things like Log-Likelihood, but have included it for the sake of thoroughness):
`Log-Likelihood, age_age5064_coef, age_age5064_std_err, age_age5064_z, age_age5064_p>|z|,...age_age6574_coef, age_age6574_std_err, ......access_coef, access_std_err, ....age_age5064_odds_ratio, age_age6574_odds_ratio, ...sex_female_odds_ratio,.....access_odds_ratio`
I think you get the picture - a very long row, with all of these actual values, and a header with all the column designations in a similar format.
I am familiar with the csv module in Python, and am becoming more familiar with pandas. I am not sure whether this info could be formatted and stored in a pandas dataframe and then written, using to_csv, to a file once all ~2,900 logistic regression models have completed; that would certainly be fine. Writing them as each model is completed is also fine (using the csv module).
UPDATE:
So, I was looking more at statsmodels site, specifically trying to figure out how the results of a model are stored within classes. It looks like there is a class called 'Results', which will need to be used. I think using inheritance from this class to create another class, where some of the methods/operators get changed might be the way to go, in order to get the formatting I require. I have very little experience in the ways of doing this, and will need to spend quite a bit of time figuring this out (which is fine). If anybody can help/has more experience that would be awesome!
Here is the site where the classes are laid out: statsmodels results class
Answer 1:
There is no premade table of parameters and their result statistics currently available.
Essentially you need to stack all the results yourself; whether you use a list, numpy array or pandas DataFrame depends on what's more convenient for you.
For example, if I want one numpy array per model that holds llf and the values shown in the summary parameter table, then I could use
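import numpy as np

res_all = []
for res in results:  # results: an iterable of fitted model results
    # conf_int() has one (lower, upper) pair per parameter; transpose to unpack columns
    low, upp = np.asarray(res.conf_int()).T
    res_all.append(np.concatenate(([res.llf], res.params, res.tvalues,
                                   res.pvalues, low, upp)))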
But it might be better to align with pandas, depending on what structure you have across models.
You could write a helper function that takes all the results from the results instance and concatenates them in a row.
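A minimal sketch of such a helper, assuming the model was fit on pandas data so that `params`, `bse`, `tvalues` and `pvalues` are Series and `conf_int()` returns a DataFrame with columns 0 and 1. The name `result_to_row` and the column labels are made up for illustration; the odds ratios are `np.exp(params)`, as in the question.

```python
import numpy as np
import pandas as pd


def result_to_row(res):
    """Flatten one fitted result into a dict: one labeled value per column.

    Hypothetical helper; assumes pandas-backed results (Series / DataFrame).
    """
    ci = res.conf_int()  # columns 0 (lower bound) and 1 (upper bound)
    stats = {
        "coef": res.params,
        "std_err": res.bse,
        "z": res.tvalues,
        "p>|z|": res.pvalues,
        "ci_lower": ci[0],
        "ci_upper": ci[1],
    }
    row = {"Log-Likelihood": res.llf}
    # Per-variable statistics first: age_age5064_coef, age_age5064_std_err, ...
    for name in res.params.index:
        for stat, values in stats.items():
            row[f"{name}_{stat}"] = values[name]
    # Odds ratios grouped at the end of the row.
    for name, value in np.exp(res.params).items():
        row[f"{name}_odds_ratio"] = value
    return row


# Usage, assuming `results` is a list of fitted results:
# pd.DataFrame([result_to_row(r) for r in results]).to_csv("all_models.csv", index=False)
```

If the models do not all share the same variables, the DataFrame constructor takes the union of the labels and fills the missing cells with NaN.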
(I'm not sure what's the most convenient for writing to csv by rows)
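For writing by rows as each model finishes, one option is the standard library's `csv.DictWriter` in append mode. This is only a sketch: `append_row` is a hypothetical helper, and it assumes every model yields the same keys in the same order, since the header is written once from the first row.

```python
import csv
import os


def append_row(path, row):
    """Append one dict as a CSV row; write the header if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if need_header:
            writer.writeheader()
        writer.writerow(row)
```

Appending after each fit means a crash partway through the ~2,900 models does not lose the rows already written.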
edit:
Here is an example storing the regression results in a dataframe
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/multilinear.py#L21
the loop is on line 159.
summary() and similar code outside of statsmodels (for example http://johnbeieler.org/py_apsrtable/ for combining several results) are oriented towards printing the results rather than storing the values.
Answer 2:
I found this formulation to be a little more straightforward. You can add/subtract columns by following the syntax from the examples (pvals,coeff,conf_lower,conf_higher).
import pandas as pd  # This can be left out if already imported

def results_summary_to_dataframe(results):
    '''Take a fitted statsmodels results object and return its parameter table as a dataframe.'''
    pvals = results.pvalues
    coeff = results.params
    conf_lower = results.conf_int()[0]
    conf_higher = results.conf_int()[1]

    results_df = pd.DataFrame({"pvals": pvals,
                               "coeff": coeff,
                               "conf_lower": conf_lower,
                               "conf_higher": conf_higher
                               })

    # Reordering...
    results_df = results_df[["coeff", "pvals", "conf_lower", "conf_higher"]]
    return results_df
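With many models, the per-model tables from a function like this can be stacked into one long table keyed by model. A rough sketch, where `combine_summaries` and the level names `model` and `variable` are my own choices rather than anything from statsmodels:

```python
import pandas as pd


def combine_summaries(summaries):
    """Stack per-model tables into one DataFrame indexed by (model, variable).

    `summaries` maps a model label to a DataFrame indexed by variable name.
    """
    return pd.concat(summaries, names=["model", "variable"])


# Usage, assuming `fitted` maps labels to fitted results:
# tables = {label: results_summary_to_dataframe(res) for label, res in fitted.items()}
# combine_summaries(tables).to_csv("all_models_long.csv")
```

This gives a long layout (one line per model and variable) instead of the single wide row asked for in the question, which tends to be easier to filter afterwards.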
Answer 3:
- results.params : for the coefficients
- results.pvalues : for the p-values
BTW, you can use dir(results) to list all the attributes of the results object.
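Since `dir()` also returns private and dunder names, a small filter makes the list easier to scan. A sketch with a hypothetical helper name:

```python
def public_attributes(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# Usage: print(public_attributes(results))
```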
Answer 4:
write_path = '/my/path/here/output.csv'
with open(write_path, 'w') as f:
    f.write(result.summary().as_csv())
Answer 5:
There is actually a built-in method for this, described in the statsmodels documentation:
f = open('csvfile.csv','w')
f.write(result.summary().as_csv())
f.close()
I believe this is a much easier (and cleaner) way to output the summaries to csv files.
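For a few thousand models, the same call can be wrapped in a loop that writes one file per model. A sketch only: `write_summaries` and the file naming are assumptions, and as far as I can tell `as_csv()` keeps the layout of the printed summary tables, so this does not produce the single long row per model described in the question.

```python
from pathlib import Path


def write_summaries(results_by_name, out_dir):
    """Write result.summary().as_csv() for each model to <out_dir>/<name>.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, result in results_by_name.items():
        path = out_dir / f"{name}.csv"
        path.write_text(result.summary().as_csv())
        paths.append(path)
    return paths
```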
Source: https://stackoverflow.com/questions/16705598/python-2-7-statsmodels-formatting-and-writing-summary-output