Question
I am having a lot of difficulty using the statsmodels.formula.api function
ols(formula,data).fit().rsquared_adj
because of the names of my predictors. The predictors have numbers, spaces, etc. in them, which the formula parser clearly doesn't like. I understand that I need to use something like patsy.builtins.Q. So let's say my predictor is weight.in.kg; it should be entered as follows:
Q("weight.in.kg")
I need to build my formula from a list, and the difficulty is wrapping every item in the list with patsy.builtins.Q:
formula = "{} ~ {} + 1".format(response, ' + '.join([candidate]))
with [candidate] being my list of predictors.
My question to you, dearest python experts, is how on earth do I put every individual item in the list [candidate] within the quotes in the following expression:
Q('')
so that the ols function can actually read it? Apologies if this is super obvious, me no good at python.
Answer 1:
Right now you're starting with a list of terms that you want in your formula, then trying to paste them together into a complicated string, which patsy will parse and convert back into a list of terms. You can see the data structure that patsy generates for this kind of formula (ModelDesc.from_formula is patsy's parser):
In [7]: from patsy import ModelDesc
In [8]: ModelDesc.from_formula("y ~ x1 + x2 + x3")
Out[8]:
ModelDesc(lhs_termlist=[Term([EvalFactor('y')])],
rhs_termlist=[Term([]),
Term([EvalFactor('x1')]),
Term([EvalFactor('x2')]),
Term([EvalFactor('x3')])])
This might look a little intimidating, but it's pretty simple really: you have a ModelDesc, which represents a single formula, and it has a left-hand-side list of terms and a right-hand-side list of terms. Each term is represented by a Term object, and each Term has a list of factors. (Here each term just has a single factor; if you had any interactions then those terms would have multiple factors.) Also, the "empty interaction" Term([]) is how patsy represents the intercept term.
So you can avoid all this complicated quoting/parsing stuff by directly creating the terms you want and passing them to patsy, skipping the string-parsing step:
from patsy import ModelDesc, Term, LookupFactor
response_terms = [Term([LookupFactor(response)])]
# start with intercept...
model_terms = [Term([])]
# ...then add another term for each candidate
model_terms += [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)
Now you can pass that model_desc object into any function where you'd normally pass a patsy formula:
ols(model_desc, data).fit().rsquared_adj
There's another trick here: you'll notice that the first example has EvalFactor objects, and now we're using LookupFactor objects instead. The difference is that EvalFactor takes a string of arbitrary Python code, which is nice if you want to write something like np.log(x1), but really annoying if you have variables with names like weight.in.kg. LookupFactor directly takes the name of a variable to look up in your data, so no further quoting is needed.
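To see the LookupFactor approach end to end, here is a minimal sketch that builds the design matrices directly with patsy's dmatrices. The DataFrame, its column names, and its values are made up for illustration; only ModelDesc, Term, LookupFactor, and dmatrices come from patsy.

```python
import pandas as pd
from patsy import ModelDesc, Term, LookupFactor, dmatrices

# Hypothetical data: column names that would break a plain formula string.
data = pd.DataFrame({
    "body fat %": [12.0, 18.5, 22.1, 15.3],
    "weight.in.kg": [70.0, 82.5, 95.0, 77.2],
    "height (cm)": [180.0, 175.0, 168.0, 183.0],
})
response = "body fat %"
candidates = ["weight.in.kg", "height (cm)"]

# Left-hand side: a single term that looks the response column up by name.
response_terms = [Term([LookupFactor(response)])]
# Right-hand side: the intercept (empty term) plus one term per candidate.
model_terms = [Term([])] + [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)

# patsy accepts the ModelDesc wherever it accepts a formula string.
y, X = dmatrices(model_desc, data, return_type="dataframe")
print(list(X.columns))
```

The design matrix columns keep the original names, with no Q(...) wrapper and no escaping involved.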
Alternatively, you could do this with some fancier Python string processing, like:
quoted = ["Q('{}')".format(c) for c in candidates]
formula = "{} ~ {} + 1".format(response, ' + '.join(quoted))
But while this is a bit simpler to start with, it's much more fragile -- for example, think about (or try) what happens if one of your parameters contains a quote character! You should never write something like this in a processing pipeline where the candidate names come from somewhere else that you can't control (e.g. a random CSV file) -- you could get all kinds of arbitrary code execution. The solution above avoids all of these problems.
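If you do stay with the string approach, one way to reduce the quoting problem is to let Python's repr() produce the string literal instead of pasting the name between hard-coded quotes. The sketch below is only an illustration: quote_term and is_plain_q_call are hypothetical helper names, the candidate names are invented, and the check uses the standard ast module rather than patsy itself. It assumes patsy will tokenize the resulting literal like ordinary Python, and it does not replace the LookupFactor solution for untrusted input.

```python
import ast

def quote_term(name):
    """Wrap a column name in Q(...); repr() escapes quotes and backslashes."""
    return "Q({!r})".format(name)

def is_plain_q_call(expr, name):
    """True if expr parses as exactly Q(<string literal equal to name>)."""
    try:
        node = ast.parse(expr, mode="eval").body
    except SyntaxError:
        return False
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "Q"
        and len(node.args) == 1
        and not node.keywords
        and isinstance(node.args[0], ast.Constant)
        and node.args[0].value == name
    )

response = "y"
# Hypothetical names, including one that tries to smuggle in extra code.
candidates = ["weight.in.kg", "it's", "x') + evil() + Q('y"]

quoted = [quote_term(c) for c in candidates]
formula = "{} ~ {} + 1".format(response, " + ".join(quoted))

# The naive "Q('{}')" pattern breaks on the second and third names.
naive_ok = [is_plain_q_call("Q('{}')".format(c), c) for c in candidates]
# The repr()-based version yields one well-formed Q(...) call per name.
checks = [is_plain_q_call(q, c) for q, c in zip(quoted, candidates)]
```

Here naive_ok shows which names survive the hard-coded quotes, while checks confirms that every repr()-quoted term is a single Q call on the intended name.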
Reference:
- https://patsy.readthedocs.io/en/latest/expert-model-specification.html
- https://patsy.readthedocs.io/en/latest/formulas.html
Source: https://stackoverflow.com/questions/38149482/using-ols-function-with-parameters-that-contain-numbers-spaces