Using ols function with parameters that contain numbers/spaces

寵の児 提交于 2021-02-07 18:14:32

问题


I am having a lot of difficulty using the statsmodels.formula.api function

       ols(formula,data).fit().rsquared_adj 

due to the nature of the names of my predictors. The predictors have numbers and spaces etc in them which it clearly doesn't like. I understand that I need to use something like patsy.builtins.Q So lets say my predictor would be weight.in.kg , it should be entered as follows:

Q("weight.in.kg")

so I need to take my formula from a list, and the difficulty arises in modifying every item in the list with this patsy.builtin.Q

formula = "{} ~ {} + 1".format(response, ' + '.join([candidate])

with [candidate] being my list of predictors.

My question to you, dearest python experts, is how on earth do I put every individual item in the list [candidate] within the quotes in the following expression:

Q('')

so that the ols function can actually read it? Apologies if this is super obvious, me no good at python.


回答1:


Right now you're starting with a list of terms that you want in your formula, then trying to paste them together into a complicated string, which patsy will parse and convert back into a list of terms. You can see the data structure that patsy generates for this kind of formula (ModelDesc.from_formula is patsy's parser):

In [7]: from patsy import ModelDesc

In [8]: ModelDesc.from_formula("y ~ x1 + x2 + x3")
Out[8]: 
ModelDesc(lhs_termlist=[Term([EvalFactor('y')])],
          rhs_termlist=[Term([]),
                        Term([EvalFactor('x1')]),
                        Term([EvalFactor('x2')]),
                        Term([EvalFactor('x3')])])

This might look a little intimidating, but it's pretty simple really -- you have a ModelDesc, which represents a single formula, and it has a left-hand-side list of terms and a right-hand-side list of terms. Each term is represented by a Term object, and each Term has a list of factors. (Here each term just has a single factor -- if you had any interactions then those terms would have multiple factors.) Also, the "empty interaction" Term([]) is how patsy represents the intercept term.

So you can avoid all this complicated quoting/parsing stuff by directly creating the terms you want and passing them to patsy, skipping the string parsing step

from patsy import ModelDesc, Term, LookupFactor

response_terms = [Term([LookupFactor(response)])]
# start with intercept...
model_terms = [Term([])]
# ...then add another term for each candidate
model_terms += [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)

and now you can pass that model_desc object into any function where you'd normally pass a patsy formula:

ols(model_desc, data).fit().rsquared_adj

There's another trick here: you'll notice that the first example has EvalFactor objects, and now we're using LookupFactor objects instead. The difference is that EvalFactor takes a string of arbitrary Python code, which is nice if you want to write something like np.log(x1), but really annoying if you have variables with name like weight.in.kg. LookupFactor directly takes the name of a variable to look up in your data, so no further quoting is needed.

Alternatively, you could do this with some fancier Python string processing, like:

quoted = ["Q('{}')".format(c) for c in candidates]
formula = "{} ~ {} + 1".format(response, ' + '.join(quoted))

But while this is a bit simpler to start with, it's much more fragile -- for example, think about (or try) what happens if one of your parameters contains a quote character! You should never write something like this in a processing pipeline where the candidate names come from somewhere else that you can't control (e.g. a random CSV file) -- you could get all kinds of arbitrary code execution. The solution above avoids all of these problems.

Reference:

  • https://patsy.readthedocs.io/en/latest/expert-model-specification.html
  • https://patsy.readthedocs.io/en/latest/formulas.html


来源:https://stackoverflow.com/questions/38149482/using-ols-function-with-parameters-that-contain-numbers-spaces

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!