Question
I was curious whether there is an as_formula specifier (like in statsmodels) for sklearn.tree.DecisionTreeClassifier in Python, or some way to hack one in. Currently, I must use
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
but I would prefer to have something like
clf = clf.fit(formula='Y ~ X', data=df)
The reason is that I would like to specify more than one X without having to do a lot of array shaping. Thanks.
Answer 1:
It's currently not possible, but it would be great to have a patsy interface for scikit-learn. I don't think anyone is working on it at the moment, though.
Answer 2:
Thanks for the information. Although there is currently no Patsy interface for sklearn, Patsy easily provides the functionality I need. As an example...
from sklearn import tree
from patsy import dmatrix
red = [1,0,0,0,0,1,1,0,0,1,1,0]
green = [0,0,0,1,0,1,1,0,0,1,1,0]
blue = [0,0,1,1,0,0,0,1,0,0,0,0]
y = [0,0,0,0,0,1,1,0,0,1,1,0]
X = dmatrix('red + green + blue + 0')
dt_clf = tree.DecisionTreeClassifier()
dt_clf = dt_clf.fit(X, y)
pred_r = [1,1,0,0,1,1,0,0,0,0,0,0]
pred_g = [1,1,0,0,1,1,0,0,0,0,0,0]
pred_b = [0,0,1,1,0,0,0,1,0,0,0,0]
test = dmatrix('pred_r + pred_g + pred_b + 0')
dt_clf.predict(test)
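For the formula-style call the question asks for, a thin wrapper around Patsy is one option. The following is only a minimal sketch: the helper names fit_formula and predict_formula are hypothetical (they are not part of sklearn or Patsy), and it assumes the data lives in a pandas DataFrame. It uses Patsy's dmatrices to build y and X from a single formula, and build_design_matrices to encode new data with the same design as the training data.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices, build_design_matrices
from sklearn import tree


def fit_formula(clf, formula, data):
    """Hypothetical helper: fit `clf` from a Patsy formula and a DataFrame."""
    y, X = dmatrices(formula, data)
    # Convert to plain arrays; y comes back as an (n, 1) matrix.
    clf.fit(np.asarray(X), np.asarray(y).ravel())
    # Keep the design info so new data can be encoded the same way.
    return clf, X.design_info


def predict_formula(clf, design_info, new_data):
    """Hypothetical helper: encode `new_data` like the training data, then predict."""
    (X_new,) = build_design_matrices([design_info], new_data)
    return clf.predict(np.asarray(X_new))


df = pd.DataFrame({
    'red':   [1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
    'green': [0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    'blue':  [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    'y':     [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
})

# '+ 0' drops the intercept column, as in the example above.
clf, info = fit_formula(tree.DecisionTreeClassifier(), 'y ~ red + green + blue + 0', df)

new_df = pd.DataFrame({
    'red':   [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    'green': [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    'blue':  [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
})
preds = predict_formula(clf, info, new_df)
```

Because the new data is matched to the formula by column name, the order of the columns in new_df does not matter here.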
Perhaps even more convenient is the fact that sklearn plays well with pandas. Using the same data as above...
import pandas as pd
df = pd.DataFrame()
df['red'] = red
df['green'] = green
df['blue'] = blue
df['y'] = y
dt_clf = dt_clf.fit(df[['red','green','blue']], df['y'])
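# predict from a DataFrame with the same columns, in the same order, as in training
test_df = pd.DataFrame({'red': pred_r, 'green': pred_g, 'blue': pred_b})
dt_clf.predict(test_df)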
Hopefully this helps someone in the same situation as me.
Note: be very careful that the order of the X columns stays the same between training and prediction. For example, don't train on df[['red','green','blue']] and then predict on df[['blue','green','red']]. It may seem obvious, but it is an easy way to mess things up.
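One way to guard against the column-order problem is to keep the feature names in a single list and use it for both fitting and predicting. This is just a sketch of that habit; the name feature_cols and the small new_df frame are made up for illustration.

```python
import pandas as pd
from sklearn import tree

# Single source of truth for the feature order (hypothetical name).
feature_cols = ['red', 'green', 'blue']

train = pd.DataFrame({
    'red':   [1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
    'green': [0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    'blue':  [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    'y':     [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
})

clf = tree.DecisionTreeClassifier()
clf = clf.fit(train[feature_cols], train['y'])

# New data whose columns happen to arrive in a different order.
new_df = pd.DataFrame({
    'blue':  [0, 1, 0],
    'green': [1, 0, 0],
    'red':   [1, 0, 0],
})

# Selecting with feature_cols restores the training order before predicting.
preds = clf.predict(new_df[feature_cols])
```

Selecting new_df[feature_cols] also raises a KeyError if an expected column is missing, which surfaces the mistake early instead of producing silently wrong predictions.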
Source: https://stackoverflow.com/questions/31886700/as-formula-specifier-for-sklearn-tree-decisiontreeclassifier-in-python