as_formula specifier for sklearn.tree.DecisionTreeClassifier in Python?

Question

I was curious whether there is an as_formula specifier (like in statsmodels) for sklearn.tree.DecisionTreeClassifier in Python, or some way to hack one in. Currently, I must use

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

but I would prefer to have something like

clf = clf.fit(formula='Y ~ X', data=df)

The reason is that I would like to specify more than one X column without having to do a lot of array reshaping. Thanks.

Answer 1:

It's currently not possible, but it would be great to have a patsy interface for scikit-learn. I don't think anyone is working on it at the moment, though.

Answer 2:

Thanks for the information. Although there is no current Patsy interface for sklearn, Patsy easily provides the functionality I need. As an example...

from sklearn import tree
from patsy import dmatrix

red = [1,0,0,0,0,1,1,0,0,1,1,0]
green = [0,0,0,1,0,1,1,0,0,1,1,0]
blue = [0,0,1,1,0,0,0,1,0,0,0,0]

y = [0,0,0,0,0,1,1,0,0,1,1,0]

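# patsy looks up red, green and blue in the calling scope;
# '+ 0' drops the intercept column so X has exactly three features
X = dmatrix('red + green + blue + 0')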

dt_clf = tree.DecisionTreeClassifier()
dt_clf = dt_clf.fit(X, y)

pred_r = [1,1,0,0,1,1,0,0,0,0,0,0]
pred_g = [1,1,0,0,1,1,0,0,0,0,0,0]
pred_b = [0,0,1,1,0,0,0,1,0,0,0,0]

test = dmatrix('pred_r + pred_g + pred_b + 0')
dt_clf.predict(test)
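To get closer to the `fit(formula='Y ~ X', data=df)` style asked about in the question, the patsy calls above can be wrapped in two small helpers. This is only a sketch: `fit_formula` and `predict_formula` are hypothetical names, not part of scikit-learn or patsy. It relies on patsy's `dmatrices` and `build_design_matrices`, and assumes the new data uses the same variable names as the training data.

```python
import numpy as np
from patsy import dmatrices, build_design_matrices
from sklearn import tree


def fit_formula(clf, formula, data):
    """Hypothetical helper: fit clf from a patsy formula such as 'y ~ a + b + 0'."""
    y_mat, X_mat = dmatrices(formula, data)
    # patsy returns float matrices; flatten the single response column
    clf.fit(np.asarray(X_mat), np.asarray(y_mat).ravel())
    # design_info records which columns were built, and in which order
    return clf, X_mat.design_info


def predict_formula(clf, design_info, data):
    """Hypothetical helper: rebuild X for new data with the training-time columns."""
    (X_new,) = build_design_matrices([design_info], data)
    return clf.predict(np.asarray(X_new))


train = {
    'red':   [1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
    'green': [0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    'blue':  [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    'y':     [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
}
new = {
    'red':   [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    'green': [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    'blue':  [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
}

clf, info = fit_formula(tree.DecisionTreeClassifier(random_state=0),
                        'y ~ red + green + blue + 0', train)
pred = predict_formula(clf, info, new)
```

Because the design information from training is reused at prediction time, the feature columns are rebuilt in the same order. The predicted labels come back as floats, since patsy builds the response column as a float matrix.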

Perhaps even more convenient is the fact that sklearn plays well with pandas. Using the same data as above...

import pandas as pd

df = pd.DataFrame()
df['red'] = red
df['green'] = green
df['blue'] = blue
df['y'] = y

dt_clf = dt_clf.fit(df[['red','green','blue']], df['y'])
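# Predict on a DataFrame with the same column names and order used for fitting
test_df = pd.DataFrame({'red': pred_r, 'green': pred_g, 'blue': pred_b})
dt_clf.predict(test_df)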

Hopefully this helps someone in the same situation as me.

Note: be very careful that the order of the X columns stays the same between training and prediction. For example, don't train on df[['red','green','blue']] and then predict on df[['blue','green','red']]. It may seem obvious, but it is an easy way to mess things up.
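One way to guard against this is to keep the feature list in a single variable and use it to select columns both when fitting and when predicting. The sketch below assumes the same red/green/blue data as above; the `features` and `new` names are only illustrative.

```python
import pandas as pd
from sklearn import tree

df = pd.DataFrame({
    'red':   [1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
    'green': [0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    'blue':  [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    'y':     [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
})

# Single source of truth for the column order
features = ['red', 'green', 'blue']
clf = tree.DecisionTreeClassifier(random_state=0).fit(df[features], df['y'])

# New data whose columns happen to arrive in a different order
new = pd.DataFrame({'blue': [0, 1, 0], 'green': [1, 0, 0], 'red': [1, 0, 0]})

# Selecting with the same list restores the training-time order
ordered = new[features]
pred = clf.predict(ordered)
```

On scikit-learn 1.0 and later, an estimator fitted on a DataFrame with string column names also records them in `feature_names_in_`, which could be used in place of the `features` list.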

Source: https://stackoverflow.com/questions/31886700/as-formula-specifier-for-sklearn-tree-decisiontreeclassifier-in-python

Tags

python

scikit-learn

decision-tree

patsy