问题
I am using decision tree in Weka and I have some continuous data, so when I use Weka it automatically find the threshold for me but for some reason I want to implement Decision Tree by myself so I need to know what approach to use to find the threshold to discretize my continuous data?
回答1:
ID3 and C4.5 use entropy heuristic for discretization of continuous data. The method finds a binary cut for each variable (feature). You could apply the same method recursively to get multiple intervals from continuous data.
Suppose at a certain tree node, all instances belong to a set of S
, and you are working on variable A
and a particular boundary (cut) T
, the class information entropy of the partition induced by T
, denoted as E(A,T,S)
is given by:
|S1| |S2|
E(A, T, S) = ---- Entropy(S1) + ---- Entropy(S2)
|S| |S|
where |S1|
is the number of instances in the first partition; |S2|
is the number of instances in the second partition; |S| = |S1|+|S2|
.
For a given feature A
, the boundary T_min
which minimizes the entropy function over all possible partition boundaries, is selected as a binary discretization boundary.
For example, you might have a variable Length
, with all possible values as:
Length = {2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0}
Then your T
could be:
T = {2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0}
in which you cut at every possible Length
value. You could also cut at every middle point of adjacent Length
values, e.g.,
T = {2.45, 3.15, 5.75, 9.0, 15.0, 35.0, 50.5}
At discretization time, you will iterate through all possible T
values and evaluate which one obtains the minimum E(A, T, S)
. That's it.
See more details in this paper, which also describes other optional methods:
- ChiMerge Discretization method.
- Learning Vector Quantization (LVQ) based method
- Histogram-based method.
来源:https://stackoverflow.com/questions/33501039/method-of-finding-threshold-in-decision-tree-for-continuous-data