Method of finding threshold in Decision tree for continuous data

别说谁变了你拦得住时间么 提交于 2020-01-17 05:02:20

问题


I am using decision tree in Weka and I have some continuous data, so when I use Weka it automatically find the threshold for me but for some reason I want to implement Decision Tree by myself so I need to know what approach to use to find the threshold to discretize my continuous data?


回答1:


ID3 and C4.5 use entropy heuristic for discretization of continuous data. The method finds a binary cut for each variable (feature). You could apply the same method recursively to get multiple intervals from continuous data.

Suppose at a certain tree node, all instances belong to a set of S, and you are working on variable A and a particular boundary (cut) T, the class information entropy of the partition induced by T, denoted as E(A,T,S) is given by:

             |S1|                 |S2|
E(A, T, S) = ---- Entropy(S1) +   ---- Entropy(S2)
              |S|                 |S|

where |S1| is the number of instances in the first partition; |S2| is the number of instances in the second partition; |S| = |S1|+|S2|.

For a given feature A, the boundary T_min which minimizes the entropy function over all possible partition boundaries, is selected as a binary discretization boundary.

For example, you might have a variable Length, with all possible values as:

Length = {2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0}

Then your T could be:

T = {2.1, 2.8, 3.5, 8.0, 10.0, 20.0, 50.0, 51.0}

in which you cut at every possible Length value. You could also cut at every middle point of adjacent Length values, e.g.,

T = {2.45, 3.15, 5.75, 9.0, 15.0, 35.0, 50.5}

At discretization time, you will iterate through all possible T values and evaluate which one obtains the minimum E(A, T, S). That's it.

See more details in this paper, which also describes other optional methods:

  • ChiMerge Discretization method.
  • Learning Vector Quantization (LVQ) based method
  • Histogram-based method.


来源:https://stackoverflow.com/questions/33501039/method-of-finding-threshold-in-decision-tree-for-continuous-data

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!