How does the C4.5 Algorithm handle continuous data?

Posted by 五迷三道 on 2019-12-21 05:25:11

Question


I am implementing the C4.5 algorithm in .NET, but I don't have a clear idea of how it deals with continuous (numeric) data. Could someone give me a more detailed explanation?


Answer 1:


For continuous data, C4.5 uses a threshold value: every sample whose value is less than the threshold goes to the left node, and every sample whose value is greater than the threshold goes to the right node. The question is how to derive that threshold from the data you are given. The trick is to sort your data by the continuous variable in ascending order, then iterate over the sorted values, placing a candidate threshold between each pair of adjacent values. For example, if your data for attribute x is:

0.5, 1.2, 3.4, 5.4, 6.0

You first pick a threshold between 0.5 and 1.2. In this case we can just use the average: 0.85. Now compute how much this split reduces the impurity:

Gain(x < 0.85) = H(s) - (l/N) * H(x < 0.85) - (r/N) * H(x > 0.85)

Here H denotes the entropy of the class labels in the given set of samples, l is the number of samples in the left node, r is the number of samples in the right node, and N is the total number of samples in the node being split. In the example above, with the split at 0.85, l=1, r=4, and N=5.
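As a minimal sketch of that calculation in Python (chosen here only for illustration; the question is about .NET), the snippet below evaluates the impurity reduction for the single split at 0.85. The class labels are hypothetical, since the answer gives only the attribute values, and the function names `entropy` and `gain_at_threshold` are made up for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels (0 for an empty list)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_at_threshold(values, labels, threshold):
    """Impurity reduction for the binary split x < threshold vs. x > threshold."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


values = [0.5, 1.2, 3.4, 5.4, 6.0]         # attribute x from the answer
labels = ["no", "no", "yes", "yes", "no"]  # hypothetical class labels

print(gain_at_threshold(values, labels, 0.85))
```

With these assumed labels the left node holds one sample (l=1), the right node holds four (r=4), and the reduction comes out at about 0.171 bits.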

Remember the computed impurity reduction, then compute it for the next split, between the second and third values, 1.2 and 3.4 (i.e. x > 2.3). Repeat that for every candidate split (n-1 splits for n distinct values). Then pick the split that reduces H the most, that is, the one whose child nodes have the lowest weighted entropy. The chosen split should leave the child nodes purer than the unsplit node; if no split increases purity, don't split the node. You can also enforce a minimum node size so that you don't end up with a left or right node containing only a single sample.
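The whole search can be sketched as follows. This is an illustration of the procedure described in this answer, not a full C4.5 implementation: it scores splits by plain entropy reduction, whereas C4.5 itself compares attributes using the gain ratio. The function name `best_threshold`, the `min_leaf` parameter, and the class labels are assumptions introduced for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels (0 for an empty list)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels, min_leaf=1):
    """Return (threshold, gain) for the best binary split, or None if no
    split improves purity or satisfies the minimum node size."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])  # sort by attribute value
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(ys)
    base = entropy(ys)
    best = None
    for i in range(1, n):  # i = number of samples in the left node
        if xs[i - 1] == xs[i]:
            continue  # equal values cannot be separated by a threshold
        if i < min_leaf or n - i < min_leaf:
            continue  # enforce the minimum node size
        gain = base - i / n * entropy(ys[:i]) - (n - i) / n * entropy(ys[i:])
        if gain > 0 and (best is None or gain > best[1]):
            best = ((xs[i - 1] + xs[i]) / 2, gain)  # midpoint as threshold
    return best


values = [0.5, 1.2, 3.4, 5.4, 6.0]         # attribute x from the answer
labels = ["no", "no", "yes", "yes", "no"]  # hypothetical class labels

print(best_threshold(values, labels))
print(best_threshold(values, labels, min_leaf=3))
```

With these assumed labels the best split falls at the midpoint of 1.2 and 3.4 (about 2.3). The second call returns `None`, because five samples cannot be divided into two nodes of at least three samples each.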



Source: https://stackoverflow.com/questions/15629398/how-does-the-c4-5-algorithm-handle-continuous-data
