Question
I am implementing the C4.5 algorithm in .NET; however, I don't have a clear idea of how it deals with continuous (numeric) data. Could someone give me a more detailed explanation?
Answer 1:
For continuous data, C4.5 uses a threshold value: every sample below the threshold goes to the left node, and every sample above it goes to the right node. The question is how to derive that threshold from the data you're given. The trick is to sort your data by the continuous variable in ascending order, then iterate over the sorted data, picking a candidate threshold between each pair of adjacent values. For example, if your data for attribute x is:
0.5, 1.2, 3.4, 5.4, 6.0
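You first pick a threshold between 0.5 and 1.2. In this case we can just use the average: 0.85. Now compute the impurity decrease (information gain) for that split, where H is the impurity measure (entropy in C4.5):
Gain(x, 0.85) = H(S) - l/N * H(x < 0.85) - r/N * H(x > 0.85)
Here S is the set of samples in the node being split, H(x < 0.85) and H(x > 0.85) are the impurities of the left and right child nodes, l is the number of samples in the left node, r is the number of samples in the right node, and N is the total number of samples in the node being split. In our example, with 0.85 as the threshold, l=1, r=4, and N=5.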
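As a rough illustration, the impurity decrease for the 0.85 threshold can be computed as below. This is a minimal Python sketch, not C4.5's reference implementation: the class labels are hypothetical (the answer only lists the x values), and the names `entropy` and `split_gain` are made up for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels; 0 for an empty list."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_gain(values, labels, threshold):
    """Impurity decrease from splitting the node at `threshold`."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    l, r = len(left), len(right)
    return entropy(labels) - l / n * entropy(left) - r / n * entropy(right)

values = [0.5, 1.2, 3.4, 5.4, 6.0]
labels = ["no", "no", "yes", "yes", "yes"]  # hypothetical labels, not from the answer

gain = split_gain(values, labels, 0.85)  # l=1, r=4, N=5
print(round(gain, 3))
```

With these assumed labels the left node is pure, but the right node still mixes both classes, so the gain is small.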
Remember the computed impurity decrease, then compute it for the next split, between 1.2 and 3.4 (i.e. x > 2.3). Repeat that for every candidate split (n-1 splits for n distinct values). Then pick the split that reduces the impurity the most. The chosen split should leave the resulting nodes purer than the unsplit node; if no split increases the purity, don't split. You can also enforce a minimum node size so you don't end up with left or right nodes containing only one sample.
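The whole search can be sketched as follows. This is a minimal Python illustration of the procedure described in this answer (midpoint thresholds, plain entropy-based gain), not a full C4.5 implementation; the published algorithm also uses gain ratio, which is not shown. The function name `best_threshold`, the `min_leaf` parameter, and the class labels are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels, min_leaf=1):
    """Return (threshold, gain) for the best binary split, or None if no
    split improves purity or satisfies the minimum node size."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])  # sort by attribute value
    n = len(pairs)
    parent = entropy(labels)
    best = None
    for i in range(1, n):  # i = number of samples that go to the left node
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between two equal values
        if i < min_leaf or n - i < min_leaf:
            continue  # one of the child nodes would be too small
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = parent - i / n * entropy(left) - (n - i) / n * entropy(right)
        if gain > 0 and (best is None or gain > best[1]):
            best = ((lo + hi) / 2, gain)  # midpoint of the adjacent values
    return best

values = [0.5, 1.2, 3.4, 5.4, 6.0]
labels = ["no", "no", "yes", "yes", "yes"]  # hypothetical labels, not from the answer
result = best_threshold(values, labels)
print(result)
```

With these assumed labels the search settles on the midpoint between 1.2 and 3.4, because that split separates the two classes completely. Passing `min_leaf=2` would rule out the 0.85 and 5.7 candidates, since each leaves a single sample on one side.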
Source: https://stackoverflow.com/questions/15629398/how-does-the-c4-5-algorithm-handle-continuous-data