问题
I'm doing a Random Forest implementation (for classification), and I have some questions regarding the tree growing algorithm mentioned in literature.
When training a decision tree, there are 2 criteria to stop growing a tree:
a. Stop when there are no more features left to split a node on.
b. Stop when the node has all samples in it belonging to the same class.
Based on that,
1. Consider growing one tree in the forest. When splitting a node of the tree, I randomly select m of the M total features, and then from these m features I find that one feature with maximum information gain. After I've found this one feature, say f, should I remove this feature from the feature list, before proceeding down to the children of the node? If I don't remove this feature, then this feature might get selected again down the tree.
If I implement the algorithm without removing the feature selected at a node, then the only way to stop growing the tree is when the leaves of the tree become "pure". When I did this, I got the "maximum recursion depth" reached error in Python, because the tree couldn't reach that "pure" condition earlier.
The RF literature even those written by Breiman say that the tree should be grown to the maximum . What does this mean?
2. At a node split, after selecting the best feature to split on (by information gain), what should be the threshold on which to split? One approach is to have no threshold, create one child node for every unique value of the feature; but I've continuous-valued features too, so that means creating one child node per sample!
回答1:
Q1
- You shouldn't remove the features from M. Otherwise it will not be able to detect some types of relationships (ex: linear relationships)
- Maybe you can stop earlier, in your condition it might go up to leaves with only 1 sample, this will have no statistical significance. So it's better to stop at say, when the number of samples at leaf is <= 3
Q2
- For continuous features maybe you can bin them to groups and use them to figure out a splitting point.
来源:https://stackoverflow.com/questions/36081370/random-forest-tree-growing-algorithm