Question
I was playing around with Weka when I noticed a minNum field in the RandomTree configuration. Its description says "The minimum total weight of the instances in a leaf", but I couldn't really understand what that means.
I experimented with that number and found that increasing it reduces the size of the generated tree. I couldn't work out why this happens.
Any help/references will be appreciated.
Answer 1:
This has to do with the minimum number of instances on a leaf node (which is often 2 by default in decision trees, like J48). With unweighted data each instance has weight 1, so the "total weight" is simply the instance count. The higher you set this parameter, the more general the tree will be: nodes that are too small are no longer split, whereas having many leaves with few instances makes the tree structure too granular.
Here are two examples on the iris dataset, which show how the -M option might affect the size of the resulting tree:
$ weka weka.classifiers.trees.RandomTree -t iris.arff -i
petallength < 2.45 : Iris-setosa (50/0)
petallength >= 2.45
| petalwidth < 1.75
| | petallength < 4.95
| | | petalwidth < 1.65 : Iris-versicolor (47/0)
| | | petalwidth >= 1.65 : Iris-virginica (1/0)
| | petallength >= 4.95
| | | petalwidth < 1.55 : Iris-virginica (3/0)
| | | petalwidth >= 1.55
| | | | sepallength < 6.95 : Iris-versicolor (2/0)
| | | | sepallength >= 6.95 : Iris-virginica (1/0)
| petalwidth >= 1.75
| | petallength < 4.85
| | | sepallength < 5.95 : Iris-versicolor (1/0)
| | | sepallength >= 5.95 : Iris-virginica (2/0)
| | petallength >= 4.85 : Iris-virginica (43/0)
Size of the tree : 17
$ weka weka.classifiers.trees.RandomTree -M 6 -t iris.arff -i
petallength < 2.45 : Iris-setosa (50/0)
petallength >= 2.45
| petalwidth < 1.75
| | petallength < 4.95
| | | petalwidth < 1.65 : Iris-versicolor (47/0)
| | | petalwidth >= 1.65 : Iris-virginica (1/0)
| | petallength >= 4.95 : Iris-virginica (6/2)
| petalwidth >= 1.75
| | petallength < 4.85 : Iris-virginica (3/1)
| | petallength >= 4.85 : Iris-virginica (43/0)
Size of the tree : 11
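Why a larger -M value shrinks the tree can be illustrated with a toy sketch. The Python below is not Weka's implementation: `gini`, `tree_size`, the ten-point `POINTS` data set and the stopping rule (a node is not split when it holds fewer than 2 * min_num instances) are all assumptions made for illustration. That rule is at least consistent with the -M 6 output above, where the nodes holding 6 and 3 instances stay unsplit while the node holding 48 is still split into 47 and 1.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def tree_size(points, min_num):
    """Count the nodes of a toy tree grown on (value, label) pairs.

    Assumed stopping rule (hypothetical, not taken from Weka's source):
    a node becomes a leaf when it is pure or holds fewer than
    2 * min_num instances.
    """
    labels = [label for _, label in points]
    if len(set(labels)) == 1 or len(points) < 2 * min_num:
        return 1  # leaf
    points = sorted(points)
    best = None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # no threshold fits between equal values
        left, right = points[:i], points[i:]
        # Size-weighted impurity of the two children; lower is better.
        score = (len(left) * gini([l for _, l in left])
                 + len(right) * gini([l for _, l in right]))
        if best is None or score < best[0]:
            best = (score, left, right)
    if best is None:
        return 1  # all values identical, cannot split
    _, left, right = best
    return 1 + tree_size(left, min_num) + tree_size(right, min_num)


# Made-up one-attribute data set with two classes, "a" and "b".
POINTS = [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "a"),
          (6, "b"), (7, "b"), (8, "b"), (9, "a"), (10, "b")]

for m in (1, 3, 6):
    print(m, tree_size(POINTS, m))  # sizes 11, 5 and 1 respectively
```

With min_num = 1 the toy tree grows until every leaf is pure; raising the value turns small, impure nodes into leaves earlier, so the node count drops in the same way as the "Size of the tree" figure does above.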
As a side note, RandomTree considers only a random subset of attributes at each node (K randomly chosen ones). This is attribute subsampling rather than bagging, which resamples instances and is what RandomForest adds on top of such trees. Unlike REPTree, RandomTree does no pruning (the same holds for the trees in RandomForest), so you may end up with very noisy trees.
Source: https://stackoverflow.com/questions/4845812/regarding-randomtree-in-weka