ID3 and C4.5: How Does “Gain Ratio” Normalize “Gain”?

独自空忆成欢 posted on 2020-01-01 03:39:30

Question


The ID3 algorithm uses the "Information Gain" measure.

C4.5 uses the "Gain Ratio" measure, which is Information Gain divided by SplitInfo, where SplitInfo is high for a split in which records are divided evenly between the different outcomes and low otherwise.

My question is:

How does this help to solve the problem that Information Gain is biased towards splits with many outcomes? I can't see the reason. SplitInfo doesn't even take into account the number of outcomes, just the distribution of records in the split.

It may very well be that there is a low number of outcomes (say 2), and the records are split evenly between those 2 outcomes. In that case, SplitInfo is high, Gain Ratio is low, and a split with few outcomes is less likely to be chosen by C4.5.

On the other hand, it may be that there is a low number of outcomes, but the distribution is far from even. In that case, SplitInfo is low, Gain Ratio is high, and a split with many outcomes is more likely to be chosen.

What am I missing?


Answer 1:


"SplitInfo doesn't even take into account the number of outcomes, just the distribution of records in the split."

But it does take the number of outcomes into account (even though it also depends on the distribution, as you noted). Your comparison is between two situations with the same ("low") number of outcomes, so it can't possibly illustrate how SplitInfo changes as the number of outcomes changes.
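For reference, these are the usual textbook definitions for a split of a record set S on attribute A into k subsets S_1, ..., S_k. The symbols S, A, k and S_i are notation introduced here for illustration; they do not appear in the question.

```latex
\mathrm{SplitInfo}(S, A) = -\sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|},
\qquad
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
```

The sum runs over all k outcomes, so k enters the value directly. For a perfectly even split every fraction is 1/k and SplitInfo reduces to log2(k).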

Consider the following 3 situations, all with even distribution for simplicity of comparison:

  • 10 possible outcomes with even distribution

    SplitInfo = -10*(1/10*log2(1/10)) = 3.32

  • 100 possible outcomes with even distribution

    SplitInfo = -100*(1/100*log2(1/100)) = 6.64

  • 1000 possible outcomes with even distribution

    SplitInfo = -1000*(1/1000*log2(1/1000)) = 9.97
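The three figures above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `split_info` is made up for this example, and it takes the number of records that fall into each outcome.

```python
import math

def split_info(counts):
    """SplitInfo of a split, given the record count of each outcome."""
    total = sum(counts)
    # Empty outcomes contribute nothing, so skip them to avoid log2(0).
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

for k in (10, 100, 1000):
    # k outcomes with one record each: a perfectly even split.
    print(k, round(split_info([1] * k), 2))
# prints:
# 10 3.32
# 100 6.64
# 1000 9.97
```

For an even split the result is simply log2(k), which is why it keeps growing with the number of outcomes.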

So if you had to choose between these 3 splitting scenarios using only Information Gain, as in ID3, the last one would tend to be chosen, because splitting into more outcomes generally produces smaller, purer subsets and therefore a higher gain. However, because SplitInfo is the denominator of GainRatio, it should be clear that as the number of outcomes goes up, SplitInfo also goes up and, for a given gain, GainRatio goes down.
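To see the two measures disagree, here is a small sketch on invented data: 8 records, a hypothetical `record_id` attribute that is unique per record (8 outcomes), and a hypothetical two-valued `flag` attribute that separates the classes well but not perfectly. The function names and the data are assumptions made for this illustration only.

```python
import math
from collections import Counter, defaultdict

def entropy(items):
    """Shannon entropy (in bits) of the value distribution in items."""
    total = len(items)
    return -sum(n / total * math.log2(n / total) for n in Counter(items).values())

def gain_and_ratio(attr_values, labels):
    """Return (information gain, gain ratio) for splitting labels on attr_values."""
    groups = defaultdict(list)
    for value, label in zip(attr_values, labels):
        groups[value].append(label)
    total = len(labels)
    # Weighted entropy of the class labels remaining after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # SplitInfo is the entropy of the attribute's own value distribution.
    # It would be 0 for a single-valued attribute; that case is not handled here.
    split_info = entropy(attr_values)
    return gain, gain / split_info

labels = ["yes"] * 4 + ["no"] * 4
record_id = list(range(8))       # 8 outcomes, one record each
flag = ["a"] * 3 + ["b"] * 5     # 2 outcomes, slightly uneven

for name, values in (("record_id", record_id), ("flag", flag)):
    gain, ratio = gain_and_ratio(values, labels)
    print(f"{name}: gain={gain:.3f} ratio={ratio:.3f}")
# prints:
# record_id: gain=1.000 ratio=0.333
# flag: gain=0.549 ratio=0.575
```

Information Gain alone prefers `record_id`, which is useless for prediction, while dividing by SplitInfo (3 bits for `record_id`, about 0.95 bits for `flag`) reverses the ranking.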

All of that was explained under the assumption of evenly distributed splits. However, the same trend holds with non-uniform distributions: SplitInfo tends to get higher as the number of possible outcomes gets higher. Yes, if we hold the number of possible outcomes constant and vary the outcome distribution, then SplitInfo will vary... but so will Information Gain.
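The non-uniform case can be checked the same way. The sketch below assumes one particular skewed shape chosen only for illustration (half of the records in one branch, the rest shared evenly by the other k - 1 branches); `split_info` and `skewed` are hypothetical helper names.

```python
import math

def split_info(proportions):
    """SplitInfo of a split, given the fraction of records in each outcome."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

def skewed(k):
    """Illustrative skewed split: 50% in one branch, 50% spread over k - 1 others."""
    return [0.5] + [0.5 / (k - 1)] * (k - 1)

# Same skewed shape, growing number of outcomes: SplitInfo still grows.
for k in (2, 10, 100):
    print(k, round(split_info(skewed(k)), 2))

# Fixed number of outcomes (2), varying distribution: SplitInfo varies too.
print(round(split_info([0.5, 0.5]), 2), round(split_info([0.9, 0.1]), 2))
# prints:
# 2 1.0
# 10 2.58
# 100 4.31
# 1.0 0.47
```

The last line covers the two scenarios from the question: with 2 outcomes, SplitInfo moves between roughly 0.47 and 1.0 depending on the distribution, but it cannot exceed log2(2) = 1, whereas a split with many outcomes can go well beyond that.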



Source: https://stackoverflow.com/questions/13224649/id3-and-c4-5-how-does-gain-ratio-normalize-gain
