How to deal with missing attribute values in C4.5 (J48) decision tree?


Question


What's the best way to handle missing feature attribute values with Weka's C4.5 (J48) decision tree? The problem of missing values occurs during both training and classification.

  1. If values are missing from training instances, am I correct in assuming that I place a '?' value for the feature?

  2. Suppose that I am able to successfully build the decision tree and then create my own tree code in C++ or Java from Weka's tree structure. During classification time, if I am trying to classify a new instance, what value do I put for features that have missing values? How would I descend the tree past a decision node for which I have an unknown value?
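Regarding question 1: yes, Weka's ARFF format marks a missing value with '?'. A minimal illustrative training file (attribute and relation names are made up for this example) might look like:

```
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny, 85, no
overcast, ?, yes
?, 70, yes
```

Here the second instance is missing `temperature` and the third is missing `outlook`; J48 will still train on both.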

Would using Naive Bayes be better for handling missing values? I would just assign a very small non-zero probability to them, right?


Answer 1:


From Pedro Domingos' machine learning course at the University of Washington:

Here are three approaches Pedro suggests for a missing value of attribute A:

  • Assign the most common value of A among the other examples sorted to node n.
  • Assign the most common value of A among the other examples with the same target value.
  • Assign probability p_i to each possible value v_i of A, and send a fraction p_i of the example down the corresponding branch of the tree.
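The three strategies above can be sketched in plain Python. This is a hedged illustration of the ideas, not Weka's actual implementation; the function names and the '?' convention for missing values are my own choices for the example:

```python
from collections import Counter

def impute_most_common(values):
    """Strategy 1: replace '?' with the most common observed value."""
    known = [v for v in values if v != '?']
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v == '?' else v for v in values]

def impute_by_class(values, labels):
    """Strategy 2: replace '?' with the most common value among
    examples that share the same target label."""
    out = []
    for v, y in zip(values, labels):
        if v == '?':
            known = [u for u, l in zip(values, labels) if l == y and u != '?']
            v = Counter(known).most_common(1)[0][0]
        out.append(v)
    return out

def value_fractions(values):
    """Strategy 3: weights p_i for each observed value v_i, used to
    split a missing example fractionally across the branches."""
    known = [v for v in values if v != '?']
    return {v: c / len(known) for v, c in Counter(known).items()}
```

The third strategy is the one C4.5 itself uses: the example descends every branch with weight p_i rather than being forced down a single path.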

The slides and video are now viewable here.




Answer 2:


An alternative approach is to leave the missing value as '?' and exclude it from the information gain calculation, so no split is ever chosen on the strength of unknown values. At classification time, you likewise treat the value as unknown rather than discarding the instance because of that one attribute.
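The "skip unknowns during gain" idea can be sketched as follows. This is a minimal Python illustration, not Weka's code; it assumes a categorical attribute with '?' marking missing values, and it scales the gain by the fraction of known values, which is what C4.5 does:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain_skip_missing(values, labels):
    """Information gain for one attribute, computed only over examples
    whose value is known, then scaled by the known fraction."""
    known = [(v, y) for v, y in zip(values, labels) if v != '?']
    if not known:
        return 0.0
    frac = len(known) / len(values)
    known_labels = [y for _, y in known]
    gain = entropy(known_labels)
    for v in {v for v, _ in known}:
        subset = [y for u, y in known if u == v]
        gain -= len(subset) / len(known) * entropy(subset)
    return frac * gain
```

With no missing values this reduces to ordinary information gain; each '?' both shrinks the sample used for the split statistics and discounts the resulting gain.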



Source: https://stackoverflow.com/questions/13425722/how-to-deal-with-missing-attribute-values-in-c4-5-j48-decision-tree
