h2o DRF unseen categorical values handling


Question


The documentation for DRF states

What happens when you try to predict on a categorical level not seen during training? DRF converts a new categorical level to a NA value in the test set, and then splits left on the NA value during scoring. The algorithm splits left on NA values because, during training, NA values are grouped with the outliers in the left-most bin.
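To make that concrete, here is a minimal sketch (the column and level names are made up for illustration) that trains a DRF on one categorical predictor and then scores a frame containing a level never seen during training, alongside an explicit NA:

```python
# Minimal sketch: how a DRF scores a categorical level it never saw in training.
# Column and level names ("color", "purple") are made up for illustration.
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()

# Tiny training frame: one categorical predictor, one numeric response, no NAs.
train = h2o.H2OFrame({
    "color": ["red", "red", "blue", "blue", "green", "green"] * 10,
    "y":     [1.0,   1.1,   2.0,    2.1,    3.0,     3.1] * 10,
})
train["color"] = train["color"].asfactor()

drf = H2ORandomForestEstimator(ntrees=10, seed=42)
drf.train(x=["color"], y="y", training_frame=train)

# Test frame with an unseen level ("purple") and an explicit NA.
test = h2o.H2OFrame({"color": ["red", "purple", None]})
test["color"] = test["color"].asfactor()

# Per the docs, the unseen level is converted to NA at scoring time, so the
# "purple" row and the NA row should receive identical predictions.
print(drf.predict(test))
```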

Questions:

  1. So h2o converts unseen levels to NAs and then treats them the same way as NAs in the training data. But what if there are also no NAs in the training data?
  2. Assume my categorical predictor is of enum type and to be understood as non-ordinal. What does "grouped with the outliers in the left-most bin" then mean? If the predictor is non-ordinal there is no "left-most" and there are no "outliers".
  3. Let's put questions 1 and 2 aside and focus on the part "The algorithm splits left on NA values because, during training, NA values are grouped with the outliers in the left-most bin". This contradicts an SO answer showing a single DRF tree derived from a MOJO, where one can clearly see NAs going both left and right. It also contradicts the answer to another question in the documentation, which says that missing values "as a separate category [...] can go either left or right"; see:

How does the algorithm handle missing values during training? Missing values are interpreted as containing information (i.e., missing for a reason), rather than missing at random. During tree building, split decisions for every node are found by minimizing the loss function and treating missing values as a separate category that can go either left or right.
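Assuming the `h2o.tree.H2OTree` API behaves as documented, this can be checked directly: the sketch below (reusing the `drf` model from the snippet above) prints the direction NA values take at every split node of a single tree, which, as the MOJO example mentioned above suggests, need not always be left.

```python
# Sketch: inspect one DRF tree and print the NA direction at each split node.
# Reuses the `drf` model trained in the previous snippet.
from h2o.tree import H2OTree

tree = H2OTree(model=drf, tree_number=0)

# `features` is the split column per node (None for leaves);
# `nas` is the direction ("LEFT"/"RIGHT") assigned to missing values per node.
for node_id, feature, na_dir in zip(tree.node_ids, tree.features, tree.nas):
    if feature is not None:
        print(f"node {node_id}: split on {feature}, NAs go {na_dir}")
```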

The last point is more of a suggestion than a question. The documentation on missing values for GBM says

What happens when you try to predict on a categorical level not seen during training? Unseen categorical levels are turned into NAs, and thus follow the same behavior as an NA. If there are no NAs in the training data, then unseen categorical levels in the test data follow the majority direction (the direction with the most observations). If there are NAs in the training data, then unseen categorical levels in the test data follow the direction that is optimal for the NAs of the training data.
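For comparison, the same experiment with GBM is a short addition to the earlier sketch (reusing `train` and `test`): with no NAs in the training data, the unseen level and the explicit NA should both follow the majority direction and therefore receive the same prediction.

```python
# Sketch: repeat the unseen-level experiment with GBM, reusing `train`/`test`
# from the DRF snippet above.
from h2o.estimators import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(ntrees=10, seed=42)
gbm.train(x=["color"], y="y", training_frame=train)
print(gbm.predict(test))  # the "purple" row and the NA row should match
```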

In contrast to the description of how DRF handles missing values, this seems to be completely consistent. Plus, using the majority direction rather than always going left at split points appears more natural.


Answer 1:


The sentence you pointed to, which seemed to contradict other portions of the docs, is in fact outdated. I have created a Jira ticket to update the FAQ with the correct answer (which is what you see in the GBM missing-values section, i.e. the missing-value handling is the same for GBM and DRF).

As a side note, the enum data type is internally encoded as numeric values; you can learn more about the types of mappings H2O can use here: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html. For example, after the strings are mapped to integers for Enum, a split can partition {0, 1, 2, 3, 4, 5} into {0, 4, 5} and {1, 2, 3}.
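As a hedged illustration of that mapping (reusing the `train` frame and `drf` model from the question's first snippet), the integer codes behind an enum column and the level subsets routed into a tree node can be inspected like this:

```python
# Sketch: the integer encoding behind an enum column and the set-style splits
# described above. Reuses `train` and `drf` from the question's first snippet.
from h2o.tree import H2OTree

# Levels are stored in a fixed order; their positions are the integer codes.
print(train["color"].levels())   # e.g. [['blue', 'green', 'red']]

# For a categorical split, H2OTree records which levels were routed into each
# node; a split like {0, 4, 5} vs. {1, 2, 3} is just such a partition of codes.
tree = H2OTree(model=drf, tree_number=0)
print(tree.levels[:5])           # per-node level subsets (None for the root)
```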

Or take a look at how h2o-3 does binning for categoricals here: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/histograms_and_binning.html



Source: https://stackoverflow.com/questions/52965384/h2o-drf-unseen-categorical-values-handling
