问题
I am following the H2O example to run target mean encoding in Sparking Water (sparking water 2.4.2 and H2O 3.22.04). It runs well in all the following paragraph
from h2o.targetencoder import TargetEncoder
# change label to factor
input_df_h2o['label'] = input_df_h2o['label'].asfactor()
# add fold column for Target Encoding
input_df_h2o["cv_fold_te"] = input_df_h2o.kfold_column(n_folds = 5, seed = 54321)
# find all categorical features
cat_features = [k for (k,v) in input_df_h2o.types.items() if v in ('string')]
# convert string to factor
for i in cat_features:
input_df_h2o[i] = input_df_h2o[i].asfactor()
# target mean encode
targetEncoder = TargetEncoder(x= cat_features, y = y, fold_column = "cv_fold_te", blending_avg=True)
targetEncoder.fit(input_df_h2o)
But when I start to use the same data set used to fit Target Encoder to run the transform code (see code below):
ext_input_df_h2o = targetEncoder.transform(frame=input_df_h2o,
holdout_type="kfold", # mean is calculating on out-of-fold data only; loo means leave one out
is_train_or_valid=True,
noise = 0, # determines if random noise should be added to the target average
seed=54321)
I will have error like
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6773422589366407956.py", line 331, in <module>
exec(code)
File "<stdin>", line 5, in <module>
File "/usr/lib/envs/env-1101-ver-1619-a-4.2.9-py-3.5.3/lib/python3.5/site-packages/h2o/targetencoder.py", line 97, in transform
assert self._encodingMap.map_keys['string'] == self._teColumns
AssertionError
I found the code in its source code http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/targetencoder.html but how to fix this issue? It is the same table used to run the fit.
回答1:
The issue is because you are trying encoding multiple categorical features. I think that is a bug of H2O, but you can solve putting the transformer in a for loop that iterate over all categorical names.
import numpy as np
import pandas as pd
import h2o
from h2o.targetencoder import TargetEncoder
h2o.init()
df = pd.DataFrame({
'x_0': ['a'] * 5 + ['b'] * 5,
'x_1': ['c'] * 9 + ['d'] * 1,
'x_2': ['a'] * 3 + ['b'] * 7,
'y_0': [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
})
hf = h2o.H2OFrame(df)
hf['cv_fold_te'] = hf.kfold_column(n_folds=2, seed=54321)
hf['y_0'] = hf['y_0'].asfactor()
cat_features = ['x_0', 'x_1', 'x_2']
for item in cat_features:
target_encoder = TargetEncoder(x=[item], y='y_0', fold_column = 'cv_fold_te')
target_encoder.fit(hf)
hf = target_encoder.transform(frame=hf, holdout_type='kfold',
seed=54321, noise=0.0)
hf
回答2:
Thanks everyone for letting us know. Assertion was a precaution as I was not sure whether there could be the case that order could be changed. Rest of the code was written with this assumption in mind and therefore safe to use with changed order anyway, but assertion was left and forgotten. Added test and removed assertion. Now this issue is fixed and merged. Should be available in the upcoming fix release. 0xdata.atlassian.net/browse/PUBDEV-6474
来源:https://stackoverflow.com/questions/55306686/h2o-target-mean-encoder-frames-are-being-sent-in-the-same-order-error