Is there a metric in H2O isolation forest to measure model performance on datasets without labels?

Question

I am running H2O isolation forest in R on an unlabeled dataset to detect outliers. I have no way of obtaining labels for this data. The dataset includes categorical features.

I use the same dataset both to train the model and to predict the anomaly scores, and then somewhat arbitrarily treat the top 1% of rows, those with the largest anomaly scores and the shortest mean path lengths, as outliers. This is my code:

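seed <- 12345
ntrees <- 100
max_depth <- 8      # default is 8
sample_size <- 256  # default is 256

# Train on the unlabeled frame; max_depth and sample_size are passed explicitly so they can be tuned.
isoforest <- h2o.isolationForest(training_frame = dataset.hex, ntrees = ntrees, max_depth = max_depth, sample_size = sample_size, seed = seed)

# Score the same frame; the result has the columns predict and mean_length.
score <- h2o.predict(isoforest, dataset.hex)

# Take the 99th-percentile threshold from the first row of the quantile result.
quantile_thres <- 0.99
quantile_frame <- quantile(score, probs = quantile_thres)
quantile_frame <- as.data.frame(quantile_frame)
threshold <- quantile_frame[1, ]

# Flag rows whose anomaly score exceeds the threshold.
score$predicted_class <- score$predict > threshold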
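
The description mentions both the largest anomaly scores and the shortest path lengths, while the code thresholds only the predict column. Below is a minimal sketch of combining both criteria in base R. It assumes the prediction frame has been pulled into R (for example with as.data.frame(score)) and has the columns predict and mean_length. The data frame score_df is synthetic and hypothetical, built only so the example runs without an H2O cluster, and predict is constructed as a simple rescaling of mean_length purely for illustration.

```r
# Synthetic stand-in for as.data.frame(score); values are made up for illustration.
set.seed(12345)
n <- 1000
mean_length <- runif(n, min = 1, max = 8)  # shorter mean path length = more anomalous
score_df <- data.frame(
  # Illustrative rescaling so that shorter paths get higher scores.
  predict = (max(mean_length) - mean_length) / (max(mean_length) - min(mean_length)),
  mean_length = mean_length
)

quantile_thres <- 0.99

# Upper-tail cutoff on the anomaly score, lower-tail cutoff on the path length.
score_thres  <- quantile(score_df$predict, probs = quantile_thres)
length_thres <- quantile(score_df$mean_length, probs = 1 - quantile_thres)

score_df$high_score   <- score_df$predict > score_thres
score_df$short_length <- score_df$mean_length < length_thres

# A row is flagged only when it satisfies both criteria.
score_df$predicted_class <- score_df$high_score & score_df$short_length

# Cross-tabulate the two criteria to see how much they overlap.
print(table(high_score = score_df$high_score, short_length = score_df$short_length))
```

If the two criteria turn out to flag the same rows on the real predictions, thresholding one column is enough; if they disagree, the cross-tabulation shows how many rows are affected.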

I have two questions:

(1) Is this a proper way to identify outliers in my dataset?

(2) I would like to tune the hyperparameters (e.g., ntrees, max_depth and sample_size) to improve the model's performance on my dataset. Is there a metric in the model output that tells me which model is better? When I checked the model performance, MSE and RMSE were both NaN.

My computer: OS X 10.14.6, 16 GB memory

H2O cluster version:        3.30.0.1

H2O cluster total nodes:    1

H2O cluster total memory:   15.00 GB

H2O cluster total cores:    16

H2O cluster allowed cores:  16

H2O cluster healthy:        TRUE

R Version:                  R version 3.6.3 (2020-02-29)

Please let me know if there is any other information I can provide. Thanks for your help!

Source: https://stackoverflow.com/questions/62516100/is-there-a-metric-in-h2o-isolation-forest-to-measure-model-performance-on-datase

Tags

machine-learning

h2o