Question
I am running H2O isolation forest in R on an unlabeled dataset to detect outliers. I have no way to obtain labels for my data. The dataset includes categorical features.
I train the model and predict the anomaly scores on the same dataset, and then arbitrarily pick the top 1% of rows, those with the largest anomaly scores and the shortest lengths, as the outliers. This is my code:
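seed <- 12345
ntrees <- 100
max_depth <- 8 # default is 8
sample_size <- 256 # default is 256
# Train on the full, unlabeled frame; max_depth and sample_size are passed explicitly (both equal the defaults)
isoforest <- h2o.isolationForest(training_frame=dataset.hex, ntrees=ntrees, max_depth=max_depth, sample_size=sample_size, seed=seed)
# Score the same frame that was used for training
score <- h2o.predict(isoforest, dataset.hex)
# Flag rows whose anomaly score exceeds the 99th percentile
quantile_thres <- 0.99
quantile_frame <- quantile(score, probs=quantile_thres)
quantile_frame <- as.data.frame(quantile_frame)
threshold <- quantile_frame[1,]
score$predicted_class <- score$predict>threshold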
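To make the "largest anomaly scores and shortest lengths" rule concrete, here is a minimal base-R sketch that applies both cut-offs and cross-tabulates them. It does not call H2O: `scores_df` is a hypothetical stand-in for `as.data.frame(score)`, its `predict` and `mean_length` values are simulated, and the assumption that the score decreases monotonically with the mean path length is part of the simulation, not something read from the model.

```r
# Hypothetical stand-in for as.data.frame(score); all values are simulated.
set.seed(12345)
n <- 1000

# 990 "ordinary" rows with long mean path lengths, 10 rows with short ones.
mean_length <- c(rnorm(n - 10, mean = 7, sd = 0.3),
                 rnorm(10, mean = 3, sd = 0.3))

# Assumption of this sketch: shorter mean path length -> higher score in [0, 1].
sim_score <- (max(mean_length) - mean_length) /
  (max(mean_length) - min(mean_length))

scores_df <- data.frame(predict = sim_score, mean_length = mean_length)

quantile_thres <- 0.99
score_cut  <- quantile(scores_df$predict, probs = quantile_thres)
length_cut <- quantile(scores_df$mean_length, probs = 1 - quantile_thres)

# Rule 1: top 1% by anomaly score. Rule 2: bottom 1% by mean path length.
scores_df$flag_by_score  <- scores_df$predict > score_cut
scores_df$flag_by_length <- scores_df$mean_length < length_cut

# Cross-tabulate the two rules to see whether they select the same rows.
print(table(by_score  = scores_df$flag_by_score,
            by_length = scores_df$flag_by_length))
```

On real model output the same cross-tabulation could be run on `as.data.frame(score)` to check whether thresholding `predict` alone, as the code above does, also captures the rows with the shortest `mean_length`.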
I have two questions:
(1) Is what I did a proper way to identify outliers in my dataset?
(2) I would like to tune the hyperparameters (e.g., ntrees, max_depth and sample_size) to improve the model's performance on my dataset. Is there a metric in the model output that tells me which model is better? When I checked the model performance, MSE and RMSE were both NaN.
My computer: OS X 10.14.6, 16 GB memory
H2O cluster version: 3.30.0.1
H2O cluster total nodes: 1
H2O cluster total memory: 15.00 GB
H2O cluster total cores: 16
H2O cluster allowed cores: 16
H2O cluster healthy: TRUE
R Version: R version 3.6.3 (2020-02-29)
Please let me know if there is any other information I can provide. Thanks for your help!
Source: https://stackoverflow.com/questions/62516100/is-there-a-metric-in-h2o-isolation-forest-to-measure-model-performance-on-datase