Question
I'm trying to compare which feature-selection model is more efficient for a specific domain. Nowadays the state of the art in this domain (GWAS) is regression-based algorithms (LR, LMM, SAIGE, etc.), but I want to try tree-based algorithms (I'm using LightGBM's LGBMClassifier with boosting_type='gbdt', which cross-validation selected as the most efficient option).
I managed to get something like:
Regression-based alg
---------------------
Features    P-Values
f1          2.49746e-21
f2          5.63324e-08
f3          9.78003e-13
...         ...

Tree-based (gain/split)
---------------------
Features    gain/split
f1          12
f2          10
f3          8
...         ...
How do I compare them? I was considering using the features selected by each model to predict my binary outcome variable and seeing which set performs better (F1-score, recall, precision, accuracy, etc.), but that raises another question: for regression I can set an alpha for my p-values and get a specific set of features to use in the prediction model, but what is the gain/split metric in LightGBM? I mean, how do I select which features are relevant and should be used in my predictive model? Is it possible to somehow tell LightGBM: give me the features you've used in the train/test model?
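The comparison I had in mind could be sketched like this: take the feature subset each method selects and fit the same downstream classifier on each, then compare held-out metrics. The data and the two selected index lists below are purely hypothetical placeholders, not my actual GWAS features.

```python
# Sketch: compare two hypothetical feature selections by downstream F1.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Hypothetical column indices chosen by each selection method
selected = {
    "regression (p < alpha)": [0, 1, 2, 3, 4],
    "tree (gain > 0)": [0, 2, 4, 6, 8],
}

scores = {}
for name, cols in selected.items():
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    scores[name] = f1_score(y_te, clf.predict(X_te[:, cols]))
print(scores)
```

The same loop would work with recall, precision, or accuracy in place of F1.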
I realised that when I try lgbm.plot_split_value_histogram(lgbm_clf, 'f3'), it says: "Cannot plot split value histogram, because feature f3 was not used in splitting". So, how do I get the features used in the model?
Thanks!
Source: https://stackoverflow.com/questions/60417647/how-to-compare-feature-selection-regression-based-algorithm-with-tree-based-algo