Question
I'm trying to compare which feature-selection model is more efficient for a specific domain. Nowadays the state of the art in this domain (GWAS) is regression-based algorithms (LR, LMM, SAIGE, etc.), but I want to try tree-based algorithms (I'm using LightGBM's LGBMClassifier with boosting_type='gbdt', which cross-validation selected as the most efficient option).
I managed to get something like:
Regression-based alg
---------------------
Features    P-Values
f1          2.49746e-21
f2          5.63324e-08
f3          9.78003e-13
...         ...

Tree-based (gain/split)
---------------------
Features    gain/split
f1          12
f2          10
f3          8
...         ...
How do I compare them? I was considering using the features selected by each model to predict my binary outcome variable and seeing which set performs better (F1-score, recall, precision, accuracy, etc.), but that raises another question: for regression I can set an alpha for my p-values and get a specific set of features to use in the prediction model, but what is the gain/split metric in LightGBM? I mean, how do I select which features are relevant and should be used in my predictive model? Is it possible to somehow tell LightGBM: give me the features you've used in the train/test model?
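The comparison I had in mind could be sketched like this: take the feature subset each method selects and fit the same downstream classifier on each, then compare held-out metrics. The data and the two selected index lists below are purely hypothetical placeholders, not my actual GWAS features.

```python
# Sketch: compare two hypothetical feature selections by downstream F1.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Hypothetical column indices chosen by each selection method
selected = {
    "regression (p < alpha)": [0, 1, 2, 3, 4],
    "tree (gain > 0)": [0, 2, 4, 6, 8],
}

scores = {}
for name, cols in selected.items():
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    scores[name] = f1_score(y_te, clf.predict(X_te[:, cols]))
print(scores)
```

The same loop would work with recall, precision, or accuracy in place of F1.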
I realised that when I try lgbm.plot_split_value_histogram(lgbm_clf, 'f3'), it says: "Cannot plot split value histogram, because feature f3 was not used in splitting". So, how do I get the features used in the model?
Thanks!
Source: https://stackoverflow.com/questions/60417647/how-to-compare-feature-selection-regression-based-algorithm-with-tree-based-algo