feature-selection

SciKit-Learn Label Encoder resulting in error 'argument must be a string or number'

十年热恋 submitted on 2020-06-14 05:03:10
Problem: I'm a bit confused while creating an ML model here. I'm at the step where I'm trying to take the categorical features from a "large" dataframe (180 columns) and one-hot encode them so that I can find the correlation between the features and select the "best" ones. Here is my code:

    # import labelencoder
    from sklearn.preprocessing import LabelEncoder

    # instantiate labelencoder object
    le = LabelEncoder()

    # apply le on categorical feature columns
    df = df.apply(lambda col: le.fit_transform(col))
    df.head(10)
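The "argument must be a string or number" TypeError usually points to columns that mix strings with missing values (NaN is a float), which LabelEncoder cannot sort. A minimal sketch, assuming the categorical columns are the object-dtype ones and that filling NaNs with a placeholder string is acceptable:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # hypothetical small frame standing in for the real 180-column df
    df = pd.DataFrame({"color": ["red", "blue", None, "red"],
                       "size": [1, 2, 3, 4]})

    # encode only the object-dtype (categorical) columns; fill NaNs first,
    # because a column mixing strings and float NaNs triggers the
    # "argument must be a string or number" error inside LabelEncoder
    cat_cols = df.select_dtypes(include="object").columns
    df[cat_cols] = (df[cat_cols]
                    .fillna("missing")
                    .apply(lambda col: LabelEncoder().fit_transform(col)))
    print(df.head(10))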

sklearn Pipeline: argument of type 'ColumnTransformer' is not iterable

最后都变了- submitted on 2020-06-01 05:07:32
Problem: I am attempting to use a pipeline to feed an ensemble voting classifier, as I want the ensemble learner to use models that train on different feature sets. For this purpose, I followed the tutorial available at [1]. This is the code I have developed so far:

    y = df1.index
    x = preprocessing.scale(df1)
    phy_features = ['A', 'B', 'C']
    phy_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
    phy_processer = ColumnTransformer
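This error typically appears when a ColumnTransformer ends up where scikit-learn expects a (name, estimator) tuple, for example when it is passed bare to a Pipeline or VotingClassifier instead of being wrapped as a named step. A minimal sketch, with hypothetical feature names and classifiers, in which each voting member is a full pipeline that starts with the ColumnTransformer and ends in a classifier:

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import VotingClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    phy_features = ['A', 'B', 'C']
    phy_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])

    # the ColumnTransformer is one *named* step inside a pipeline, never a bare
    # entry in a steps/estimators list
    phy_processer = ColumnTransformer(transformers=[('phy', phy_transformer, phy_features)],
                                      remainder='drop')

    phy_clf = Pipeline(steps=[('preprocessor', phy_processer),
                              ('classifier', SVC(probability=True))])

    # a second, hypothetical member trained on all columns
    all_clf = Pipeline(steps=[('scaler', StandardScaler()),
                              ('classifier', LogisticRegression())])

    # each voting member is a complete pipeline ending in a classifier
    voting_clf = VotingClassifier(estimators=[('phy', phy_clf), ('all', all_clf)],
                                  voting='soft')
    # voting_clf.fit(X_train, y_train)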

All intermediate steps should be transformers and implement fit and transform

核能气质少年 submitted on 2020-05-25 07:54:14
Problem: I am implementing a pipeline that selects important features and then trains my random forest classifier on those same features. This is my code:

    m = ExtraTreesClassifier(n_estimators = 10)
    m.fit(train_cv_x, train_cv_y)
    sel = SelectFromModel(m, prefit=True)
    X_new = sel.transform(train_cv_x)
    clf = RandomForestClassifier(5000)
    model = Pipeline([('m', m), ('sel', sel), ('X_new', X_new), ('clf', clf),])
    params = {'clf__max_features': ['auto', 'sqrt', 'log2']}
    gs = GridSearchCV(model,
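The error is raised because only the last pipeline step may be a plain estimator; every earlier step must be a transformer, so the fitted classifier m and the already-transformed array X_new cannot be steps. A minimal sketch, with illustrative parameter values, where the selector and the final classifier are the only steps and the pipeline itself produces the reduced feature matrix:

    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    # SelectFromModel is the transformer step; it fits its own ExtraTreesClassifier
    # inside the pipeline, so no pre-fitted model or pre-transformed array is needed
    model = Pipeline([
        ('sel', SelectFromModel(ExtraTreesClassifier(n_estimators=10))),
        ('clf', RandomForestClassifier(n_estimators=100)),
    ])

    params = {'clf__max_features': ['sqrt', 'log2']}
    gs = GridSearchCV(model, params, cv=5)
    # gs.fit(train_cv_x, train_cv_y)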

Recursive feature selection may not yield higher performance?

假装没事ソ submitted on 2020-05-09 07:53:28
Problem: I'm trying to analyze the data below. I modeled it with logistic regression first, made predictions, and calculated the accuracy and AUC; then I performed recursive feature selection and calculated accuracy and AUC again. I expected both to be higher, but they are actually lower after the recursive feature selection. Is this expected, or did I miss something? Thanks! Data: https://github.com/amandawang-dev/census-training/blob/master/census-training.csv for
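Recursive feature elimination is not guaranteed to improve a held-out score; if the dropped features carried some signal, accuracy and AUC can fall. One way to sanity-check this is to let cross-validation choose how many features to keep, e.g. with RFECV. A minimal sketch, assuming X and y have already been prepared from the census CSV:

    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold

    # RFECV eliminates features recursively and keeps the subset with the best
    # cross-validated AUC, rather than a fixed, possibly too aggressive, count
    selector = RFECV(LogisticRegression(max_iter=1000),
                     step=1,
                     cv=StratifiedKFold(5),
                     scoring='roc_auc')
    # selector.fit(X, y)
    # print(selector.n_features_)   # how many features CV kept
    # print(selector.support_)      # boolean mask of the kept features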

How to compare regression-based feature selection algorithms with tree-based algorithms?

十年热恋 submitted on 2020-04-16 02:47:07
Problem: I'm trying to compare which feature selection model is more efficient for a specific domain. Currently the state of the art in this domain (GWAS) is regression-based algorithms (LR, LMM, SAIGE, etc.), but I want to try tree-based algorithms (I'm using LightGBM's LGBMClassifier with boosting_type='gbdt', which cross-validation selected as the most efficient option). I managed to get something like:

    Regression-based alg
    ---------------------
    Features    P-Values
    f1          2.49746e-21
    f2          5.63324e
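Since p-values and tree importances live on different scales, one option is to compare the feature orderings they induce rather than the raw numbers. A minimal sketch with hypothetical toy data and p-values (the real ones would come from the GWAS dataset and the regression-based tool):

    import pandas as pd
    from lightgbm import LGBMClassifier

    # hypothetical toy features, labels and regression p-values
    X = pd.DataFrame({'f1': range(100),
                      'f2': [i % 3 for i in range(100)],
                      'f3': [i % 5 for i in range(100)]})
    y = [i % 2 for i in range(100)]
    pvalues = pd.Series({'f1': 1e-20, 'f2': 1e-5, 'f3': 0.3})

    # tree-based ranking: gain importance from the boosted trees
    clf = LGBMClassifier(boosting_type='gbdt', importance_type='gain').fit(X, y)
    gain = pd.Series(clf.feature_importances_, index=X.columns)

    # compare the orderings: smaller p-value and larger gain both mean "stronger"
    print(pvalues.sort_values())
    print(gain.sort_values(ascending=False))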

Put customized functions in Sklearn pipeline

给你一囗甜甜゛ submitted on 2020-04-10 03:36:07
Problem: My classification scheme includes several steps:

1. SMOTE (Synthetic Minority Over-sampling Technique)
2. Fisher criterion for feature selection
3. Standardization (Z-score normalization)
4. SVC (Support Vector Classifier)

The main parameters to tune are the percentile in step 2 and the hyperparameters of the SVC in step 4, and I want to tune them with a grid search. The current solution builds a "partial" pipeline that covers only steps 3 and 4 of the scheme:

    clf = Pipeline([('normal'
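The usual way to make such custom steps grid-searchable is to wrap them as transformers and use imbalanced-learn's Pipeline, which accepts samplers like SMOTE. A minimal sketch, assuming imbalanced-learn is installed and using a simplified, hypothetical Fisher-score transformer whose percentile is exposed as a parameter:

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline  # allows samplers such as SMOTE as steps

    class FisherSelector(BaseEstimator, TransformerMixin):
        """Keep the top `percentile` of features by a simplified Fisher score."""
        def __init__(self, percentile=50):
            self.percentile = percentile

        def fit(self, X, y):
            X, y = np.asarray(X, dtype=float), np.asarray(y)
            classes = np.unique(y)
            means = np.array([X[y == c].mean(axis=0) for c in classes])
            varis = np.array([X[y == c].var(axis=0) for c in classes])
            # between-class spread of the means over average within-class variance
            score = means.var(axis=0) / (varis.mean(axis=0) + 1e-12)
            k = max(1, int(X.shape[1] * self.percentile / 100))
            self.support_ = np.argsort(score)[::-1][:k]
            return self

        def transform(self, X):
            return np.asarray(X, dtype=float)[:, self.support_]

    clf = Pipeline([
        ('smote', SMOTE()),
        ('fisher', FisherSelector()),
        ('normal', StandardScaler()),
        ('svc', SVC()),
    ])

    # both the selection percentile and the SVC hyperparameters are tunable
    params = {'fisher__percentile': [10, 30, 50], 'svc__C': [0.1, 1, 10]}
    gs = GridSearchCV(clf, params, cv=5)
    # gs.fit(X_train, y_train)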

How to get the features selected by the RandomizedSearchCV for LGBMClassifier model?

蹲街弑〆低调 submitted on 2020-03-23 08:17:49
Problem: I'm using RandomizedSearchCV (sklearn) model selection to find the best fit for a LightGBM LGBMClassifier model, but I'm having trouble figuring out which features have been selected for it. I can print out the importance of each one with:

    lgbm_clf = lgbm.LGBMClassifier(boosting_type='gbdt',....
    lgbm_clf.fit(X_train, y_train)
    importance_type = lgbm_clf.importance_type
    lgbm_clf.importance_type = "gain"
    gain = lgbm_clf.feature_importances_
    lgbm_clf.importance_type = "split"
    split =
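RandomizedSearchCV tunes hyperparameters rather than dropping features, so every column is still fed to the winning model; what you can inspect is which features that model actually used, via the importances of best_estimator_. A minimal sketch with hypothetical toy data standing in for X_train / y_train:

    import pandas as pd
    from lightgbm import LGBMClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RandomizedSearchCV

    # toy data standing in for the real training set
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

    search = RandomizedSearchCV(LGBMClassifier(boosting_type='gbdt'),
                                param_distributions={'num_leaves': [15, 31, 63]},
                                n_iter=3, cv=3, random_state=0)
    search.fit(X, y)

    # the refitted best model exposes per-feature importances; features with zero
    # importance were never used in any split
    best = search.best_estimator_
    importances = pd.Series(best.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False))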