feature-selection

SciKit-Learn Label Encoder resulting in error 'argument must be a string or number'

十年热恋 submitted on 2020-06-14 05:03:10
Problem: I'm a bit confused while creating an ML model here. I'm at the step where I'm trying to take the categorical features from a "large" dataframe (180 columns) and one-hot encode them so that I can find the correlation between the features and select the "best" ones. Here is my code:

    # import labelencoder
    from sklearn.preprocessing import LabelEncoder

    # instantiate labelencoder object
    le = LabelEncoder()

    # apply le on categorical feature columns
    df = df.apply(lambda col: le.fit_transform(col))
    df.head(10)
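The "argument must be a string or number" TypeError usually points to columns that mix strings with missing values (NaN is a float), which LabelEncoder cannot sort. A minimal sketch, assuming the categorical columns are the object-dtype ones and that filling NaNs with a placeholder string is acceptable:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # hypothetical small frame standing in for the real 180-column df
    df = pd.DataFrame({"color": ["red", "blue", None, "red"],
                       "size": [1, 2, 3, 4]})

    # encode only the object-dtype (categorical) columns; fill NaNs first,
    # because a column mixing strings and float NaNs triggers the
    # "argument must be a string or number" error inside LabelEncoder
    cat_cols = df.select_dtypes(include="object").columns
    df[cat_cols] = (df[cat_cols]
                    .fillna("missing")
                    .apply(lambda col: LabelEncoder().fit_transform(col)))
    print(df.head(10))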

sklearn Pipeline: argument of type 'ColumnTransformer' is not iterable

最后都变了- submitted on 2020-06-01 05:07:32
Problem: I am attempting to use a pipeline to feed an ensemble voting classifier, as I want the ensemble learner to use models that train on different feature sets. For this purpose, I followed the tutorial available at [1]. This is the code I have developed so far:

    y = df1.index
    x = preprocessing.scale(df1)
    phy_features = ['A', 'B', 'C']
    phy_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
    phy_processer = ColumnTransformer
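This error typically appears when a ColumnTransformer ends up where scikit-learn expects a (name, estimator) tuple, for example when it is passed bare to a Pipeline or VotingClassifier instead of being wrapped as a named step. A minimal sketch, with hypothetical feature names and classifiers, in which each voting member is a full pipeline that starts with the ColumnTransformer and ends in a classifier:

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import VotingClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    phy_features = ['A', 'B', 'C']
    phy_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])

    # the ColumnTransformer is one *named* step inside a pipeline, never a bare
    # entry in a steps/estimators list
    phy_processer = ColumnTransformer(transformers=[('phy', phy_transformer, phy_features)],
                                      remainder='drop')

    phy_clf = Pipeline(steps=[('preprocessor', phy_processer),
                              ('classifier', SVC(probability=True))])

    # a second, hypothetical member trained on all columns
    all_clf = Pipeline(steps=[('scaler', StandardScaler()),
                              ('classifier', LogisticRegression())])

    # each voting member is a complete pipeline ending in a classifier
    voting_clf = VotingClassifier(estimators=[('phy', phy_clf), ('all', all_clf)],
                                  voting='soft')
    # voting_clf.fit(X_train, y_train)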

All intermediate steps should be transformers and implement fit and transform

核能气质少年 submitted on 2020-05-25 07:54:14
Problem: I am implementing a pipeline that selects important features and then trains my random forest classifier on those same features. This is my code:

    m = ExtraTreesClassifier(n_estimators = 10)
    m.fit(train_cv_x, train_cv_y)
    sel = SelectFromModel(m, prefit=True)
    X_new = sel.transform(train_cv_x)
    clf = RandomForestClassifier(5000)
    model = Pipeline([('m', m), ('sel', sel), ('X_new', X_new), ('clf', clf),])
    params = {'clf__max_features': ['auto', 'sqrt', 'log2']}
    gs = GridSearchCV(model,
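The error is raised because only the last pipeline step may be a plain estimator; every earlier step must be a transformer, so the fitted classifier m and the already-transformed array X_new cannot be steps. A minimal sketch, with illustrative parameter values, where the selector and the final classifier are the only steps and the pipeline itself produces the reduced feature matrix:

    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    # SelectFromModel is the transformer step; it fits its own ExtraTreesClassifier
    # inside the pipeline, so no pre-fitted model or pre-transformed array is needed
    model = Pipeline([
        ('sel', SelectFromModel(ExtraTreesClassifier(n_estimators=10))),
        ('clf', RandomForestClassifier(n_estimators=100)),
    ])

    params = {'clf__max_features': ['sqrt', 'log2']}
    gs = GridSearchCV(model, params, cv=5)
    # gs.fit(train_cv_x, train_cv_y)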

Recursive feature selection may not yield higher performance?

假装没事ソ submitted on 2020-05-09 07:53:28
Problem: I'm trying to analyze the data below. I modeled it with logistic regression first, made predictions, and calculated the accuracy and AUC; then I performed recursive feature selection and calculated accuracy and AUC again. I expected both to be higher, but they are actually lower after the recursive feature selection. Is this expected, or did I miss something? Thanks! Data: https://github.com/amandawang-dev/census-training/blob/master/census-training.csv for
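Recursive feature elimination is not guaranteed to improve a held-out score; if the dropped features carried some signal, accuracy and AUC can fall. One way to sanity-check this is to let cross-validation choose how many features to keep, e.g. with RFECV. A minimal sketch, assuming X and y have already been prepared from the census CSV:

    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold

    # RFECV eliminates features recursively and keeps the subset with the best
    # cross-validated AUC, rather than a fixed, possibly too aggressive, count
    selector = RFECV(LogisticRegression(max_iter=1000),
                     step=1,
                     cv=StratifiedKFold(5),
                     scoring='roc_auc')
    # selector.fit(X, y)
    # print(selector.n_features_)   # how many features CV kept
    # print(selector.support_)      # boolean mask of the kept features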

How to compare regression-based feature selection algorithms with tree-based algorithms?

十年热恋 submitted on 2020-04-16 02:47:07
Problem: I'm trying to compare which feature selection model is more efficient for a specific domain. Currently the state of the art in this domain (GWAS) is regression-based algorithms (LR, LMM, SAIGE, etc.), but I want to try tree-based algorithms (I'm using LightGBM's LGBMClassifier with boosting_type='gbdt', which cross-validation selected as the most efficient option). I managed to get something like:

    Regression-based alg
    ---------------------
    Features    P-Values
    f1          2.49746e-21
    f2          5.63324e
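Since p-values and tree importances live on different scales, one option is to compare the feature orderings they induce rather than the raw numbers. A minimal sketch with hypothetical toy data and p-values (the real ones would come from the GWAS dataset and the regression-based tool):

    import pandas as pd
    from lightgbm import LGBMClassifier

    # hypothetical toy features, labels and regression p-values
    X = pd.DataFrame({'f1': range(100),
                      'f2': [i % 3 for i in range(100)],
                      'f3': [i % 5 for i in range(100)]})
    y = [i % 2 for i in range(100)]
    pvalues = pd.Series({'f1': 1e-20, 'f2': 1e-5, 'f3': 0.3})

    # tree-based ranking: gain importance from the boosted trees
    clf = LGBMClassifier(boosting_type='gbdt', importance_type='gain').fit(X, y)
    gain = pd.Series(clf.feature_importances_, index=X.columns)

    # compare the orderings: smaller p-value and larger gain both mean "stronger"
    print(pvalues.sort_values())
    print(gain.sort_values(ascending=False))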

Put customized functions in Sklearn pipeline

给你一囗甜甜゛ submitted on 2020-04-10 03:36:07
Problem: My classification scheme includes several steps:

1. SMOTE (Synthetic Minority Over-sampling Technique)
2. Fisher criterion for feature selection
3. Standardization (Z-score normalization)
4. SVC (Support Vector Classifier)

The main parameters to tune are the percentile in step 2 and the hyperparameters of the SVC in step 4, and I want to tune them with a grid search. The current solution builds a "partial" pipeline that covers only steps 3 and 4 of the scheme:

    clf = Pipeline([('normal'
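The usual way to make such custom steps grid-searchable is to wrap them as transformers and use imbalanced-learn's Pipeline, which accepts samplers like SMOTE. A minimal sketch, assuming imbalanced-learn is installed and using a simplified, hypothetical Fisher-score transformer whose percentile is exposed as a parameter:

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline  # allows samplers such as SMOTE as steps

    class FisherSelector(BaseEstimator, TransformerMixin):
        """Keep the top `percentile` of features by a simplified Fisher score."""
        def __init__(self, percentile=50):
            self.percentile = percentile

        def fit(self, X, y):
            X, y = np.asarray(X, dtype=float), np.asarray(y)
            classes = np.unique(y)
            means = np.array([X[y == c].mean(axis=0) for c in classes])
            varis = np.array([X[y == c].var(axis=0) for c in classes])
            # between-class spread of the means over average within-class variance
            score = means.var(axis=0) / (varis.mean(axis=0) + 1e-12)
            k = max(1, int(X.shape[1] * self.percentile / 100))
            self.support_ = np.argsort(score)[::-1][:k]
            return self

        def transform(self, X):
            return np.asarray(X, dtype=float)[:, self.support_]

    clf = Pipeline([
        ('smote', SMOTE()),
        ('fisher', FisherSelector()),
        ('normal', StandardScaler()),
        ('svc', SVC()),
    ])

    # both the selection percentile and the SVC hyperparameters are tunable
    params = {'fisher__percentile': [10, 30, 50], 'svc__C': [0.1, 1, 10]}
    gs = GridSearchCV(clf, params, cv=5)
    # gs.fit(X_train, y_train)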

How to get the features selected by the RandomizedSearchCV for LGBMClassifier model?

蹲街弑〆低调 submitted on 2020-03-23 08:17:49
Problem: I'm using RandomizedSearchCV (sklearn) model selection to find the best fit for a LightGBM LGBMClassifier model, but I'm having trouble figuring out which features have been selected for it. I can print out the importance of each one with:

    lgbm_clf = lgbm.LGBMClassifier(boosting_type='gbdt',....
    lgbm_clf.fit(X_train, y_train)
    importance_type = lgbm_clf.importance_type
    lgbm_clf.importance_type = "gain"
    gain = lgbm_clf.feature_importances_
    lgbm_clf.importance_type = "split"
    split =
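RandomizedSearchCV tunes hyperparameters rather than dropping features, so every column is still fed to the winning model; what you can inspect is which features that model actually used, via the importances of best_estimator_. A minimal sketch with hypothetical toy data standing in for X_train / y_train:

    import pandas as pd
    from lightgbm import LGBMClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RandomizedSearchCV

    # toy data standing in for the real training set
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

    search = RandomizedSearchCV(LGBMClassifier(boosting_type='gbdt'),
                                param_distributions={'num_leaves': [15, 31, 63]},
                                n_iter=3, cv=3, random_state=0)
    search.fit(X, y)

    # the refitted best model exposes per-feature importances; features with zero
    # importance were never used in any split
    best = search.best_estimator_
    importances = pd.Series(best.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False))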