feature-selection

How do I find the lowest regularization parameter (C) using Randomized Logistic Regression in scikit-learn?

跟風遠走 submitted on 2020-01-01 18:17:33
Question: I'm trying to use scikit-learn's Randomized Logistic Regression feature-selection method, but I keep running into cases where it kills all the features while fitting and returns: ValueError: Found array with 0 feature(s) (shape=(777, 0)) while a minimum of 1 is required. This is expected, of course, because I'm reducing the regularization parameter C to ridiculously low levels (note that this is the inverse of the mathematical regularization parameter lambda, i.e., C = 1/lambda), so the…
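
Below is a minimal sketch of how the smallest workable C can be found by sweeping downward and stopping before every feature is eliminated. It uses an L1-penalized LogisticRegression as a stand-in (RandomizedLogisticRegression has since been removed from scikit-learn); the data and the C grid are illustrative, not taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the 777-sample problem described in the question.
X, y = make_classification(n_samples=777, n_features=20, random_state=0)

lowest_c = None
for C in np.logspace(1, -4, 30):            # sweep C from large to very small
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y)
    n_selected = np.count_nonzero(clf.coef_)
    if n_selected == 0:                      # all features killed: stop before the error case
        break
    lowest_c = C

print("smallest C that still keeps at least one feature:", lowest_c)
```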

apache spark MLLib: how to build labeled points for string features?

ぃ、小莉子 submitted on 2020-01-01 07:38:19
Question: I am trying to build a NaiveBayes classifier with Spark's MLlib which takes as input a set of documents. I'd like to use some things as features (i.e. authors, explicit tags, implicit keywords, category), but looking at the documentation it seems that a LabeledPoint contains only doubles, i.e. it looks like LabeledPoint[Double, List[Pair[Double,Double]]]. Instead, what I have as output from the rest of my code would be something like LabeledPoint[Double, List[Pair[String,Double]]]. I could make…
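
One common route, sketched below in PySpark under the assumption that the string features can simply be hashed to indices: HashingTF turns each list of strings into a numeric sparse vector, which is exactly what LabeledPoint accepts. The documents, labels, and feature count are made up for illustration.

```python
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes

sc = SparkContext(appName="string-features-demo")

# Hypothetical documents: (label, list of string features such as authors/tags/keywords).
docs = sc.parallelize([
    (0.0, ["alice", "spark", "tutorial"]),
    (1.0, ["bob", "ml", "naive-bayes"]),
])

# HashingTF maps each string feature to an index and counts it,
# yielding the all-numeric vectors that LabeledPoint requires.
htf = HashingTF(numFeatures=1000)
training = docs.map(lambda lp: LabeledPoint(lp[0], htf.transform(lp[1])))

model = NaiveBayes.train(training)
```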

How does sklearn random forest index feature_importances_

和自甴很熟 submitted on 2020-01-01 05:17:08
Question: I have used the RandomForestClassifier in sklearn to determine the important features in my dataset. How am I able to return the actual feature names (my variables are labeled x1, x2, x3, etc.) rather than their relative names (it tells me the important features are '12', '22', etc.)? Below is the code that I am currently using to return the important features:

important_features = []
for x, i in enumerate(rf.feature_importances_):
    if i > np.average(rf.feature_importances_):
        important_features…
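
A hedged sketch of the usual fix: feature_importances_ follows the column order of the training matrix, so keeping the original column names alongside it recovers the real variable names. The DataFrame, data, and threshold below are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training frame whose columns carry the real variable names.
X = pd.DataFrame(np.random.rand(100, 3), columns=["x1", "x2", "x3"])
y = np.random.randint(0, 2, size=100)

rf = RandomForestClassifier(random_state=0).fit(X, y)

# feature_importances_ is ordered exactly like the columns of X,
# so zipping the two recovers the actual names instead of the indices.
important = [
    (name, imp)
    for name, imp in zip(X.columns, rf.feature_importances_)
    if imp > np.average(rf.feature_importances_)
]
print(important)
```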

best-found PCA estimator to be used as the estimator in RFECV

老子叫甜甜 submitted on 2019-12-31 06:59:05
Question: This works (mostly from the demo sample at sklearn):

print(__doc__)
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform

lregress = LinearRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), (…
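
As written, the excerpt's LinearRegression() call would fail because only the linear_model module is imported. A self-contained sketch of the PCA-plus-regression pipeline being set up, with an illustrative dataset and parameter grid, might look like this:

```python
from sklearn import decomposition, datasets
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_diabetes(return_X_y=True)

lregress = LinearRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[("pca", pca), ("regress", lregress)])

# Grid-search the number of PCA components; the best-found pipeline can then
# be inspected via search.best_estimator_.
search = GridSearchCV(pipe, {"pca__n_components": [2, 4, 6, 8]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```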

How to retrieve coefficient names after label encoding and one hot encoding on scikit-learn?

旧时模样 submitted on 2019-12-31 03:29:05
Question: I am running a machine learning model (ridge regression with cross-validation) using scikit-learn's RidgeCV() method. My data set has 5 categorical features and 2 numerical ones, so I started with LabelEncoder() to convert the categorical features to integers, and then I applied OneHotEncoder() to make several new feature columns of 0s and 1s in order to apply my machine learning model. My X_train is now a numpy array, and after fitting the model I am getting its coefficients, so I'm wondering…
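
A hedged sketch of one way to get named coefficients back, assuming a recent scikit-learn (>= 1.0 for get_feature_names_out) and using a ColumnTransformer with OneHotEncoder in place of the LabelEncoder step; the toy frame, column names, and target are invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import RidgeCV

# Hypothetical frame: two categorical features and one numerical one.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["S", "M", "L", "M"],
    "price": [1.0, 2.5, 3.0, 2.0],
})
y = np.array([1.0, 2.0, 3.0, 2.5])

ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["color", "size"])],
    remainder="passthrough",
)
X = ct.fit_transform(df)

model = RidgeCV().fit(X, y)

# get_feature_names_out yields one name per expanded column, aligned with coef_.
for name, coef in zip(ct.get_feature_names_out(), model.coef_):
    print(name, coef)
```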

Normalizing feature values for SVM

萝らか妹 submitted on 2019-12-30 18:26:13
Question: I've been playing with some SVM implementations and I am wondering: what is the best way to normalize feature values to fit into one range (from 0 to 1)? Let's suppose I have 3 features with values in the ranges 3–5, 0.02–0.05, and 10–15. How do I convert all of those values into the range [0, 1]? What if, during training, the highest value of feature number 1 that I encounter is 5, and after I begin to use my model on much bigger datasets I stumble upon values as high as 7? Then in…
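
A small sketch of the standard approach with MinMaxScaler: the scaling is fit on the training data and those same minima/maxima are reused at prediction time, so an unseen value such as 7 simply maps above 1 (or can be clipped on scikit-learn >= 0.24). The sample values echo the ranges in the question.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Training data spanning the three example ranges from the question.
X_train = np.array([
    [3.0, 0.02, 10.0],
    [5.0, 0.05, 15.0],
    [4.0, 0.03, 12.0],
])

scaler = MinMaxScaler()                 # maps each feature to [0, 1] using the training min/max
X_train_scaled = scaler.fit_transform(X_train)

# At prediction time, reuse the *training* min/max. An unseen value such as 7
# simply scales to > 1 (or can be clipped with MinMaxScaler(clip=True) on
# scikit-learn >= 0.24).
X_new = np.array([[7.0, 0.04, 11.0]])
print(scaler.transform(X_new))
```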

Feature selection using python

给你一囗甜甜゛ submitted on 2019-12-23 05:22:40
Question: It's a letter-recognition task with 284 images and 19 classes, and I want to apply naive Bayes. First I have to convert each image to a feature vector, and to reduce extra information I should use some feature-selection code, such as cropping the images to remove the extra black borders. But I'm not very experienced in Python. How can I crop the black spaces in the images in order to decrease the size of the CSV files (because there are more columns than expected)? And also, how can I resize the images to be the same…
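
A minimal sketch, assuming Pillow and NumPy are available and that the letters are brighter than the near-black border; the threshold and target size are illustrative choices, not values from the question.

```python
import numpy as np
from PIL import Image

def crop_black_border(path, size=(32, 32), threshold=10):
    """Crop away near-black borders from a letter image, then resize.

    `threshold` and `size` are illustrative defaults, not values from the question.
    """
    img = np.array(Image.open(path).convert("L"))      # grayscale pixel array
    mask = img > threshold                              # pixels that are not (near-)black
    if not mask.any():                                  # fully black image: leave as-is
        cropped = img
    else:
        rows = np.where(mask.any(axis=1))[0]
        cols = np.where(mask.any(axis=0))[0]
        cropped = img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    # Resize so every image yields a feature vector of the same length.
    resized = Image.fromarray(cropped).resize(size)
    return np.asarray(resized).flatten()
```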

How to handle One-Hot Encoding in production environment when number of features in Training and Test are different?

£可爱£侵袭症+ submitted on 2019-12-22 08:46:38
Question: While doing certain experiments, we usually train on 70% and test on 33%. But what happens when your model is in production? The following may occur:

Training Set:
-----------------------
| Ser | Type Of Car   |
-----------------------
| 1   | Hatchback     |
| 2   | Sedan         |
| 3   | Coupe         |
| 4   | SUV           |
-----------------------

After one-hot encoding this, this is what we get:
-----------------------------------------
| Ser | Hatchback | Sedan | Coupe | SUV |
-----------------------------------------
| 1 …
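
A hedged sketch of the usual remedy: fit a OneHotEncoder on the training data only and reuse it in production with handle_unknown="ignore", so a category that never appeared in training encodes as all zeros instead of changing the column count. The "Convertible" row and the frame layout are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"type_of_car": ["Hatchback", "Sedan", "Coupe", "SUV"]})

# Fit the encoder on the training data only; handle_unknown="ignore" means a
# category never seen in training encodes as all zeros instead of raising,
# so the number of columns stays identical in production.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

production = pd.DataFrame({"type_of_car": ["Sedan", "Convertible"]})
print(enc.transform(production).toarray())
print(enc.get_feature_names_out())
```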

Feature importances - Bagging, scikit-learn

北慕城南 submitted on 2019-12-22 04:53:07
Question: For a project I am comparing a number of decision trees, using the regression algorithms (Random Forest, Extra Trees, AdaBoost and Bagging) of scikit-learn. To compare and interpret them I use the feature importances, though for the bagging decision tree this does not appear to be available. My question: does anybody know how to get the feature-importances list for Bagging? Greetings, Kornee

Answer 1: Are you talking about BaggingClassifier? It can be used with many base estimators, so there is no…
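
A sketch of the workaround the answer appears to be heading toward: a fitted Bagging ensemble exposes its individual base estimators in estimators_, and when those are trees (the default) each one carries feature_importances_ that can be averaged. The data and settings below are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# The default base estimator of BaggingRegressor is a decision tree.
bag = BaggingRegressor(n_estimators=50, random_state=0).fit(X, y)

# BaggingRegressor has no feature_importances_ of its own, but each fitted
# tree in estimators_ does; averaging them gives a bagging-level importance.
importances = np.mean(
    [tree.feature_importances_ for tree in bag.estimators_], axis=0
)
print(importances)
```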