train-test-split

Normalize data before or after split of training and testing data?

百般思念 submitted on 2019-12-20 08:42:51
Question: I want to separate my data into train and test sets. Should I apply normalization to the data before or after the split? Does it make any difference when building a predictive model? Thanks in advance. Answer 1: You first need to split the data into a training and a test set (a validation set might also be required). Don't forget that test data points represent real-world data. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and …
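The advice in the answer excerpt (split first, then compute normalization statistics on the training portion only, so nothing leaks from the test set) can be sketched as follows; the data here is randomly generated purely for illustration:

```python
import numpy as np

# Hypothetical data standing in for the asker's dataset.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Split FIRST (80/20 here), then derive mean/std from the training rows only.
X_train, X_test = X[:80], X[80:]

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # reuse the TRAIN statistics on the test set
```

Normalizing before the split would let the test rows influence `mean` and `std`, which is a mild form of data leakage.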

What is the correct procedure to split the Data sets for classification problem?

岁酱吖の submitted on 2019-12-18 09:45:39
Question: I am new to machine learning and deep learning. I would like to clarify a doubt related to train_test_split before training. I have a data set of shape (302, 100, 5), where (207, 100, 5) belongs to class 0 and (95, 100, 5) belongs to class 1. I would like to perform classification using an LSTM (since this is sequence data). How can I split my data set for training, given that the classes do not have an equal distribution? Option 1: Consider the whole data [(302, 100, 5), both classes (0 & 1)], shuffle it, train …
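For an imbalanced two-class set like the one described, a stratified split keeps the 207:95 class ratio in both parts. A minimal sketch with scikit-learn's train_test_split (the arrays here are random placeholders with the shapes from the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Shapes from the question: 207 class-0 and 95 class-1 sequences of (100, 5).
rng = np.random.default_rng(0)
X = rng.normal(size=(302, 100, 5))
y = np.array([0] * 207 + [1] * 95)

# stratify=y preserves the class proportions in both train and test splits;
# shuffling is on by default.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

train_test_split indexes along the first axis, so 3D sequence arrays work as-is.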

sklearn TimeSeriesSplit Error: KeyError: '[ 0 1 2 …] not in index'

点点圈 submitted on 2019-12-08 09:50:55
Question: I want to use TimeSeriesSplit from sklearn on the following dataframe to predict sum. To prepare X and y I do the following: X = df.drop(['sum'], axis=1) and y = df['sum'], and then feed these two to: for train_index, test_index in tscv.split(X): X_train01, X_test01 = X[train_index], X[test_index]; y_train01, y_test01 = y[train_index], y[test_index]. By doing so, I get the following error: KeyError: '[ 0 1 2 ...] not in index'. Here X is a dataframe, and apparently this causes the error, because if …
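The KeyError comes from indexing a DataFrame with positional index arrays: X[train_index] does a column/label lookup on a DataFrame, while TimeSeriesSplit yields positional row indices. A sketch of the usual fix, using .iloc (the dataframe here is a small made-up stand-in for the one in the question):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Illustrative stand-in for the asker's dataframe.
df = pd.DataFrame({
    "feature": np.arange(10, dtype=float),
    "sum": np.arange(10, dtype=float) * 2,
})

X = df.drop(["sum"], axis=1)
y = df["sum"]

tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    # TimeSeriesSplit yields positional indices, so use .iloc,
    # not X[train_index] (which triggers label-based lookup).
    X_train01, X_test01 = X.iloc[train_index], X.iloc[test_index]
    y_train01, y_test01 = y.iloc[train_index], y.iloc[test_index]
```

Converting to numpy first (X.values) would also avoid the label lookup.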

Stratified Train/Validation/Test-split in scikit-learn

眉间皱痕 submitted on 2019-12-03 14:44:52
There is already a description here of how to do a stratified train/test split in scikit via train_test_split (Stratified Train/Test-split in scikit-learn) and a description of how to do a random train/validation/test split via np.split (How to split data into 3 sets (train, validation and test)?). But what about doing a stratified train/validation/test split? The closest approximation that comes to mind for a stratified (on class label) train/validation/test split is as follows, but I suspect there's a better way that can perhaps achieve this in one function call or in a more accurate way: Let …
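One common pattern for this (not necessarily the single-function-call answer the asker hopes for) is to apply train_test_split twice, stratifying on the labels at both steps: first split off the test set, then split the remainder into train and validation. A sketch with made-up data and a 60/20/20 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data; sizes and the 70:30 class balance are made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 70 + [1] * 30)

# Step 1: carve off 20% as the test set, stratified on y.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# Step 2: 25% of the remaining 80% -> 20% of the total as validation,
# stratified on the remaining labels.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)
```

Because stratification is applied at each step, all three subsets keep (up to rounding) the original class proportions.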

Keras split train test set when using ImageDataGenerator

青春壹個敷衍的年華 submitted on 2019-12-03 03:30:58
Question: I have a single directory which contains sub-folders (according to labels) of images. I want to split this data into train and test sets while using ImageDataGenerator in Keras. Although model.fit() in Keras has a validation_split argument for specifying the split, I could not find the same for model.fit_generator(). How can I do it? train_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True) train_generator = train_datagen.flow_from_directory( train_data_dir, target_size=(img_width, img_height), batch_size=32, class_mode='binary') model.fit_generator( …

How to perform k-fold cross validation with tensorflow?

蹲街弑〆低调 submitted on 2019-12-02 17:37:26
I am following the IRIS example of TensorFlow. In my case I have all the data in a single CSV file, not pre-separated, and I want to apply k-fold cross validation to that data. I have data_set = tf.contrib.learn.datasets.base.load_csv(filename="mydata.csv", target_dtype=np.int). How can I perform k-fold cross validation on this dataset with a multi-layer neural network, as in the IRIS example? I know this question is old, but in case someone is looking to do something similar, expanding on ahmedhosny's answer: the new tensorflow datasets API has the ability to create dataset objects using python …
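Since k-fold splitting only produces index arrays, one framework-agnostic approach (an alternative to the tf.data route mentioned above, not the answerer's exact method) is to generate the folds with scikit-learn's KFold and train a fresh model inside each iteration. A sketch with random stand-in data in place of the CSV:

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for the CSV data from the question: 30 samples, 4 features,
# integer targets. The real code would load these with a CSV reader.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = rng.integers(0, 3, size=30)

# KFold yields positional index arrays, so the same loop works no matter
# which framework (TensorFlow, Keras, ...) trains the model inside it.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # ... build and train a fresh model on (X_train, y_train) here,
    # then evaluate it on (X_test, y_test).
```

Rebuilding the model each fold matters: reusing one model would let earlier folds' test data leak into later folds' training.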

dimension mismatch error in CountVectorizer MultinomialNB

有些话、适合烂在心里 submitted on 2019-12-02 04:17:23
Question: Before I lodge this question, I have to say I've thoroughly read more than 15 similar topics on this board, each with somewhat different recommendations, but none of them could get me right. OK, so I split my 'spam email' text data (originally in CSV format) into training and test sets, using CountVectorizer and its fit_transform function to fit the vocabulary of the corpus and extract word-count features from the text. I then applied MultinomialNB() to learn from the training set and predict on the test set. Here is my code (simplified): from sklearn.feature_extraction.text import …
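The dimension mismatch in this setup usually comes from calling fit_transform on both sets, which builds two different vocabularies with different column counts. The standard pattern is fit_transform on the training texts and plain transform on the test texts. A sketch with a tiny made-up corpus in place of the spam-email data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus standing in for the asker's CSV data.
train_texts = ["win money now", "meeting at noon", "free money offer", "lunch tomorrow"]
train_labels = [1, 0, 1, 0]
test_texts = ["free offer now", "see you at lunch"]

vectorizer = CountVectorizer()
# Fit the vocabulary on the TRAINING texts only...
X_train = vectorizer.fit_transform(train_texts)
# ...and only transform the test texts with that same vocabulary.
# Calling fit_transform again here would produce a matrix with a
# different number of columns -> the dimension mismatch error.
X_test = vectorizer.transform(test_texts)

clf = MultinomialNB().fit(X_train, train_labels)
preds = clf.predict(X_test)
```

Words in the test set that are absent from the training vocabulary are simply dropped by transform, which is the expected behavior.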
