train-test-split

Should Feature Selection be done before Train-Test Split or after?

被刻印的时光 ゝ submitted on 2019-12-01 14:24:56
Two facts point to opposite answers to this question. The conventional answer is to select features after splitting, because selecting them before the split can leak information from the test set. The opposing argument is that if only the training set drawn from the whole dataset is used for feature selection, then the selected features, or the order of the feature importance scores, are likely to change whenever the random_state of train_test_split changes. And if the feature selection for any particular work changes, then no Generalization of Feature Importance can
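
The conventional "select after splitting" answer can be sketched as follows. This is a minimal illustration, not the original poster's code: it assumes scikit-learn, a synthetic dataset from make_classification, SelectKBest with f_classif as the selector, and LogisticRegression as the classifier, all of which are stand-ins chosen for the example. Putting the selector inside a Pipeline means it is fitted on the training rows only, so the test set plays no part in choosing features.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 300 samples, 20 features, 5 informative.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Split first, so the test rows are never seen during feature selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The selector is a pipeline step, so fit() scores features on X_train only.
pipe = Pipeline(
    [
        ("select", SelectKBest(score_func=f_classif, k=5)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
pipe.fit(X_train, y_train)

# Column indices kept by the selector, and accuracy on the untouched test set.
selected = pipe.named_steps["select"].get_support(indices=True)
test_accuracy = pipe.score(X_test, y_test)
print("selected feature indices:", selected)
print("test accuracy:", round(test_accuracy, 3))
```

The same pipeline object can be passed to cross-validation utilities, in which case the selection is repeated inside every training fold.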

What is the correct procedure to split the Data sets for classification problem?

坚强是说给别人听的谎言 submitted on 2019-11-29 18:07:26
I am new to machine learning and deep learning, and I would like to clear up a doubt about train_test_split before training. I have a data set of shape (302, 100, 5), where (207, 100, 5) belongs to class 0 and (95, 100, 5) belongs to class 1. I would like to perform classification using an LSTM (since the data is sequential). How can I split my data set for training, given that the classes are not equally represented? Option 1: Take the whole data set [(302, 100, 5), both classes (0 & 1)], shuffle it, apply train_test_split, and proceed with training. Option 2 : Split both class data set equally [(95,100,5) - class 0 & (95,100
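
One common way to handle the unequal class sizes in Option 1 is a stratified split, sketched below. This is an illustration under assumptions rather than the accepted answer to the question: the arrays are random placeholders with the shapes quoted above, and the 80/20 split and the random_state value are arbitrary choices. Passing stratify=y to train_test_split keeps the class 0 / class 1 proportions roughly the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder sequences (assumption) with the shapes from the question.
X_class0 = rng.normal(size=(207, 100, 5))
X_class1 = rng.normal(size=(95, 100, 5))

# Stack both classes and build the matching label vector.
X = np.concatenate([X_class0, X_class1], axis=0)  # shape (302, 100, 5)
y = np.concatenate([np.zeros(207, dtype=int), np.ones(95, dtype=int)])

# Shuffle and split while preserving the class ratio in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train:", X_train.shape, "class 1 fraction:", round(y_train.mean(), 3))
print("test: ", X_test.shape, "class 1 fraction:", round(y_test.mean(), 3))
```

The split only indexes along the first axis, so each (100, 5) sequence stays intact and can be fed to an LSTM unchanged.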

Should Feature Selection be done before Train-Test Split or after?

☆樱花仙子☆ submitted on 2019-11-28 07:08:15
Question: Two facts point to opposite answers to this question. The conventional answer is to select features after splitting, because selecting them before the split can leak information from the test set. The opposing argument is that if only the training set drawn from the whole dataset is used for feature selection, then the selected features, or the order of the feature importance scores, are likely to change whenever the random_state of train_test_split changes. And if
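
The concern that the selected features shift with random_state can be checked empirically. The sketch below is one hypothetical way to do so, assuming scikit-learn, synthetic data from make_classification, and SelectKBest with f_classif as the selector: it repeats the split with ten different seeds and records how often each feature is chosen from the training part alone.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 samples, 15 features, 4 informative.
X, y = make_classification(
    n_samples=200, n_features=15, n_informative=4, random_state=0
)

k = 4              # number of features kept per split
seeds = range(10)  # random_state values to compare
selection_counts = Counter()

for seed in seeds:
    # Only the training part is used to score and select features.
    X_train, _, y_train, _ = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    selector = SelectKBest(score_func=f_classif, k=k).fit(X_train, y_train)
    selection_counts.update(selector.get_support(indices=True).tolist())

# Fraction of splits in which each feature was selected.
selection_frequency = {
    feature: count / len(seeds)
    for feature, count in sorted(selection_counts.items())
}
print(selection_frequency)
```

Features with a frequency near 1.0 are chosen regardless of the split, while low frequencies mark features whose selection depends on which rows landed in the training set.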

Order between using validation, training and test sets

六眼飞鱼酱① submitted on 2019-11-27 09:38:26
I am trying to understand the process of model evaluation and validation in machine learning, specifically in which order and how the training, validation and test sets must be used. Let's say I have a dataset and I want to use linear regression, and I am hesitating among various polynomial degrees (hyper-parameters). This Wikipedia article seems to imply that the sequence should be: (1) Split the data into a training set, a validation set and a test set. (2) Use the training set to fit the model (find the best parameters: the coefficients of the polynomial). (3) Afterwards, use the validation set to find the best
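
The sequence described above can be sketched as follows. This is a minimal example under assumptions, not the procedure from the cited article: the data is a synthetic noisy cubic, the 60/20/20 split and the candidate degrees 1 to 6 are arbitrary, and validation mean squared error is assumed as the selection criterion. The test set is touched once, after the degree has been chosen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): a noisy cubic in one variable.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=1.0, size=300)

# Step 1: split into training (60%), validation (20%) and test (20%) sets.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Steps 2 and 3: fit each candidate degree on the training set,
# then score it on the validation set.
val_errors = {}
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    val_errors[degree] = mean_squared_error(y_val, model.predict(X_val))

# Pick the degree with the lowest validation error.
best_degree = min(val_errors, key=val_errors.get)

# Final step: report the error of the chosen model on the held-out test set.
final_model = make_pipeline(PolynomialFeatures(best_degree), LinearRegression())
final_model.fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))

print("validation MSE by degree:", val_errors)
print("best degree:", best_degree, "test MSE:", round(test_mse, 3))
```

A common variation is to refit the chosen degree on the training and validation sets combined before the final test evaluation; the sketch keeps the simpler form.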