Machine Learning Training & Test data split method

I was running a random forest classification model and initially split the data into train (80%) and test (20%) sets. However, the predictions had too many false positives, which I think was because there was too much noise in the training data, so I decided to split the data differently. Here's how I did it.

Since I thought the high number of false positives was due to noise in the training data, I built the training set to have an equal number of rows for each value of the target variable. For example, if I have 10,000 rows and the target variable is 0 for 8,000 of them and 1 for 2,000, I made the training data 4,000 rows in total, 2,000 with target 0 and 2,000 with target 1, so that the training data now has more signal.
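The balancing step described above can be sketched roughly as follows. This is only an illustration using the example counts (8,000 zeros, 2,000 ones); the helper name `undersample_majority` and the fixed seed are assumptions, not code from the original post. It also assumes the function is applied to training rows only, so that any held-out test rows keep their original class ratio.

```python
import random

def undersample_majority(labels, seed=0):
    """Return sorted row indices of a class-balanced subset of `labels`.

    Hypothetical helper: every class is randomly cut down to the size
    of the smallest class.
    """
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    # Size of the smallest class decides how many rows each class keeps.
    n_min = min(len(idx) for idx in by_class.values())
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n_min))
    return sorted(keep)

# Toy labels matching the example: 8,000 zeros and 2,000 ones.
labels = [0] * 8000 + [1] * 2000
balanced_idx = undersample_majority(labels)
balanced_labels = [labels[i] for i in balanced_idx]
print(len(balanced_labels), sum(balanced_labels))  # 4000 2000
```

The returned indices can then be used to select the matching feature rows for training.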

When I tried this new splitting method, the model predicted much better: recall for the positive class increased from 14% to 70%.

I would love to hear your feedback if I am doing anything wrong here. I am concerned that I may be making my training data biased.

When the classes in the training set have unequal numbers of data points, the baseline (random prediction) changes.

By noisy data, I think you mean that the number of training points for one class is much larger than for the other. This is not really called noise. It is class imbalance, which acts as a bias.

For example: you have 10,000 data points in the training set, 8,000 of class 0 and 2,000 of class 1. I can predict class 0 all the time and already get 80% accuracy. This induces a bias, and the baseline for 0-1 classification will not be 50%.
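The baseline arithmetic can be checked with a small sketch. It uses the class counts from the question (8,000 rows of class 0, 2,000 of class 1) and a trivial "always predict the majority class" rule; the variable names are illustrative only.

```python
from collections import Counter

# Class counts from the question: 8,000 of class 0, 2,000 of class 1.
labels = [0] * 8000 + [1] * 2000
counts = Counter(labels)

# A trivial model that always predicts the most frequent class.
majority_class, majority_count = counts.most_common(1)[0]
predictions = [majority_class] * len(labels)

# Accuracy of that trivial model equals the majority class share.
baseline_accuracy = majority_count / len(labels)

# Recall for class 1: the trivial model never predicts it.
true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
recall_pos = true_pos / counts[1]

print(baseline_accuracy, recall_pos)  # 0.8 0.0
```

So 80% accuracy here says nothing about how well class 1 is detected, which is why recall is the more informative number on imbalanced data.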

To remove this bias, you can either intentionally balance the training set as you did, or change the error function by giving each class a weight inversely proportional to its number of points in the training set.
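The weighting option can be sketched like this. The formula below, n_samples / (n_classes * class_count), is one common way to make weights inversely proportional to class size; the helper name `balanced_class_weights` is hypothetical, and the counts are again those from the question.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (hypothetical helper).

    Uses n_samples / (n_classes * class_count), so a perfectly balanced
    data set would give every class a weight of 1.0.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Class counts from the question: 8,000 of class 0, 2,000 of class 1.
labels = [0] * 8000 + [1] * 2000
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

With these weights, an error on a class 1 row costs four times as much as an error on a class 0 row, and no rows are thrown away. In scikit-learn, passing `class_weight="balanced"` to `RandomForestClassifier` applies the same formula, so the weights do not need to be computed by hand.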

Source: https://stackoverflow.com/questions/38640065/machine-learning-training-test-data-split-method

Tags

machine-learning

scikit-learn

training-data

confusion-matrix