Okay, this is interesting. I executed the same code a couple of times, and each time I got a different `accuracy_score`. I figured it was because I was not setting `random_state` anywhere.
Essentially, `random_state` makes sure your code outputs the same results each time by performing the exact same data splits each time. This is mostly helpful for your initial train/test split, and for creating code that others can replicate exactly.
The first thing to understand is that if you don't use `random_state`, then the data will be split differently each time, which means that your training set and test set will be different. This might not make a huge difference, but it will result in slight variations in your model parameters, accuracy, etc. If you do set `random_state` to the same value each time, like `random_state=0`, then the data will be split the same way each time.
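A minimal sketch of this behavior, using a toy array so the splits are easy to compare:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix
y = np.arange(10)                 # toy labels

# Two splits with the same random_state produce identical partitions
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=0)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=0)
print(np.array_equal(X_te1, X_te2))  # True
```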
The second thing to understand is that each `random_state` value will result in different splits and different behavior. So you need to keep `random_state` at the same value if you want to be able to replicate results.
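For example (same toy data as above), two different seed values will almost certainly shuffle the data differently:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Different random_state values give different test sets
_, te_a, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
_, te_b, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
print(np.array_equal(te_a, te_b))
```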
The third thing to understand is that multiple pieces of your model might have randomness in them. For example, your `train_test_split` can accept `random_state`, but so can `RandomForestClassifier`. So in order to get the exact same results each time, you'll need to set `random_state` for each piece of your model that has randomness in it.
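A sketch of seeding every randomized component (the dataset, estimator, and parameter choices here are just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)

# Seed both the split AND the forest
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
acc1 = accuracy_score(y_te, clf.predict(X_te))

# Re-running the whole pipeline reproduces the same score
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
acc2 = accuracy_score(y_te, clf.predict(X_te))
print(acc1 == acc2)  # True
```

Leave out either `random_state` and the two runs can diverge.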
If you're using `random_state` to do your initial train/test split, you're going to want to set it once and use that split going forward, to avoid overfitting to your test set.
Generally speaking, you can use cross-validation to assess the accuracy of your model and not worry too much about the `random_state`.
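A quick sketch of that approach with `cross_val_score` (toy dataset and estimator chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold CV averages performance over several different splits,
# so no single lucky (or unlucky) split dominates the estimate
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```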
A very important note is that you should not tune `random_state` to try to improve the accuracy of your model. This is by definition going to result in your model overfitting your data, and not generalizing as well to unseen data.