Question
Okay, this is interesting. I executed the same code a couple of times, and each time I got a different accuracy_score. I figured it was because I was not passing a random_state value to train_test_split, so I used random_state=0 and got a consistent accuracy_score of 82%. But then I thought I'd try a different random_state number, set random_state=128, and the accuracy_score became 84%.
Now I need to understand why that is, and how random_state affects the accuracy of the model.
The outputs are below:
1> without random_state:
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[90 22]
[21 46]]
0.7597765363128491
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[104 16]
[ 14 45]]
0.8324022346368715
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[90 18]
[12 59]]
0.8324022346368715
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[99 9]
[19 52]]
0.8435754189944135
2> with random_state = 128 (accuracy_score = 84%):
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[106 13]
[ 15 45]]
0.8435754189944135
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[106 13]
[ 15 45]]
0.8435754189944135
3> with random_state = 0 (accuracy_score = 82%):
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[93 17]
[15 54]]
0.8212290502793296
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[93 17]
[15 54]]
0.8212290502793296
Answer 1:
Essentially, random_state makes sure your code outputs the same results each time by performing exactly the same data splits each time. This is mostly helpful for your initial train/test split, and for creating code that others can replicate exactly.
Splitting the data the same vs. differently
The first thing to understand is that if you don't set random_state, the data will be split differently each time, which means your training and test sets will be different. This might not make a huge difference, but it will result in slight variations in your model parameters, accuracy, and so on. If you do set random_state to the same value each time, like random_state=0, then the data will be split the same way each time.
Each random_state results in a different split
The second thing to understand is that each random_state value results in a different split and different behavior. So you need to keep random_state at the same value if you want to be able to replicate your results.
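A short sketch of the same effect, again on synthetic data and with LogisticRegression standing in for the asker's model (both assumptions on my part): each seed gives a different, but reproducible, score.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=42)

    # Each random_state selects a different train/test partition, so the
    # fitted model and its test accuracy vary slightly from seed to seed.
    for seed in (0, 128, 7):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        print(seed, accuracy_score(y_test, model.predict(X_test)))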
Your model can have multiple random_state pieces
The third thing to understand is that multiple pieces of your model might have randomness in them. For example, train_test_split accepts random_state, but so does RandomForestClassifier. So in order to get the exact same results each time, you'll need to set random_state for every piece of your model that has randomness in it.
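For instance, a sketch that pins both sources of randomness (synthetic data again; RandomForestClassifier is just one example of an estimator with its own random_state):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=42)

    # Fix the split's randomness...
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # ...and the model's randomness (bootstrap sampling and feature
    # subsampling in a random forest). Now every run is identical.
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))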
Conclusions
If you're using random_state to do your initial train/test split, you'll want to set it once and use that split going forward, to avoid overfitting to your test set.
Generally speaking, you can use cross-validation to assess the accuracy of your model and not worry too much about the random_state.
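For example, with scikit-learn's cross_val_score (sketched here on synthetic data), the score is averaged over several folds, so no single random split dominates the estimate:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, random_state=42)

    # 5-fold cross-validation: each fold serves as the test set once,
    # and the mean score is far less sensitive to any one split.
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
    print(scores.mean(), scores.std())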
A very important note is that you should not use random_state to try to improve the accuracy of your model. By definition, that will result in your model overfitting your data and not generalizing as well to unseen data.
Source: https://stackoverflow.com/questions/63585051/random-states-contribution-to-accuracy