Question
Okay, this is interesting. I executed the same code a couple of times, and each time I got a different accuracy_score. I figured it was because I was not passing a random_state value to train_test_split, so I used random_state=0 and got a consistent accuracy_score of 82%. But then I thought I'd try a different random_state number, set random_state=128, and the accuracy_score became 84%.
Now I need to understand why that is, and how random_state affects the accuracy of the model.
The outputs are below:
1> without random_state:
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[90 22]
[21 46]]
0.7597765363128491
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[104 16]
[ 14 45]]
0.8324022346368715
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[90 18]
[12 59]]
0.8324022346368715
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[99 9]
[19 52]]
0.8435754189944135
2> with random_state = 128 (accuracy_score = 84%):
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[106 13]
[ 15 45]]
0.8435754189944135
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[106 13]
[ 15 45]]
0.8435754189944135
3> with random_state = 0 (accuracy_score = 82%):
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[93 17]
[15 54]]
0.8212290502793296
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
[[93 17]
[15 54]]
0.8212290502793296
Answer 1:
Essentially, random_state makes sure your code outputs the same results each time by performing exactly the same data splits each time. This is mostly helpful for your initial train/test split, and for creating code that others can replicate exactly.
Splitting the data the same vs. differently
The first thing to understand is that if you don't set random_state, the data will be split differently each time, which means your training and test sets will be different. This might not make a huge difference, but it will result in slight variations in your model parameters, accuracy, and so on. If you do set random_state to the same value each time, like random_state=0, then the data will be split the same way each time.
Each random_state results in a different split
The second thing to understand is that each random_state value results in a different split and different behavior. So you need to keep random_state at the same value if you want to be able to replicate your results.
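A short sketch of the same effect, again on synthetic data and with LogisticRegression standing in for the asker's model (both assumptions on my part): each seed gives a different, but reproducible, score.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=42)

    # Each random_state selects a different train/test partition, so the
    # fitted model and its test accuracy vary slightly from seed to seed.
    for seed in (0, 128, 7):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        print(seed, accuracy_score(y_test, model.predict(X_test)))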
Your model can have multiple random_state pieces
The third thing to understand is that multiple pieces of your model might have randomness in them. For example, train_test_split accepts random_state, but so does RandomForestClassifier. So in order to get the exact same results each time, you'll need to set random_state for every piece of your model that has randomness in it.
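For instance, a sketch that pins both sources of randomness (synthetic data again; RandomForestClassifier is just one example of an estimator with its own random_state):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=42)

    # Fix the split's randomness...
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # ...and the model's randomness (bootstrap sampling and feature
    # subsampling in a random forest). Now every run is identical.
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))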
Conclusions
If you're using random_state to do your initial train/test split, you'll want to set it once and use that split going forward, to avoid overfitting to your test set.
Generally speaking, you can use cross-validation to assess the accuracy of your model and not worry too much about the random_state.
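For example, with scikit-learn's cross_val_score (sketched here on synthetic data), the score is averaged over several folds, so no single random split dominates the estimate:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, random_state=42)

    # 5-fold cross-validation: each fold serves as the test set once,
    # and the mean score is far less sensitive to any one split.
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
    print(scores.mean(), scores.std())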
A very important note is that you should not use random_state to try to improve the accuracy of your model. By definition, that will result in your model overfitting your data and not generalizing as well to unseen data.
Source: https://stackoverflow.com/questions/63585051/random-states-contribution-to-accuracy