Classification results depend on random_state?

Submitted by 浪子不回头ぞ on 2019-11-30 06:05:17

Question


I want to implement an AdaBoost model using scikit-learn (sklearn). My question is similar to another question, but it is not exactly the same. As far as I understand, the random_state variable described in the documentation is for randomly splitting the training and testing sets, according to the previous link. So if I understand correctly, my classification results should not depend on the seed. Is that correct? Should I be worried if my classification results turn out to depend on the random_state variable?


Answer 1:


Your classification scores will depend on random_state. As @Ujjwal rightly said, it is used for splitting the data into training and test sets. Beyond that, many algorithms in scikit-learn use random_state to select subsets of features, select subsets of samples, determine initial weights, and so on.

For example:

  • Tree-based estimators (like DecisionTreeClassifier, RandomForestClassifier) use random_state for random selection of features and samples.

  • In clustering estimators like KMeans, random_state is used to initialize the cluster centers.

  • SVMs use it when shuffling the data for probability estimation (for example, SVC with probability=True).

  • Some feature selection algorithms also use it for their initial selection.

  • And many more...
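To illustrate the tree-based case from the list above, here is a minimal sketch. The synthetic dataset, the fixed 200/100 split, and the helper name forest_accuracy are assumptions made for illustration only. The data and the split are held constant, so any change in accuracy comes from the estimator's own random_state.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data generated with a fixed seed, so the data never changes.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Fixed, non-random split: first 200 rows for training, the rest for testing.
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]


def forest_accuracy(model_seed):
    """Hypothetical helper: test accuracy of a small forest for one seed."""
    model = RandomForestClassifier(n_estimators=10, random_state=model_seed)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Only the estimator seed varies here; the scores may differ between seeds.
for seed in range(5):
    print(f"random_state={seed}: accuracy={forest_accuracy(seed):.3f}")
```

Calling forest_accuracy twice with the same seed should return the same score, which is the reproducibility that a fixed random_state buys.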

It's mentioned in the documentation that:

If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.
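The guideline quoted above can be sketched roughly as follows. The function sample_indices is a hypothetical example, not part of scikit-learn; it relies on sklearn.utils.check_random_state, which turns None, an integer, or an existing RandomState into a numpy.random.RandomState object.

```python
import numpy as np
from sklearn.utils import check_random_state


def sample_indices(n_samples, n_draws, random_state=None):
    """Hypothetical helper: draw n_draws distinct row indices.

    All randomness goes through a RandomState built from the
    random_state argument instead of the global numpy.random functions.
    """
    rng = check_random_state(random_state)
    return rng.choice(n_samples, size=n_draws, replace=False)


# The same integer seed gives the same draw on every call.
first = sample_indices(100, 5, random_state=0)
second = sample_indices(100, 5, random_state=0)
print(first, second, np.array_equal(first, second))
```

With random_state=None the helper falls back to NumPy's global random state, so repeated calls are not expected to match.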

The following questions and answers are worth reading for a better understanding:

  • Choosing random_state for sklearn algorithms
  • confused about random_state in decision tree of scikit learn



Answer 2:


It does matter. When your training set differs, your trained model also changes. With a different subset of the data, you can end up with a classifier that is a little different from one trained on another subset.

Hence, you should use a constant seed, such as 0 or another integer, so that your results are reproducible.
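As a small sketch of this point, assuming a toy array of 20 row indices stands in for a real dataset, train_test_split returns the same subsets whenever the seed is the same, while a different seed will usually select different rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for a dataset: 20 row indices.
indices = np.arange(20)

# Same seed twice, then a different seed.
train_a, test_a = train_test_split(indices, test_size=0.25, random_state=0)
train_b, test_b = train_test_split(indices, test_size=0.25, random_state=0)
train_c, test_c = train_test_split(indices, test_size=0.25, random_state=1)

print("seed 0:      ", test_a)
print("seed 0 again:", test_b)  # identical to the first split
print("seed 1:      ", test_c)  # usually a different set of rows
```

Because the rows in the training set change with the seed, a model fitted on train_c can differ from one fitted on train_a even when the estimator itself is deterministic.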



Source: https://stackoverflow.com/questions/42476032/classification-results-depend-on-random-state
