Question
I am using the following code and would like to get the same results on every run with the same random seed. However, even with a fixed seed (1 in this case), the results differ between runs. Here is the code:
import pandas as pd
import numpy as np
from random import seed
# Load scikit's random forest classifier library
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
seed(1) ### <-----
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
dataset2 = pd.read_csv(file_path, header=None, sep=',')
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
#Encoding
y = le.fit_transform(dataset2[60])
dataset2[60] = y
train, test = train_test_split(dataset2, test_size=0.1)
y = train[60]
y_test = test[60]
clf = RandomForestClassifier(n_jobs=100, random_state=0)
features = train.columns[0:59]
clf.fit(train[features], y)
# Apply the Classifier we trained to the test data
y_pred = clf.predict(test[features])
# Decode
y_test_label = le.inverse_transform(y_test)
y_pred_label = le.inverse_transform(y_pred)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test_label, y_pred_label))
# Two separate runs produced different results:
# 0.761904761905
# 0.90476190476
Answer 1:
Your code:
import numpy as np
from random import seed
seed(1) ### <-----
sets the seed of Python's built-in random module.
But scikit-learn draws its randomness from NumPy's random number generator, not from Python's random module, as its documentation explains:
For testing and replicability, it is often important to have the entire execution controlled by a single seed for the pseudo-random number generator used in algorithms that have a randomized component. Scikit-learn does not use its own global random state; whenever a RandomState instance or an integer random seed is not provided as an argument, it relies on the numpy global random state, which can be set using numpy.random.seed. For example, to set an execution’s numpy global random state to 42, one could execute the following in his or her script:
import numpy as np
np.random.seed(42)
So in general you should do:
np.random.seed(1)
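To illustrate why the original call has no effect, here is a minimal sketch showing that Python's random module and NumPy's global random state are seeded independently. The helper names draw_after_python_seed and draw_after_numpy_seed are made up for this illustration.

```python
import random

import numpy as np


def draw_after_python_seed():
    # Seeds only Python's built-in generator; NumPy's global state is untouched.
    random.seed(1)
    return np.random.rand(3)


def draw_after_numpy_seed():
    # Seeds NumPy's global state, which scikit-learn falls back on
    # when no random_state is passed.
    np.random.seed(1)
    return np.random.rand(3)


a = draw_after_numpy_seed()
b = draw_after_numpy_seed()
print(np.array_equal(a, b))  # True: NumPy was reseeded before each draw

c = draw_after_python_seed()
d = draw_after_python_seed()
# Expected to be False: NumPy's state simply keeps advancing between the two draws.
print(np.array_equal(c, d))
```

The second pair differs because random.seed(1) never resets the generator that np.random.rand reads from.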
This is only part of the story, though: seeding the global state is often unnecessary if you pass an explicit seed (the random_state argument) to every scikit-learn component that uses randomness.
As ShreyasG mentioned, this also applies to train_test_split, which is called without a random_state in your code.
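As a rough sketch of the explicit-seed approach, the snippet below passes random_state to both train_test_split and RandomForestClassifier. It assumes synthetic data from make_classification instead of the sonar file so that it runs without a download; the run_once helper and the n_estimators value are choices made for this illustration, not part of the original code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def run_once(seed):
    # Synthetic stand-in for the sonar data (assumption): 208 rows, 60 features.
    X, y = make_classification(n_samples=208, n_features=60, random_state=0)

    # Fix the split explicitly instead of relying on any global seed.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=seed
    )

    # Fix the forest's bootstrap sampling and feature selection as well.
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))


first = run_once(1)
second = run_once(1)
print(first == second)  # True: same split and same forest on both runs
```

With every random component seeded this way, no call to np.random.seed is needed, and the result does not depend on whatever other code has done to the global state.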
Source: https://stackoverflow.com/questions/46661426/why-random-seed-does-not-make-results-constant-in-python