Question
What are some key factors for increasing or stabilizing the accuracy score (so that it does not vary significantly between runs) of this basic KNN model on the Iris data?
Attempt
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
iris = datasets.load_iris()
X, y = iris.data[:, :], iris.target
Xtrain, Xtest, y_train, y_test = train_test_split(X, y)
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain = scaler.transform(Xtrain)
Xtest = scaler.transform(Xtest)
knn = neighbors.KNeighborsClassifier(n_neighbors=4)
knn.fit(Xtrain, y_train)
y_pred = knn.predict(Xtest)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
Sample Accuracy Scores
0.9736842105263158
0.9473684210526315
1.0
0.9210526315789473
Classification Report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       0.79      1.00      0.88        11
           2       1.00      0.80      0.89        15

    accuracy                           0.92        38
   macro avg       0.93      0.93      0.92        38
weighted avg       0.94      0.92      0.92        38
Sample Confusion Matrix
[[12 0 0]
[ 0 11 0]
[ 0 3 12]]
Answer 1:
I would recommend tuning the value of k for k-NN. As Iris is a small and nicely balanced dataset, I would do the following:
For every value of `k` in a range (say 2 to 10):
1. Perform an n-times repeated cross-validation with a fixed number of folds (say n=20 repetitions of 4 folds).
2. Store the accuracy values (or any other metric).
Plot the scores by their average and variance, and select the value of k that gives the best score. The main purpose of cross-validation is to estimate the test error, and based on that estimate you select the final model. There will be some variance, but it should be small (less than 0.03 or something like that); this depends on the dataset and the number of folds you take. One good process is to make, for each value of k, a boxplot of all 20x4 accuracy values. Select the value of k for which the lower quartile nearly meets the upper quartile or, in simple words, for which the accuracy (or other metric) does not change much from fold to fold.
Once you have selected the value of k this way, use it to build the final model on the entire training dataset. That model can then be used to predict new data.
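The repeated cross-validation procedure described above can be sketched roughly as follows. This is a minimal sketch, not a definitive recipe: the range 2 to 10, the 20 repetitions and the 4 folds come from the text, while `random_state=0`, the use of a pipeline for scaling, and the variable name `scores_by_k` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 20 repetitions of stratified 4-fold CV give 80 accuracy values per candidate.
# random_state is fixed only so the sketch is reproducible (an assumption).
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=20, random_state=0)

scores_by_k = {}
for n_neighbors in range(2, 11):
    # Scaling sits inside the pipeline so it is refit on each training fold.
    model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier(n_neighbors=n_neighbors))
    scores_by_k[n_neighbors] = cross_val_score(model, X, y, cv=cv,
                                               scoring="accuracy")

# Summarise each candidate by the mean and spread of its fold accuracies.
for n_neighbors, scores in scores_by_k.items():
    print(f"k={n_neighbors:2d}  mean={scores.mean():.3f}  std={scores.std():.3f}")
```

The arrays stored in `scores_by_k` are exactly the per-fold values one would feed into a boxplot, one box per value of k.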
For larger datasets, on the other hand, make a separate test partition (as you did here) and then tune the value of k on the training set only (using cross-validation; forget about the test set). After selecting an appropriate k, train the algorithm using only the training set. Next, use the test set to report the final value. Never make any decision based on the test set.
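One possible way to express this hold-out workflow in scikit-learn is with `GridSearchCV`, which tunes on the training set only and refits the best model on it. This is a sketch under assumptions: the 25% test size, the 4 folds, the candidate range 2 to 10 and `random_state=0` are illustrative choices, not values prescribed by the answer.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Set the test partition aside first; it is not touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

# Cross-validated search over k, using the training set only.
search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": list(range(2, 11))},
    cv=StratifiedKFold(n_splits=4, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_train, y_train)  # refits the best pipeline on all of X_train

best_k = search.best_params_["knn__n_neighbors"]
# The test set is used exactly once, to report the final score.
test_accuracy = search.score(X_test, y_test)
print(best_k, test_accuracy)
```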
Yet another method is a train, validation, test partition. Train models on the training set using different values of k, then predict on the validation partition and list the scores. Select the best value of k based on these validation scores. Next, use the train or train+validation set to train the final model with the selected k. Finally, take out the test set and report the final score. Again, never use the test set anywhere else.
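The train, validation, test variant might look like the sketch below. The partition sizes (90/30/30), the candidate range 2 to 10, `random_state=0` and the helper `make_model` are hypothetical choices for illustration; the final model here is retrained on train+validation, which is one of the two options mentioned above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# First carve off 30 test samples, then 30 validation samples (stratified).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=30, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=30, stratify=y_trainval, random_state=0)


def make_model(n_neighbors):
    """Hypothetical helper: scaler + k-NN in a single pipeline."""
    return make_pipeline(StandardScaler(),
                         KNeighborsClassifier(n_neighbors=n_neighbors))


# Fit on the training set, score each candidate k on the validation set.
val_scores = {k: make_model(k).fit(X_train, y_train).score(X_val, y_val)
              for k in range(2, 11)}
best_k = max(val_scores, key=val_scores.get)

# Retrain with the chosen k on train+validation, then report on the test set once.
final_model = make_model(best_k).fit(X_trainval, y_trainval)
test_accuracy = final_model.score(X_test, y_test)
print(best_k, test_accuracy)
```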
These are general methods applicable to any machine learning or statistical learning methods.
An important thing to note: when you perform the partition (train/test or for cross-validation), use stratified sampling so that the class ratios stay the same in each partition.
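To see what stratification does, the following sketch compares the class counts in a test partition drawn with and without the `stratify` argument of `train_test_split`. The test size of 45 and `random_state=0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same seed and test size; only the stratify argument differs.
_, _, _, y_test_plain = train_test_split(X, y, test_size=45, random_state=0)
_, _, _, y_test_strat = train_test_split(X, y, test_size=45, stratify=y,
                                         random_state=0)

# Count how many samples of each of the 3 classes landed in the test set.
plain_counts = np.bincount(y_test_plain, minlength=3)
strat_counts = np.bincount(y_test_strat, minlength=3)

print("without stratify:", plain_counts)
print("with stratify:   ", strat_counts)  # 15 samples per class
```

Because Iris has 50 samples per class, the stratified test set contains each class equally, while the plain split is free to drift from that ratio.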
Read more about cross-validation. In scikit-learn it is easy to do. If you are using R, you can use the caret package.
The main thing to remember is that the target is to train a function that generalises to new data, that is, one that performs well on new data and not only on the existing data.
Answer 2:
There are only 3 classes in the Iris dataset: Iris-Setosa, Iris-Virginica, and Iris-Versicolor.
Use this code. It gives me 97.78% accuracy.
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
iris = datasets.load_iris()
X, y = iris.data[:, :], iris.target
Xtrain, Xtest, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 0, train_size = 0.7)
scaler = preprocessing.StandardScaler().fit(Xtrain)
Xtrain = scaler.transform(Xtrain)
Xtest = scaler.transform(Xtest)
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(Xtrain, y_train)
y_pred = knn.predict(Xtest)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
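The code above pins one particular split with `random_state = 0`, so its score is reproducible but still belongs to a single partition. As a rough sketch of how much the score can move when only the seed changes, the loop below repeats the same setup (stratified split, `train_size = 0.7`, `n_neighbors=3`) for 20 seeds; the number of seeds and the name `accuracies` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

accuracies = []
for seed in range(20):  # 20 seeds is an arbitrary illustrative choice
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=seed, train_size=0.7)
    # Scaler is fit on the training part only, as in the code above.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
    model.fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

accuracies = np.array(accuracies)
print(f"min={accuracies.min():.4f}  max={accuracies.max():.4f}  "
      f"mean={accuracies.mean():.4f}  std={accuracies.std():.4f}")
```

Reporting the mean and spread over several splits gives a steadier picture than any single number.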
Source: https://stackoverflow.com/questions/56895458/accuracy-score-for-a-knn-model-iris-data