问题
So I'm using sci-kit learn to classify some data. I have 13 different class values/categorizes to classify the data to. Now I have been able to use cross validation and print the confusion matrix. However, it only shows the TP and FP etc without the classlabels, so I don't know which class is what. Below is my code and my output:
def classify_data(df, feature_cols, file):
nbr_folds = 5
RANDOM_STATE = 0
attributes = df.loc[:, feature_cols] # Also known as x
class_label = df['task'] # Class label, also known as y.
file.write("\nFeatures used: ")
for feature in feature_cols:
file.write(feature + ",")
print("Features used", feature_cols)
sampler = RandomOverSampler(random_state=RANDOM_STATE)
print("RandomForest")
file.write("\nRandomForest")
rfc = RandomForestClassifier(max_depth=2, random_state=RANDOM_STATE)
pipeline = make_pipeline(sampler, rfc)
class_label_predicted = cross_val_predict(pipeline, attributes, class_label, cv=nbr_folds)
conf_mat = confusion_matrix(class_label, class_label_predicted)
print(conf_mat)
accuracy = accuracy_score(class_label, class_label_predicted)
print("Rows classified: " + str(len(class_label_predicted)))
print("Accuracy: {0:.3f}%\n".format(accuracy * 100))
file.write("\nClassifier settings:" + str(pipeline) + "\n")
file.write("\nRows classified: " + str(len(class_label_predicted)))
file.write("\nAccuracy: {0:.3f}%\n".format(accuracy * 100))
file.writelines('\t'.join(str(j) for j in i) + '\n' for i in conf_mat)
#Output
Rows classified: 23504
Accuracy: 17.925%
0 372 46 88 5 73 0 536 44 317 0 200 127
0 501 29 85 0 136 0 655 9 154 0 172 67
0 97 141 78 1 56 0 336 37 429 0 435 198
0 135 74 416 5 37 0 507 19 323 0 128 164
0 247 72 145 12 64 0 424 21 296 0 304 223
0 190 41 36 0 178 0 984 29 196 0 111 43
0 218 13 71 7 52 0 917 139 177 0 111 103
0 215 30 84 3 71 0 1175 11 55 0 102 62
0 257 55 156 1 13 0 322 184 463 0 197 160
0 188 36 104 2 34 0 313 99 827 0 69 136
0 281 80 111 22 16 0 494 19 261 0 313 211
0 207 66 87 18 58 0 489 23 157 0 464 239
0 113 114 44 6 51 0 389 30 408 0 338 315
As you can see, you can't really know what column is what and the print is also "misaligned" so it's difficult to understand.
Is there a way to print the labels as well?
回答1:
From the doc, it seems that there is no such option to print the rows and column labels of the confusion matrix. However, you can specify the label order using argument labels=...
Example:
from sklearn.metrics import confusion_matrix
y_true = ['yes','yes','yes','no','no','no']
y_pred = ['yes','no','no','no','no','no']
print(confusion_matrix(y_true, y_pred))
# Output:
# [[3 0]
# [2 1]]
print(confusion_matrix(y_true, y_pred, labels=['yes', 'no']))
# Output:
# [[1 2]
# [0 3]]
If you want to print the confusion matrix with labels, you may try pandas
and set the index
and columns
of the DataFrame
.
import pandas as pd
cmtx = pd.DataFrame(
confusion_matrix(y_true, y_pred, labels=['yes', 'no']),
index=['true:yes', 'true:no'],
columns=['pred:yes', 'pred:no']
)
print(cmtx)
# Output:
# pred:yes pred:no
# true:yes 1 2
# true:no 0 3
Or
unique_label = np.unique([y_true, y_pred])
cmtx = pd.DataFrame(
confusion_matrix(y_true, y_pred, labels=unique_label),
index=['true:{:}'.format(x) for x in unique_label],
columns=['pred:{:}'.format(x) for x in unique_label]
)
print(cmtx)
# Output:
# pred:no pred:yes
# true:no 3 0
# true:yes 2 1
回答2:
It is important to ensure that the way you label your confusion matrix rows and columns corresponds exactly to the way sklearn has coded the classes. The true order of the labels can be revealed using the .classes_ attribute of the classifier. You can use the code below to prepare a confusion matrix data frame.
labels = rfc.classes_
conf_df = pd.DataFrame(confusion_matrix(class_label, class_label_predicted, columns=labels, index=labels))
conf_df.index.name = 'True labels'
The second thing to note is that your classifier is not predicting labels well. The number of correctly predicted labels is shown on the main diagonal of the confusion matrix. You have non-zero values accross the matrix and some classes have not been predicted at all - the columns that are all zero. It might be a good idea to run the classifier with its default parameters and then try to optimise them.
回答3:
Since confusion matrix is just a numpy matrix, it does not contain any column information. What you can do is convert your matrix into a dataframe and then print this dataframe.
import pandas as pd
import numpy as np
def cm2df(cm, labels):
df = pd.DataFrame()
# rows
for i, row_label in enumerate(labels):
rowdata={}
# columns
for j, col_label in enumerate(labels):
rowdata[col_label]=cm[i,j]
df = df.append(pd.DataFrame.from_dict({row_label:rowdata}, orient='index'))
return df[labels]
cm = np.arange(9).reshape((3, 3))
df = cm2df(cm, ["a", "b", "c"])
print(df)
Code snippet is from https://gist.github.com/nickynicolson/202fe765c99af49acb20ea9f77b6255e
Output:
a b c
a 0 1 2
b 3 4 5
c 6 7 8
回答4:
It appears your data has 13 different classes, which is why your confusion matrix has 13 rows and columns. Furthermore, your classes aren't labeled in any way, just integers from what I can see.
If this isn't the case, and your training data has actual labels, you can pass a list of unique labels to confusion_matrix
conf_mat = confusion_matrix(class_label, class_label_predicted, df['task'].unique())
回答5:
Another better way of doing this is using crosstab function in pandas.
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True).
or
pd.crosstab(le.inverse_transform(y_true), le.inverse_transform(y_pred),rownames=['True'], colnames=['Predicted'], margins=True)
来源:https://stackoverflow.com/questions/50325786/sci-kit-learn-how-to-print-labels-for-confusion-matrix