Question
Here is the config of my model:
"model": {
"loss": "categorical_crossentropy",
"optimizer": "adam",
"layers": [
{
"type": "lstm",
"neurons": 180,
"input_timesteps": 15,
"input_dim": 103,
"return_seq": true,
"activation": "relu"
},
{
"type": "dropout",
"rate": 0.1
},
{
"type": "lstm",
"neurons": 100,
"activation": "relu",
"return_seq": false
},
{
"type": "dropout",
"rate": 0.1
},
{
"type": "dense",
"neurons": 30,
"activation": "relu"
},
{
"type": "dense",
"neurons": 3,
"activation": "softmax"
}
]
}
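For reference, here is a minimal sketch (an assumption on my part, not the asker's actual code) of how this layer list could map onto a Keras Sequential model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

# Build the stack described by the JSON config above
model = Sequential([
    # First LSTM: 180 units, input shape = (timesteps=15, features=103)
    LSTM(180, activation="relu", return_sequences=True, input_shape=(15, 103)),
    Dropout(0.1),
    # Second LSTM collapses the sequence (return_seq: false)
    LSTM(100, activation="relu", return_sequences=False),
    Dropout(0.1),
    Dense(30, activation="relu"),
    # 3 output classes -> softmax with categorical cross-entropy
    Dense(3, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])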
Once I finished training the model, I decided to compare what the confusion matrix looks like depending on whether or not I shuffle the dataset and the labels.
I shuffled with the line
from sklearn.utils import shuffle
X, label = shuffle(X, label, random_state=0)
Be aware that X and label are the test features and test labels, so this is not related to the training sets.
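For context, the two reports below could be produced with something like the following sketch (assuming a trained model and one-hot encoded labels; the evaluate helper is my own name, not the asker's):

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.utils import shuffle

def evaluate(X_test, y_onehot):
    # Convert one-hot labels and softmax outputs to class indices
    y_true = np.argmax(y_onehot, axis=1)
    y_pred = np.argmax(model.predict(X_test), axis=1)
    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred))

# Report without shuffling the test set
evaluate(X, label)

# Report with shuffling (same samples, different order)
X_s, label_s = shuffle(X, label, random_state=0)
evaluate(X_s, label_s)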
Confusion matrix with a shuffling phase
Confusion Matrix
[[16062 1676 3594]
[ 1760 4466 1482]
[ 3120 1158 13456]]
Classification Report
             precision    recall  f1-score   support

   class -1       0.77      0.75      0.76     21332
    class 0       0.61      0.58      0.60      7708
    class 1       0.73      0.76      0.74     17734

avg / total       0.73      0.73      0.73     46774
Confusion matrix without a shuffling phase
Confusion Matrix
[[12357 2936 6039]
[ 1479 4301 1927]
[ 3316 1924 12495]]
Classification Report
             precision    recall  f1-score   support

   class -1       0.72      0.58      0.64     21332
    class 0       0.47      0.56      0.51      7707
    class 1       0.61      0.70      0.65     17735

avg / total       0.64      0.62      0.62     46774
As you can see, the precision values in the two reports are significantly different. What can explain the gap between them?
Answer 1:
Data shuffling never hurts performance, and it very often helps, the reason being that it breaks possible biases during data preparation - e.g. putting all the cat images first and then the dog ones in a cat/dog classification dataset.
Take for example the famous iris dataset:
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
y
# result:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
As you can clearly see, the dataset has been prepared in such a way that the first 50 samples all have label 0, the next 50 label 1, and the last 50 label 2. Try to perform a 5-fold cross-validation on such a dataset without shuffling and you'll find that most of your folds contain only a single label; try a 3-fold CV, and every fold will include only one label. Bad... BTW, it's not just a theoretical possibility, it has actually happened.
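To see this concretely, here is a short sketch that prints which labels end up in each test fold of an unshuffled 3-fold split on iris:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# 3 folds without shuffling: each test fold is one contiguous block of 50
# samples, i.e. exactly one class
for i, (train_idx, test_idx) in enumerate(KFold(n_splits=3, shuffle=False).split(X)):
    print("fold", i, "test labels:", np.unique(y[test_idx]))
# fold 0 test labels: [0]
# fold 1 test labels: [1]
# fold 2 test labels: [2]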
Since it's very difficult to know beforehand whether such a bias exists in our dataset, we always shuffle (as said, it never hurts), just to be on the safe side; that's why shuffling is a standard procedure in all machine learning pipelines.
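In scikit-learn, for example, shuffling (and stratification) is built into the standard splitting utilities; a quick sketch, reusing the iris X and y loaded above:

from sklearn.model_selection import train_test_split, StratifiedKFold

# train_test_split shuffles by default (shuffle=True); stratify=y keeps the
# class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# For cross-validation, StratifiedKFold with shuffle=True both shuffles the
# data and keeps each fold's class balance close to the overall distribution
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)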
So, even if the situation here obviously depends on the details of your data (which we don't know), this behavior is not at all surprising - on the contrary, it is totally expected.
Answer 2:
The support counts for class 0 and class 1 differ by one between the two reports (7708 vs. 7707 for class 0, and 17734 vs. 17735 for class 1), even though both were supposedly computed on the same test set.
You need to make sure there is no mistake in matching the data to the class labels.
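One quick sanity check (a sketch; label_shuffled is an assumed name for the shuffled copy of the one-hot test labels) is to compare the class counts that go into each report, since reordering alone cannot change them:

import numpy as np

# Both runs should report identical supports per class, because shuffling
# only reorders the same samples
print(np.unique(np.argmax(label, axis=1), return_counts=True))
print(np.unique(np.argmax(label_shuffled, axis=1), return_counts=True))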
Source: https://stackoverflow.com/questions/54373752/confusion-matrix-shuffle-vs-non-shuffle