Question
Here is the config of my model:
"model": {
"loss": "categorical_crossentropy",
"optimizer": "adam",
"layers": [
{
"type": "lstm",
"neurons": 180,
"input_timesteps": 15,
"input_dim": 103,
"return_seq": true,
"activation": "relu"
},
{
"type": "dropout",
"rate": 0.1
},
{
"type": "lstm",
"neurons": 100,
"activation": "relu",
"return_seq": false
},
{
"type": "dropout",
"rate": 0.1
},
{
"type": "dense",
"neurons": 30,
"activation": "relu"
},
{
"type": "dense",
"neurons": 3,
"activation": "softmax"
}
]
}
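For reference, here is a minimal sketch (an assumption on my part, not the asker's actual code) of how this layer list could map onto a Keras Sequential model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

# Build the stack described by the JSON config above
model = Sequential([
    # First LSTM: 180 units, input shape = (timesteps=15, features=103)
    LSTM(180, activation="relu", return_sequences=True, input_shape=(15, 103)),
    Dropout(0.1),
    # Second LSTM collapses the sequence (return_seq: false)
    LSTM(100, activation="relu", return_sequences=False),
    Dropout(0.1),
    Dense(30, activation="relu"),
    # 3 output classes -> softmax with categorical cross-entropy
    Dense(3, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])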
Once I finished training the model, I decided to compare what the confusion matrix looks like depending on whether or not I shuffle the dataset and the labels.
I shuffled with the line
from sklearn.utils import shuffle
X, label = shuffle(X, label, random_state=0)
Be aware that X and label are the test features and test labels, so this is not related to the training sets.
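For context, the two reports below could be produced with something like the following sketch (assuming a trained model and one-hot encoded labels; the evaluate helper is my own name, not the asker's):

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.utils import shuffle

def evaluate(X_test, y_onehot):
    # Convert one-hot labels and softmax outputs to class indices
    y_true = np.argmax(y_onehot, axis=1)
    y_pred = np.argmax(model.predict(X_test), axis=1)
    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred))

# Report without shuffling the test set
evaluate(X, label)

# Report with shuffling (same samples, different order)
X_s, label_s = shuffle(X, label, random_state=0)
evaluate(X_s, label_s)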
Confusion matrix with a shuffling phase
Confusion Matrix
[[16062 1676 3594]
[ 1760 4466 1482]
[ 3120 1158 13456]]
Classification Report
             precision    recall  f1-score   support

   class -1       0.77      0.75      0.76     21332
    class 0       0.61      0.58      0.60      7708
    class 1       0.73      0.76      0.74     17734

avg / total       0.73      0.73      0.73     46774
Confusion matrix without a shuffling phase
Confusion Matrix
[[12357 2936 6039]
[ 1479 4301 1927]
[ 3316 1924 12495]]
Classification Report
             precision    recall  f1-score   support

   class -1       0.72      0.58      0.64     21332
    class 0       0.47      0.56      0.51      7707
    class 1       0.61      0.70      0.65     17735

avg / total       0.64      0.62      0.62     46774
As you can see, the precision values in the two reports are significantly different. What can explain the gap between them?
Answer 1:
Data shuffling never hurts performance, and it very often helps, the reason being that it breaks possible biases during data preparation - e.g. putting all the cat images first and then the dog ones in a cat/dog classification dataset.
Take for example the famous iris dataset:
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
y
# result:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
As you can clearly see, the dataset has been prepared in such a way that the first 50 samples all have label 0, the next 50 label 1, and the last 50 label 2. Try to perform a 5-fold cross-validation on such a dataset without shuffling and you'll find that most of your folds contain only a single label; try a 3-fold CV, and every fold will include only one label. Bad... BTW, it's not just a theoretical possibility, it has actually happened.
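To see this concretely, here is a short sketch that prints which labels end up in each test fold of an unshuffled 3-fold split on iris:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# 3 folds without shuffling: each test fold is one contiguous block of 50
# samples, i.e. exactly one class
for i, (train_idx, test_idx) in enumerate(KFold(n_splits=3, shuffle=False).split(X)):
    print("fold", i, "test labels:", np.unique(y[test_idx]))
# fold 0 test labels: [0]
# fold 1 test labels: [1]
# fold 2 test labels: [2]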
Since it's very difficult to know beforehand whether such a bias exists in our dataset, we always shuffle (as said, it never hurts), just to be on the safe side; that's why shuffling is a standard procedure in all machine learning pipelines.
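In scikit-learn, for example, shuffling (and stratification) is built into the standard splitting utilities; a quick sketch, reusing the iris X and y loaded above:

from sklearn.model_selection import train_test_split, StratifiedKFold

# train_test_split shuffles by default (shuffle=True); stratify=y keeps the
# class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# For cross-validation, StratifiedKFold with shuffle=True both shuffles the
# data and keeps each fold's class balance close to the overall distribution
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)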
So, even if the situation here obviously depends on the details of your data (which we don't know), this behavior is not at all surprising - on the contrary, it is totally expected.
Answer 2:
The support counts for class 0 and class 1 differ by one between the two reports (7708 vs. 7707 for class 0, and 17734 vs. 17735 for class 1), even though both were supposedly computed on the same test set.
You need to make sure there is no mistake in matching the data to the class labels.
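One quick sanity check (a sketch; label_shuffled is an assumed name for the shuffled copy of the one-hot test labels) is to compare the class counts that go into each report, since reordering alone cannot change them:

import numpy as np

# Both runs should report identical supports per class, because shuffling
# only reorders the same samples
print(np.unique(np.argmax(label, axis=1), return_counts=True))
print(np.unique(np.argmax(label_shuffled, axis=1), return_counts=True))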
Source: https://stackoverflow.com/questions/54373752/confusion-matrix-shuffle-vs-non-shuffle