I'm a relatively new user to sklearn and have run into some unexpected behavior in train_test_split from sklearn.model_selection. I have a pandas dataframe that I would like to
The reason you're getting duplicates is that train_test_split() eventually defines strata as the unique set of values of whatever you passed into the stratify argument. Since the strata are defined from two columns, one row of data may represent more than one stratum, and so sampling may choose the same row twice because it thinks it's sampling from different classes.
The train_test_split() function calls StratifiedShuffleSplit, which uses np.unique() on y (which is what you pass in via stratify). From the source code:
classes, y_indices = np.unique(y, return_inverse=True)
n_classes = classes.shape[0]
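For a self-contained illustration of what np.unique() does with a two-column input (using a made-up array standing in for the stratify values), note that it flattens the array and returns the unique values across both columns, not the unique row combinations:

import numpy as np

# hypothetical two-column stratify input, shaped like df[['b', 'c']].values
y = np.array([["bar", "y"],
              ["foo", "z"],
              ["bar", "z"]])

print(np.unique(y))  # ['bar' 'foo' 'y' 'z'] -- four flat values, not three row pairs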
Here's a simplified sample case, a variation on the example you provided:
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
N = 20
a = np.arange(N)
b = np.random.choice(["foo","bar"], size=N)
c = np.random.choice(["y","z"], size=N)
df = pd.DataFrame({'a':a, 'b':b, 'c':c})
print(df)
a b c
0 0 bar y
1 1 foo y
2 2 bar z
3 3 bar y
4 4 foo z
5 5 bar y
...
The stratification function thinks there are four classes to split on: foo, bar, y, and z. But since these classes are essentially nested, meaning y and z both show up in b == foo and b == bar, we'll get duplicates when the splitter tries to sample from each class.
train, test = train_test_split(df, test_size=0.2, random_state=0,
stratify=df[['b', 'c']])
print(len(train.a.values)) # 16
print(len(set(train.a.values))) # 12
print(train)
a b c
3 3 bar y # selecting a = 3 for b = bar*
5 5 bar y
13 13 foo y
4 4 foo z
14 14 bar z
10 10 foo z
3 3 bar y # selecting a = 3 for c = y
6 6 bar y
16 16 foo y
18 18 bar z
6 6 bar y
8 8 foo y
18 18 bar z
7 7 bar z
4 4 foo z
19 19 bar y
#* We can't be sure which row is selecting for `bar` or `y`,
# I'm just illustrating the idea here.
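If you want to see the repeated rows directly, a quick check on the train frame from the split above is:

print(train[train.duplicated(keep=False)])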
There's a larger design question here: do you want to use nested stratified sampling, or do you actually just want to treat each class in df.b and df.c as a separate class to sample from? If the latter, that's what you're already getting. The former is more complicated, and that's not what train_test_split is set up to do.
You might find this discussion of nested stratified sampling useful.
What version of scikit-learn are you using? You can check with sklearn.__version__.
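For example:

import sklearn
print(sklearn.__version__)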
Prior to version 0.19.0, scikit-learn did not handle 2-dimensional stratification correctly. It was patched in 0.19.0, as described in issue #9044.
Updating your scikit-learn should fix the problem. If you can't update, see this commit history for the fix.
If you want train_test_split to behave as you expected (stratify by multiple columns with no duplicates), create a new column that is a concatenation of the values in your other columns and stratify on the new column.
df['bc'] = df['b'].astype(str) + df['c'].astype(str)
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['bc']])
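As a quick sanity check (using the same df as above), the two counts should now match, since each row belongs to exactly one stratum and is sampled without replacement:

print(len(train.a.values))       # 16
print(len(set(train.a.values)))  # should also be 16 -- no repeated rows this time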
If you're worried about collisions, for example values like 11 and 3 versus 1 and 13 both producing the concatenated value 113, you can add some arbitrary string in the middle:
df['bc'] = df['b'].astype(str) + "_" + df['c'].astype(str)