I have a pandas dataframe and I wish to divide it into 3 separate sets. I know that using train_test_split from sklearn.cross_validation, one can divide the data.
It is very convenient to use train_test_split, because it needs no reindexing after dividing the data into several sets and no additional code. The best answer above does not mention that calling train_test_split twice without adjusting the partition sizes won't give the initially intended partition:
x_train, x_remain = train_test_split(x, test_size=(val_size + test_size))
The proportions of the validation and test sets within x_remain then change, and can be computed as:
new_test_size = np.around(test_size / (val_size + test_size), 2)
# To preserve (new_test_size + new_val_size) = 1.0
new_val_size = 1.0 - new_test_size
x_val, x_test = train_test_split(x_remain, test_size=new_test_size)
In this case all of the initially intended partition sizes are preserved.
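For reference, here is a minimal end-to-end sketch of the two-step split described above. The 100-row dataframe, the 0.2/0.2 fractions, and the random_state values are assumptions made up for illustration. The import uses sklearn.model_selection, which is where train_test_split lives in current scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data and target fractions, chosen only for illustration.
x = pd.DataFrame({"feature": np.arange(100)})
val_size, test_size = 0.2, 0.2  # the remaining 0.6 goes to training

# First split: carve off validation + test together.
x_train, x_remain = train_test_split(
    x, test_size=(val_size + test_size), random_state=0
)

# Second split: rescale the test fraction so it is relative to x_remain,
# not to the full dataset.
new_test_size = np.around(test_size / (val_size + test_size), 2)
x_val, x_test = train_test_split(
    x_remain, test_size=new_test_size, random_state=0
)

print(len(x_train), len(x_val), len(x_test))
```

With these numbers the sets should come out as 60, 20 and 20 rows, and each piece keeps its original index labels. Passing the unscaled test_size=0.2 to the second call would instead leave only 8 of the 40 remaining rows in the test set.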