What is a good way to split a NumPy array randomly into training and testing/validation datasets? Something similar to the cvpartition or crossvalind functions in MATLAB.
I wrote a function for my own project to do this (it doesn't use NumPy, though):

def partition(seq, chunks):
    """Splits the sequence into equally sized chunks and returns them as a list."""
    result = []
    for i in range(chunks):
        chunk = []
        # take every `chunks`-th element, starting at offset i
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result
If you want the chunks to be randomized, just shuffle the list before passing it in.
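For example, a minimal usage sketch with the function above (the data and the number of chunks are arbitrary):

import random

data = list(range(10))
random.shuffle(data)            # randomize the order in place first
folds = partition(data, 3)      # three roughly equal-sized chunks
print(folds)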
Thanks pberkes for your answer. I just modified it to avoid (1) sampling with replacement and (2) duplicated instances occurring in both the training and test sets:
import numpy as np

# X is your dataset; sample 80% of the row indices without replacement
training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
# equivalently: training_idx = np.random.permutation(np.arange(X.shape[0]))[:int(np.round(X.shape[0] * 0.8))]
test_idx = np.setdiff1d(np.arange(0, X.shape[0]), training_idx)
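A self-contained sketch of the same idea on dummy data (the 100x5 array is just for illustration):

import numpy as np

X = np.random.rand(100, 5)
training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
test_idx = np.setdiff1d(np.arange(X.shape[0]), training_idx)   # guaranteed disjoint from training_idx
training, test = X[training_idx, :], X[test_idx, :]
print(training.shape, test.shape)                              # (80, 5) (20, 5)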
Split into train, test, and validation sets:
import numpy as np
x = np.expand_dims(np.arange(100), -1)
print(x)
indices = np.random.permutation(x.shape[0])
# 90% training, 5% test, 5% validation
training_idx = indices[:int(x.shape[0] * .9)]
test_idx = indices[int(x.shape[0] * .9):int(x.shape[0] * .95)]
val_idx = indices[int(x.shape[0] * .95):]
training, test, val = x[training_idx, :], x[test_idx, :], x[val_idx, :]
print(training, test, val)
If you want to split the data set once into two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices:
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)                   # shuffles the rows of x in place
training, test = x[:80, :], x[80:, :]     # first 80 rows for training, last 20 for testing
or
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])        # a random ordering of the row indices
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx, :], x[test_idx, :]
There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset with replacement (i.e. bootstrapping):
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
training_idx = numpy.random.randint(x.shape[0], size=80)   # indices drawn with replacement
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx, :], x[test_idx, :]
Finally, sklearn contains several cross validation methods (k-fold, leave-n-out, ...). It also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that there is the same proportion of positive and negative examples in the training and test set.
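For example, here is a minimal sketch using scikit-learn's train_test_split; the stratify argument keeps the class proportions the same in both splits (X and y below are dummy data):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)            # binary labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(X_train.shape, X_test.shape)               # (80, 5) (20, 5)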
Likely you will not only need to split into train and test, but will also need cross validation to make sure your model generalizes. Here I am assuming 70% training data, 20% validation, and 10% holdout/test data.
Check out np.split:
If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in
ary[:2] ary[2:3] ary[3:]
# df is a pandas DataFrame holding your data
t, v, h = np.split(df.sample(frac=1, random_state=1), [int(0.7*len(df)), int(0.9*len(df))])
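The same three-way split works on a plain NumPy array without pandas; a quick sketch on dummy data:

import numpy as np

x = np.random.rand(100, 5)
np.random.shuffle(x)                                   # shuffle the rows in place
train, val, test = np.split(x, [int(0.7 * len(x)), int(0.9 * len(x))])
print(train.shape, val.shape, test.shape)              # (70, 5) (20, 5) (10, 5)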
I'm aware that my solution is not the best, but it comes in handy when you want to split data in a simplistic way, especially when teaching data science to newbies!
def simple_split(descriptors, targets):
    # every 4th row (i % 4 == 0) goes to test, the next (i % 4 == 1) to validation,
    # and the remaining half (i % 4 >= 2) to training
    testX_indices = [i for i in range(descriptors.shape[0]) if i % 4 == 0]
    validX_indices = [i for i in range(descriptors.shape[0]) if i % 4 == 1]
    trainX_indices = [i for i in range(descriptors.shape[0]) if i % 4 >= 2]

    TrainX = descriptors[trainX_indices, :]
    ValidX = descriptors[validX_indices, :]
    TestX = descriptors[testX_indices, :]

    TrainY = targets[trainX_indices]
    ValidY = targets[validX_indices]
    TestY = targets[testX_indices]

    return TrainX, ValidX, TestX, TrainY, ValidY, TestY
According to this code, the data will be split into three parts: 1/4 for the test set, another 1/4 for the validation set, and the remaining 2/4 for the training set.
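A quick usage sketch on dummy data (the shapes are arbitrary):

import numpy as np

descriptors = np.random.rand(100, 5)
targets = np.random.randint(0, 2, size=100)
TrainX, ValidX, TestX, TrainY, ValidY, TestY = simple_split(descriptors, targets)
print(TrainX.shape, ValidX.shape, TestX.shape)   # (50, 5) (25, 5) (25, 5)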