How to split/partition a dataset into training and test datasets for, e.g., cross validation?

醉话见心 2020-11-27 10:42

What is a good way to split a NumPy array randomly into training and testing/validation datasets? Something similar to the cvpartition or crossvalind functions in MATLAB.

12 Answers
  • 2020-11-27 11:27

    There is another option that just entails using scikit-learn. As scikit-learn's documentation describes, you can use the following:

    import numpy as np
    from sklearn.model_selection import train_test_split
    
    data, labels = np.arange(10).reshape((5, 2)), range(5)
    
    data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)
    

    This way the labels stay in sync with the data you're splitting into training and test sets.
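
    Note that train_test_split also accepts a stratify argument, which preserves class proportions in both splits; here is a minimal sketch with made-up labels:

    import numpy as np
    from sklearn.model_selection import train_test_split
    
    data = np.arange(20).reshape((10, 2))
    labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # hypothetical binary labels
    
    # stratify=labels keeps the 50/50 class balance in both splits
    data_train, data_test, labels_train, labels_test = train_test_split(
        data, labels, test_size=0.20, stratify=labels, random_state=42)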

  • 2020-11-27 11:28

    Here is code to split the data into n=5 folds in a stratified manner:

    from sklearn.model_selection import StratifiedKFold
    
    # X = data array
    # y = class labels
    skf = StratifiedKFold(n_splits=5)
    for train_index, test_index in skf.split(X, y):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
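
    As a follow-up, the same folds can also drive model evaluation directly; here is a minimal sketch using cross_val_score (the LogisticRegression classifier and the toy data are just illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    
    X = np.random.rand(20, 4)   # toy feature matrix
    y = np.array([0, 1] * 10)   # toy binary labels
    skf = StratifiedKFold(n_splits=5)
    
    # one accuracy score per stratified fold
    scores = cross_val_score(LogisticRegression(), X, y, cv=skf)
    print(scores)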
    
  • 2020-11-27 11:30

    Just a note. In case you want train, test, AND validation sets, you can do this:

    from sklearn.model_selection import train_test_split
    
    X = get_my_X()
    y = get_my_y()
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)
    

    These parameters give 70% of the data to the training set and 15% each to the test and validation sets. Hope this helps.
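
    If you want ratios other than 70/15/15, note that the second test_size is the test share of the held-out portion, not of the whole dataset. A small hedged helper (train_val_test_split is my own name, not a sklearn function):

    from sklearn.model_selection import train_test_split
    
    def train_val_test_split(X, y, train=0.7, val=0.15, test=0.15, seed=None):
        # First carve off the training set, then split the remainder
        # into validation and test according to their relative shares.
        assert abs(train + val + test - 1.0) < 1e-9
        X_train, X_rest, y_train, y_rest = train_test_split(
            X, y, test_size=val + test, random_state=seed)
        X_val, X_test, y_val, y_test = train_test_split(
            X_rest, y_rest, test_size=test / (val + test), random_state=seed)
        return X_train, X_val, X_test, y_train, y_val, y_test

    For 70/15/15 this reproduces the 0.5 second split used above.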

  • 2020-11-27 11:35

    After doing some reading and taking into account the (many..) different ways of splitting the data into train and test sets, I decided to time them with timeit!

    I used 4 different methods (none of them use the sklearn library, which I'm sure would give the best results, given that it is well-designed and well-tested code):

    1. shuffle the whole matrix arr and then split the data into train and test
    2. shuffle the indices and then use them to index X and Y to split the data
    3. same as method 2, but done more efficiently (without a Python loop)
    4. use a pandas dataframe to split

    Judging by the timings below, method 4 was by far the fastest, followed by method 1; methods 2 and 3 turned out to be really inefficient.

    The code for the 4 different methods I timed:

    import numpy as np
    arr = np.random.rand(100, 3)
    X = arr[:, :2]
    Y = arr[:, 2]
    spl = 0.7
    N = len(arr)
    sample = int(spl * N)
    
    #%% Method 1: shuffle the whole matrix arr and then split
    np.random.shuffle(arr)  # X and Y are views of arr, so their rows stay aligned
    x_train, x_test, y_train, y_test = X[:sample], X[sample:], Y[:sample], Y[sample:]
    
    #%% Method 2: shuffle the indices and then apply them to X and Y
    train_idx = np.random.choice(N, sample, replace=False)  # replace=False avoids duplicate rows
    Xtrain = X[train_idx]
    Ytrain = Y[train_idx]
    
    test_idx = [idx for idx in range(N) if idx not in train_idx]
    Xtest = X[test_idx]
    Ytest = Y[test_idx]
    
    #%% Method 3: shuffle indices without a for loop
    idx = np.random.permutation(arr.shape[0])  # can also use random.shuffle
    train_idx, test_idx = idx[:sample], idx[sample:]
    x_train, x_test, y_train, y_test = X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]
    
    #%% Method 4: using a pandas dataframe to split
    import pandas as pd
    df = pd.read_csv(file_path, header=None)  # some csv file (I used one with 3 columns)
    
    train = df.sample(frac=0.7, random_state=200)
    test = df.drop(train.index)
    

    And for the times, the minimum over 3 repetitions of 1000 loops each was:

    • Method 1: 0.35883826200006297 seconds
    • Method 2: 1.7157016959999964 seconds
    • Method 3: 1.7876616719995582 seconds
    • Method 4: 0.07562861499991413 seconds
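
    The timing harness itself isn't shown; here is a minimal sketch of how such numbers can be produced with timeit.repeat, using method 3 as the example (3 repetitions of 1000 loops, as reported above):

    import timeit
    import numpy as np
    
    arr = np.random.rand(100, 3)
    X, Y = arr[:, :2], arr[:, 2]
    sample = int(0.7 * len(arr))
    
    def method3():
        # shuffle indices without a Python loop, then fancy-index
        idx = np.random.permutation(arr.shape[0])
        train_idx, test_idx = idx[:sample], idx[sample:]
        return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]
    
    # minimum of 3 repetitions of 1000 loops each
    print(min(timeit.repeat(method3, repeat=3, number=1000)))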

    I hope that's helpful!

  • 2020-11-27 11:36

    As the sklearn.cross_validation module was deprecated, you can use:

    import numpy as np
    from sklearn.model_selection import train_test_split
    X, y = np.arange(10).reshape((5, 2)), range(5)
    
    X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)
    
  • 2020-11-27 11:36

    You may also consider a stratified division into training and testing sets. Stratified division also generates training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

    import numpy as np
    
    def get_train_test_inds(y, train_proportion=0.7):
        '''Generates indices, making a random stratified split into training and testing sets
        with proportions train_proportion and (1 - train_proportion) of the initial sample.
        y is any iterable indicating the class of each observation in the sample.
        Initial proportions of classes inside the training and
        testing sets are preserved (stratified sampling).
        '''
        y = np.array(y)
        train_inds = np.zeros(len(y), dtype=bool)
        test_inds = np.zeros(len(y), dtype=bool)
        values = np.unique(y)
        for value in values:
            value_inds = np.nonzero(y == value)[0]
            np.random.shuffle(value_inds)
            n = int(train_proportion * len(value_inds))
    
            train_inds[value_inds[:n]] = True
            test_inds[value_inds[n:]] = True
    
        return train_inds, test_inds
    
    y = np.array([1, 1, 2, 2, 3, 3])
    train_inds, test_inds = get_train_test_inds(y, train_proportion=0.5)
    print(y[train_inds])
    print(y[test_inds])
    

    This code outputs:

    [1 2 3]
    [1 2 3]
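
    For comparison, sklearn's train_test_split can produce the same kind of stratified holdout split via its stratify parameter; here is a minimal sketch on the same toy labels (the feature matrix X is made up for illustration):

    import numpy as np
    from sklearn.model_selection import train_test_split
    
    y = np.array([1, 1, 2, 2, 3, 3])
    X = np.arange(12).reshape((6, 2))  # made-up features
    
    # stratify=y preserves class proportions in both halves
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, stratify=y)
    print(y_train)  # one sample per class, e.g. [2 1 3]
    print(y_test)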
    