How to split data into 3 sets (train, validation and test)?

无人及你 2020-11-22 15:03

I have a pandas dataframe and I wish to divide it into 3 separate sets. I know that using train_test_split from sklearn.cross_validation (now sklearn.model_selection), one can divide the data into two sets (train and test).

7 Answers
  • 2020-11-22 15:35

    It is convenient to use train_test_split because it requires no reindexing after the split and no extra code. However, the best answer above does not mention that splitting twice with train_test_split, without adjusting the partition sizes, will not give the initially intended partitions:

    from sklearn.model_selection import train_test_split
    
    # First split: carve off the training set; val + test stay together.
    x_train, x_remain = train_test_split(x, test_size=(val_size + test_size))
    

    The proportions of the validation and test sets within x_remain then change, and can be computed as:

    import numpy as np
    
    # Recompute the test fraction relative to x_remain.
    new_test_size = np.around(test_size / (val_size + test_size), 2)
    # To preserve (new_test_size + new_val_size) == 1.0
    new_val_size = 1.0 - new_test_size
    
    x_val, x_test = train_test_split(x_remain, test_size=new_test_size)
    

    This way, all of the initially intended partition sizes are preserved.
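
    A minimal end-to-end sketch of this approach (the toy x, val_size, and test_size are my own assumed values, not part of the answer above):

    import numpy as np
    from sklearn.model_selection import train_test_split
    
    x = np.arange(100).reshape(50, 2)   # toy data; any array or dataframe works
    val_size, test_size = 0.2, 0.2      # intended final fractions
    
    x_train, x_remain = train_test_split(x, test_size=(val_size + test_size),
                                         random_state=42)
    # Recompute the test fraction relative to what remains.
    new_test_size = test_size / (val_size + test_size)
    x_val, x_test = train_test_split(x_remain, test_size=new_test_size,
                                     random_state=42)
    
    print(len(x_train), len(x_val), len(x_test))  # 30 10 10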

  • 2020-11-22 15:39

    Note:

    The function below handles seeding of the randomized split so that it is reproducible. You should not rely on a split that does not randomize the sets.

    import numpy as np
    import pandas as pd
    
    def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
        np.random.seed(seed)
        # perm holds the dataframe's index labels in random order, so select
        # rows with .loc; .iloc would be wrong for any dataframe whose index
        # is not the default RangeIndex.
        perm = np.random.permutation(df.index)
        m = len(df.index)
        train_end = int(train_percent * m)
        validate_end = int(validate_percent * m) + train_end
        train = df.loc[perm[:train_end]]
        validate = df.loc[perm[train_end:validate_end]]
        test = df.loc[perm[validate_end:]]
        return train, validate, test
    

    Demonstration

    np.random.seed([3, 1415])
    df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
    
    train, validate, test = train_validate_test_split(df)
    
    # Inspect each split:
    train
    validate
    test

  • 2020-11-22 15:41

    Numpy solution. We will shuffle the whole dataset first (df.sample(frac=1, random_state=42)) and then split our data set into the following parts:

    • 60% - train set,
    • 20% - validation set,
    • 20% - test set

    In [305]: train, validate, test = \
                  np.split(df.sample(frac=1, random_state=42), 
                           [int(.6*len(df)), int(.8*len(df))])
    
    In [306]: train
    Out[306]:
              A         B         C         D         E
    0  0.046919  0.792216  0.206294  0.440346  0.038960
    2  0.301010  0.625697  0.604724  0.936968  0.870064
    1  0.642237  0.690403  0.813658  0.525379  0.396053
    9  0.488484  0.389640  0.599637  0.122919  0.106505
    8  0.842717  0.793315  0.554084  0.100361  0.367465
    7  0.185214  0.603661  0.217677  0.281780  0.938540
    
    In [307]: validate
    Out[307]:
              A         B         C         D         E
    5  0.806176  0.008896  0.362878  0.058903  0.026328
    6  0.145777  0.485765  0.589272  0.806329  0.703479
    
    In [308]: test
    Out[308]:
              A         B         C         D         E
    4  0.521640  0.332210  0.370177  0.859169  0.401087
    3  0.333348  0.964011  0.083498  0.670386  0.169619
    

    [int(.6*len(df)), int(.8*len(df))] is the indices_or_sections argument for numpy.split().

    Here is a small demo of np.split() usage - let's split a 20-element array into parts of 80%, 10%, and 10%:

    In [45]: a = np.arange(1, 21)
    
    In [46]: a
    Out[46]: array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
    
    In [47]: np.split(a, [int(.8 * len(a)), int(.9 * len(a))])
    Out[47]:
    [array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16]),
     array([17, 18]),
     array([19, 20])]
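
    The same idea generalizes to arbitrary fractions by computing the cut points with a cumulative sum. A hypothetical helper sketch (split_by_fractions is my own name, not part of the answer above):

    import numpy as np
    
    def split_by_fractions(df, fracs, random_state=None):
        # Shuffle once, then cut at the cumulative fraction boundaries.
        assert abs(sum(fracs) - 1.0) < 1e-9, 'fractions must sum to 1'
        shuffled = df.sample(frac=1, random_state=random_state)
        cuts = (np.cumsum(fracs)[:-1] * len(df)).astype(int)
        return np.split(shuffled, cuts)
    
    # train, validate, test = split_by_fractions(df, [.6, .2, .2], random_state=42)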
    
  • 2020-11-22 15:45

    Here is a Python function that splits a Pandas dataframe into train, validation, and test dataframes with stratified sampling. It performs this split by calling scikit-learn's function train_test_split() twice.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    
    def split_stratified_into_train_val_test(df_input, stratify_colname='y',
                                             frac_train=0.6, frac_val=0.15, frac_test=0.25,
                                             random_state=None):
        '''
        Splits a Pandas dataframe into three subsets (train, val, and test)
        following fractional ratios provided by the user, where each subset is
        stratified by the values in a specific column (that is, each subset has
        the same relative frequency of the values in the column). It performs this
        splitting by running train_test_split() twice.
    
        Parameters
        ----------
        df_input : Pandas dataframe
            Input dataframe to be split.
        stratify_colname : str
            The name of the column that will be used for stratification. Usually
            this column would be for the label.
        frac_train : float
        frac_val   : float
        frac_test  : float
            The ratios with which the dataframe will be split into train, val, and
            test data. The values should be expressed as float fractions and should
            sum to 1.0.
        random_state : int, None, or a RandomState instance
            Value to be passed to train_test_split().
    
        Returns
        -------
        df_train, df_val, df_test :
            Dataframes containing the three splits.
        '''
    
        # Compare with a tolerance rather than exact equality: float fractions
        # such as 0.6 + 0.3 + 0.1 do not sum to exactly 1.0.
        if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9:
            raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                             (frac_train, frac_val, frac_test))
    
        if stratify_colname not in df_input.columns:
            raise ValueError('%s is not a column in the dataframe' % (stratify_colname))
    
        X = df_input # Contains all columns.
        y = df_input[[stratify_colname]] # Dataframe of just the column on which to stratify.
    
        # Split original dataframe into train and temp dataframes.
        df_train, df_temp, y_train, y_temp = train_test_split(X,
                                                              y,
                                                              stratify=y,
                                                              test_size=(1.0 - frac_train),
                                                              random_state=random_state)
    
        # Split the temp dataframe into val and test dataframes.
        relative_frac_test = frac_test / (frac_val + frac_test)
        df_val, df_test, y_val, y_test = train_test_split(df_temp,
                                                          y_temp,
                                                          stratify=y_temp,
                                                          test_size=relative_frac_test,
                                                          random_state=random_state)
    
        assert len(df_input) == len(df_train) + len(df_val) + len(df_test)
    
        return df_train, df_val, df_test
    

    Below is a complete working example.

    Consider a dataset that has a label upon which you want to perform the stratification. This label has its own distribution in the original dataset, say 75% foo, 15% bar and 10% baz. Now let's split the dataset into train, validation, and test subsets using a 60/20/20 ratio, where each split retains the same distribution of the labels.

    Here is the example dataset:

    df = pd.DataFrame( { 'A': list(range(0, 100)),
                         'B': list(range(100, 0, -1)),
                         'label': ['foo'] * 75 + ['bar'] * 15 + ['baz'] * 10 } )
    
    df.head()
    #    A    B label
    # 0  0  100   foo
    # 1  1   99   foo
    # 2  2   98   foo
    # 3  3   97   foo
    # 4  4   96   foo
    
    df.shape
    # (100, 3)
    
    df.label.value_counts()
    # foo    75
    # bar    15
    # baz    10
    # Name: label, dtype: int64
    

    Now, let's call the split_stratified_into_train_val_test() function from above to get train, validation, and test dataframes following a 60/20/20 ratio.

    df_train, df_val, df_test = \
        split_stratified_into_train_val_test(df, stratify_colname='label', frac_train=0.60, frac_val=0.20, frac_test=0.20)
    

    Together, the three dataframes df_train, df_val, and df_test contain all of the original rows, and their sizes follow the ratio above.

    df_train.shape
    #(60, 3)
    
    df_val.shape
    #(20, 3)
    
    df_test.shape
    #(20, 3)
    

    Further, each of the three splits will have the same distribution of the label, namely 75% foo, 15% bar and 10% baz.

    df_train.label.value_counts()
    # foo    45
    # bar     9
    # baz     6
    # Name: label, dtype: int64
    
    df_val.label.value_counts()
    # foo    15
    # bar     3
    # baz     2
    # Name: label, dtype: int64
    
    df_test.label.value_counts()
    # foo    15
    # bar     3
    # baz     2
    # Name: label, dtype: int64
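
    One caveat: stratified splitting requires every class to have at least 2 members, otherwise train_test_split raises a ValueError. A pre-check along these lines (my own sketch) gives a clearer error message:

    min_count = df['label'].value_counts().min()
    if min_count < 2:
        raise ValueError('rarest label has only %d row(s); '
                         'stratified splitting needs at least 2' % min_count)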
    
  • 2020-11-22 15:45

    In the case of supervised learning, you may want to split both X and y (where X is your input and y the ground-truth output). Just make sure to shuffle X and y the same way before splitting.

    Either X and y are in the same dataframe, in which case we shuffle the dataframe, separate X from y, and apply the split to each (just like in the chosen answer); or X and y are in two different dataframes, in which case we shuffle X, reorder y the same way as the shuffled X, and apply the split to each.

    import numpy as np
    
    # 1st case: df contains both X and y (y being the "target" column of df)
    df_shuffled = df.sample(frac=1)
    X_shuffled = df_shuffled.drop("target", axis=1)
    y_shuffled = df_shuffled["target"]
    
    # 2nd case: X and y are two separate dataframes
    X_shuffled = X.sample(frac=1)
    y_shuffled = y.loc[X_shuffled.index]  # reorder y to match the shuffled X
    
    # Split as in the chosen answer: 60% / 20% / 20%, with the cut points
    # computed from the shuffled data itself.
    X_train, X_validation, X_test = np.split(
        X_shuffled, [int(0.6 * len(X_shuffled)), int(0.8 * len(X_shuffled))])
    y_train, y_validation, y_test = np.split(
        y_shuffled, [int(0.6 * len(y_shuffled)), int(0.8 * len(y_shuffled))])
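
    After splitting, it is worth asserting that features and targets are still aligned; np.split preserves the (shuffled) index on both sides, so the check below should always pass (my own addition):

    for X_part, y_part in [(X_train, y_train),
                           (X_validation, y_validation),
                           (X_test, y_test)]:
        assert (X_part.index == y_part.index).all()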
    
  • 2020-11-22 15:51

    One approach to dividing the dataset into train, test, and cv sets with proportions 0.6, 0.2, 0.2 is to use the train_test_split method twice.

    from sklearn.model_selection import train_test_split
    
    # xtrain holds the features and labels the targets of the full dataset.
    x, x_test, y, y_test = train_test_split(xtrain, labels, test_size=0.2, train_size=0.8)
    x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.25, train_size=0.75)
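
    The second test_size is 0.25 because 25% of the remaining 80% equals 20% of the original data; in general the second fraction is val_frac / (1 - test_frac). A small check of that arithmetic (the variable names are my own):

    test_frac, val_frac = 0.2, 0.2
    second_split = val_frac / (1.0 - test_frac)
    print(second_split)  # 0.25, i.e. 20% of the original data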
    