Scikit-learn balanced subsampling

终归单人心 2020-12-02 10:34

I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it myself?

13 Answers
  • 2020-12-02 10:57

    A short, pythonic solution to balance a pandas DataFrame either by subsampling (uspl=True) or oversampling (uspl=False), balanced by a specified column in that dataframe that has two or more values.

    For uspl=True, this code will take a random sample without replacement of size equal to the smallest stratum from all strata. For uspl=False, this code will take a random sample with replacement of size equal to the largest stratum from all strata.

    import pandas as pd

    def balanced_spl_by(df, lblcol, uspl=True):
        # One sub-frame per distinct label value
        datas_l = [df[df[lblcol] == l] for l in df[lblcol].unique()]
        lsz = [f.shape[0] for f in datas_l]
        n = min(lsz) if uspl else max(lsz)  # smallest stratum when undersampling, largest otherwise
        return pd.concat([f.sample(n=n, replace=not uspl) for f in datas_l], axis=0).sample(frac=1)
    

    This will only work with a Pandas DataFrame, but that seems to be a common application, and restricting it to Pandas DataFrames significantly shortens the code as far as I can tell.
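    The same balancing can probably also be sketched with `DataFrameGroupBy.sample`, which assumes pandas 1.1 or newer. The name `balance_by_group`, the `seed` parameter, and the toy frame are hypothetical and not part of the answer above.

```python
import pandas as pd

def balance_by_group(df, lblcol, uspl=True, seed=None):
    """Hypothetical helper: balance df on lblcol via groupby sampling."""
    counts = df[lblcol].value_counts()
    # Smallest stratum for undersampling, largest for oversampling
    n = int(counts.min() if uspl else counts.max())
    balanced = df.groupby(lblcol).sample(n=n, replace=not uspl, random_state=seed)
    # Shuffle the rows so the classes are interleaved
    return balanced.sample(frac=1, random_state=seed)

# Toy data (assumed): 7 rows of class 0 and 3 rows of class 1
df = pd.DataFrame({"feature": range(10), "label": [0] * 7 + [1] * 3})
under = balance_by_group(df, "label", uspl=True, seed=0)
over = balance_by_group(df, "label", uspl=False, seed=0)
print(under["label"].value_counts().to_dict())
print(over["label"].value_counts().to_dict())
```

    With undersampling, every class ends up with as many rows as the rarest class; with oversampling, every class is resampled with replacement up to the size of the most common one.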

  • 2020-12-02 10:58

    Here is a version of the above code that works for multiclass groups (in my tested case group 0, 1, 2, 3, 4)

    import numpy as np
    def balanced_sample_maker(X, y, sample_size, random_seed=None):
        """ return a balanced data set by sampling all classes with sample_size 
            current version is developed on assumption that the positive
            class is the minority.
    
        Parameters:
        ===========
        X: {numpy.ndarray}
        y: {numpy.ndarray}
        """
        uniq_levels = np.unique(y)
        uniq_counts = {level: sum(y == level) for level in uniq_levels}
    
        if random_seed is not None:
            np.random.seed(random_seed)
    
        # find observation index of each class levels
        groupby_levels = {}
        for ii, level in enumerate(uniq_levels):
            obs_idx = [idx for idx, val in enumerate(y) if val == level]
            groupby_levels[level] = obs_idx
        # oversampling on observations of each label
        balanced_copy_idx = []
        for gb_level, gb_idx in groupby_levels.items():
            over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
            balanced_copy_idx+=over_sample_idx
        np.random.shuffle(balanced_copy_idx)
    
        return (X[balanced_copy_idx, :], y[balanced_copy_idx], balanced_copy_idx)
    

    This also returns the indices so they can be used for other datasets and to keep track of how frequently each data set was used (helpful for training)
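    To illustrate the bookkeeping mentioned above, here is a minimal sketch that tallies how often each row index is drawn across several balanced samples. The helper `balanced_indices`, the toy labels, and the use of `np.random.default_rng` are assumptions for illustration; it is a simplified stand-in for the function above, not the same code.

```python
import numpy as np

def balanced_indices(y, sample_size, rng):
    """Hypothetical helper: draw sample_size indices per class, with replacement."""
    idx = []
    for level in np.unique(y):
        class_idx = np.flatnonzero(y == level)
        idx.extend(rng.choice(class_idx, size=sample_size, replace=True).tolist())
    return idx

# Toy labels (assumed): six observations of class 0, two of class 1
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
rng = np.random.default_rng(0)

# Tally how many times each observation is used over 5 balanced draws
usage = np.zeros(len(y), dtype=int)
for _ in range(5):
    drawn = balanced_indices(y, sample_size=4, rng=rng)
    usage += np.bincount(drawn, minlength=len(y))

print(usage)
```

    Since each class contributes the same number of draws, the two minority rows are reused far more often than any single majority row, which is worth keeping in mind when training on oversampled data.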

  • 2020-12-02 11:00

    Here is my first version, which seems to be working fine. Feel free to copy it or suggest how it could be made more efficient (I have long experience with programming in general, but not much with Python or NumPy).

    This function creates a single random balanced subsample.

    Edit: the subsample size now samples down the minority classes; this should probably be changed.

    import numpy as np

    def balanced_subsample(x, y, subsample_size=1.0):
    
        class_xs = []
        min_elems = None
    
        for yi in np.unique(y):
            elems = x[(y == yi)]
            class_xs.append((yi, elems))
            if min_elems is None or elems.shape[0] < min_elems:
                min_elems = elems.shape[0]
    
        use_elems = min_elems
        if subsample_size < 1:
            use_elems = int(min_elems*subsample_size)
    
        xs = []
        ys = []
    
        for ci,this_xs in class_xs:
            if len(this_xs) > use_elems:
                np.random.shuffle(this_xs)
    
            x_ = this_xs[:use_elems]
            y_ = np.empty(use_elems)
            y_.fill(ci)
    
            xs.append(x_)
            ys.append(y_)
    
        xs = np.concatenate(xs)
        ys = np.concatenate(ys)
    
        return xs,ys
    

    For anyone trying to make the above work with a Pandas DataFrame, you need to make a couple of changes:

    1. Replace the np.random.shuffle line with

      this_xs = this_xs.reindex(np.random.permutation(this_xs.index))

    2. Replace the np.concatenate lines with

      xs = pd.concat(xs)
      ys = pd.Series(data=np.concatenate(ys), name='target')
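    The question asks for N subsamples, while the function above returns one. A minimal sketch of repeating the draw, assuming NumPy arrays as input; the generator name `balanced_subsamples`, its `seed` parameter, and the toy data are hypothetical:

```python
import numpy as np

def balanced_subsamples(x, y, n_subsamples, seed=None):
    """Hypothetical generator: yield n_subsamples balanced (x, y) pairs."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = int(counts.min())  # undersample every class to the rarest one
    for _ in range(n_subsamples):
        picked = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
            for c in classes
        ])
        rng.shuffle(picked)  # mix the classes within the subsample
        yield x[picked], y[picked]

# Toy data (assumed): 10 rows, 7 of class 0 and 3 of class 1
x = np.arange(20).reshape(10, 2)
y = np.array([0] * 7 + [1] * 3)
samples = list(balanced_subsamples(x, y, n_subsamples=3, seed=42))
for xs, ys in samples:
    print(xs.shape, np.bincount(ys))
```

    Each subsample is drawn independently, so the same majority-class rows can appear in more than one of them.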

  • 2020-12-02 11:04

    A version for pandas Series:

    import numpy as np
    
    def balanced_subsample(y, size=None):
    
        subsample = []
    
        if size is None:
            n_smp = y.value_counts().min()
        else:
            n_smp = int(size / len(y.value_counts().index))
    
        for label in y.value_counts().index:
            samples = y[y == label].index.values
            index_range = range(samples.shape[0])
            indexes = np.random.choice(index_range, size=n_smp, replace=False)
            subsample += samples[indexes].tolist()
    
        return subsample
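    Because the function returns index labels rather than rows, they can be passed to `.loc` to pull a balanced subset out of the parent frame. A sketch, assuming a hypothetical DataFrame with a `target` column and a unique index; the function is repeated in condensed form only so that the example runs on its own:

```python
import numpy as np
import pandas as pd

def balanced_subsample(y, size=None):
    # Condensed copy of the function above
    counts = y.value_counts()
    n_smp = int(counts.min()) if size is None else int(size / len(counts.index))
    subsample = []
    for label in counts.index:
        samples = y[y == label].index.values
        subsample += np.random.choice(samples, size=n_smp, replace=False).tolist()
    return subsample

# Toy frame (assumed): 7 rows of class "a" and 3 rows of class "b"
df = pd.DataFrame(
    {"feature": np.arange(10.0), "target": list("aaaaaaabbb")},
    index=[f"row{i}" for i in range(10)],
)
labels = balanced_subsample(df["target"])
balanced = df.loc[labels]
print(balanced["target"].value_counts().to_dict())
```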
    
  • 2020-12-02 11:04

    A slight modification to the top answer by mikkom.

    Use this if you want to preserve the ordering of the larger class's data, i.e. you don't want to shuffle.

    Instead of

        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)
    

    do this

            if len(this_xs) > use_elems:
                # Integer stride; the later this_xs[:use_elems] slice trims any surplus
                ratio = len(this_xs) // use_elems
                this_xs = this_xs[::ratio]
    
  • 2020-12-02 11:04

    My subsampler version, hope this helps

    import random

    def subsample_indices(y, size):
        # Map each target value to the positions where it occurs
        indices = {}
        target_values = set(y)
        for t in target_values:
            indices[t] = [i for i in range(len(y)) if y[i] == t]
        # Cap every class at the requested size or the smallest class, whichever is lower
        min_len = min(size, min(len(indices[t]) for t in indices))
        for t in indices:
            if len(indices[t]) > min_len:
                indices[t] = random.sample(indices[t], min_len)
        return indices

    x = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, -1]
    j = subsample_indices(x, 2)
    print(j)
    print([x[t] for t in j[-1]])
    print([x[t] for t in j[1]])
    