I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas, or do I have to implement it myself?
A short, pythonic solution to balance a pandas DataFrame either by subsampling (uspl=True) or oversampling (uspl=False), balanced by a specified column in that DataFrame that has two or more values.

For uspl=True, this code will take a random sample without replacement, of size equal to the smallest stratum, from all strata. For uspl=False, this code will take a random sample with replacement, of size equal to the largest stratum, from all strata.
import pandas as pd

def balanced_spl_by(df, lblcol, uspl=True):
    # split the DataFrame into one sub-frame per label value
    datas_l = [df[df[lblcol] == l].copy() for l in list(set(df[lblcol].values))]
    lsz = [f.shape[0] for f in datas_l]
    # subsample each stratum (without replacement) to the smallest stratum size,
    # or oversample (with replacement) to the largest, then shuffle the result
    return pd.concat([f.sample(n=(min(lsz) if uspl else max(lsz)), replace=(not uspl)).copy()
                      for f in datas_l], axis=0).sample(frac=1)
This will only work with a Pandas DataFrame, but that seems to be a common application, and restricting it to Pandas DataFrames significantly shortens the code as far as I can tell.
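For illustration, a minimal usage sketch (the column name cls and the toy data below are made up, not from the original answer):

import pandas as pd

df = pd.DataFrame({'cls': ['a'] * 8 + ['b'] * 2, 'val': range(10)})

down = balanced_spl_by(df, 'cls', uspl=True)   # 2 rows per class (smallest stratum)
up   = balanced_spl_by(df, 'cls', uspl=False)  # 8 rows per class (largest stratum)
print(down['cls'].value_counts())
print(up['cls'].value_counts())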
Here is a version of the above code that works for multiclass groups (in my tested case: groups 0, 1, 2, 3, and 4).
import numpy as np

def balanced_sample_maker(X, y, sample_size, random_seed=None):
    """Return a balanced data set by sampling all classes with sample_size.

    The current version was developed on the assumption that the positive
    class is the minority.

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    """
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        np.random.seed(random_seed)

    # find the observation indices of each class level
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx

    # oversample the observations of each label
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():  # .iteritems() is Python 2 only
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    return (X[balanced_copy_idx, :], y[balanced_copy_idx], balanced_copy_idx)
This also returns the indices, so they can be used for other datasets and to keep track of how frequently each data point was used (helpful for training).
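A quick usage sketch (the toy arrays below are made up for illustration):

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.array([0] * 7 + [1] * 3)    # unbalanced labels

X_bal, y_bal, idx = balanced_sample_maker(X, y, sample_size=5, random_seed=42)
print(y_bal)   # 5 samples per class, shuffled
print(idx)     # the original row indices that were drawn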
Here is my first version, which seems to be working fine; feel free to copy it or make suggestions on how it could be more efficient (I have quite long experience with programming in general, but not that long with Python or numpy).

This function creates a single random balanced subsample.

Edit: the subsample size now also samples down minority classes; this should probably be changed.
import numpy as np

def balanced_subsample(x, y, subsample_size=1.0):
    # collect the samples of each class and find the smallest class size
    class_xs = []
    min_elems = None
    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    # take use_elems samples from each class
    xs = []
    ys = []
    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)
        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)
        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)
    return xs, ys
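A minimal usage sketch with made-up arrays:

x = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.array([0] * 7 + [1] * 3)    # unbalanced labels

x_bal, y_bal = balanced_subsample(x, y)
print(y_bal)   # 3 samples per class (the minority class size)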
For anyone trying to make the above work with a Pandas DataFrame, you need to make a couple of changes (a fully assembled sketch follows below):

Replace the np.random.shuffle line with

this_xs = this_xs.reindex(np.random.permutation(this_xs.index))

Replace the np.concatenate lines with

xs = pd.concat(xs)
ys = pd.Series(data=np.concatenate(ys), name='target')
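Putting those two changes together, a sketch of the adapted function might look like this (the name balanced_subsample_df and the assembly are my reading of the changes above, not the answerer's exact code):

import numpy as np
import pandas as pd

def balanced_subsample_df(x, y, subsample_size=1.0):
    # x is a DataFrame, y a Series aligned with x's index
    class_xs = []
    min_elems = None
    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []
    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            # shuffle the DataFrame rows instead of using np.random.shuffle
            this_xs = this_xs.reindex(np.random.permutation(this_xs.index))
        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)
        xs.append(x_)
        ys.append(y_)

    xs = pd.concat(xs)
    ys = pd.Series(data=np.concatenate(ys), name='target')
    return xs, ys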
A version for pandas Series:
import numpy as np

def balanced_subsample(y, size=None):
    subsample = []
    if size is None:
        # default: sample every class down to the smallest class size
        n_smp = y.value_counts().min()
    else:
        # split the requested total size evenly across the classes
        n_smp = int(size / len(y.value_counts().index))
    for label in y.value_counts().index:
        samples = y[y == label].index.values
        index_range = range(samples.shape[0])
        indexes = np.random.choice(index_range, size=n_smp, replace=False)
        subsample += samples[indexes].tolist()
    return subsample
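It returns a list of index labels, so the result can also be used to slice an aligned DataFrame; a small made-up example:

import pandas as pd

y = pd.Series(['a'] * 8 + ['b'] * 2)
idx = balanced_subsample(y)
print(y[idx].value_counts())   # 2 of each class

# the same indices can slice a DataFrame sharing y's index:
# df_balanced = df.loc[idx]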
A slight modification to the top answer by mikkom.
If you want to preserve the ordering of the larger class's data, i.e. you don't want to shuffle, then instead of

if len(this_xs) > use_elems:
    np.random.shuffle(this_xs)

do this:

if len(this_xs) > use_elems:
    # integer division: a float is not a valid slice step in Python 3
    ratio = len(this_xs) // use_elems
    this_xs = this_xs[::ratio]
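A quick illustration of the strided slice (note that it can keep slightly more than use_elems elements when the ratio does not divide evenly):

import numpy as np

this_xs = np.arange(10)   # the larger class, in its original order
use_elems = 3
ratio = len(this_xs) // use_elems
print(this_xs[::ratio])   # [0 3 6 9] -- order preserved, 4 elements here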
My subsampler version; I hope this helps.
import random

def subsample_indices(y, size):
    indices = {}
    target_values = set(y)   # was set(y_train): a bug, y_train is undefined here
    for t in target_values:
        indices[t] = [i for i in range(len(y)) if y[i] == t]
    # cap each class at the requested size or the smallest class count
    min_len = min(size, min([len(indices[t]) for t in indices]))
    for t in indices:
        if len(indices[t]) > min_len:
            indices[t] = random.sample(indices[t], min_len)
    return indices
x = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, -1]
j = subsample_indices(x, 2)
print(j)
print([x[t] for t in j[-1]])   # all -1
print([x[t] for t in j[1]])    # all 1