Suppose I have some observations, each with an indicated class from 1 to n. These classes do not necessarily occur equally often in the data set.
A more elegant way is to sample within each class:
df.groupby('classes').apply(lambda x: x.sample(sample_size))
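As a minimal sketch of the same idea on made-up data: the snippet below builds a hypothetical frame with unequal classes and draws sample_size rows from each class. It assumes pandas 1.1 or later, where groupby objects have their own sample method, so the lambda is not needed. The column values and random_state are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: three classes with unequal counts (6, 3, 3).
df = pd.DataFrame({
    "classes": [1] * 6 + [2] * 3 + [3] * 3,
    "value": range(12),
})

sample_size = 2  # rows to draw from every class

# Draw sample_size rows from each class without replacement.
# This raises a ValueError if any class has fewer rows than sample_size;
# pass replace=True to allow that case.
per_class = df.groupby("classes").sample(n=sample_size, random_state=0)

print(per_class["classes"].value_counts().sort_index())
```

Every class ends up with the same number of rows, regardless of how common it was in the original data.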
You can make sample_size a function of the group size, so that every row has the same probability of being selected (that is, each class is sampled proportionately):
nrows = len(df)
total_sample_size = 1e4
# len(x) is the number of rows in the group; x.count() would return a per-column Series
df.groupby('classes').\
    apply(lambda x: x.sample(int((len(x)/nrows)*total_sample_size)))
This won't produce exactly total_sample_size rows, because each group's sample size is truncated to an integer, but the sampling will be more proportional than the naive method.
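For comparison, here is a sketch of proportional sampling using the frac argument of the groupby sample method (again assuming pandas 1.1 or later). The toy data and random_state are made up. Each class contributes roughly the same fraction of its rows, so the total is close to, but not guaranteed to equal, total_sample_size.

```python
import pandas as pd

# Hypothetical toy data: class sizes 600, 300 and 100.
df = pd.DataFrame({
    "classes": [1] * 600 + [2] * 300 + [3] * 100,
    "value": range(1000),
})

nrows = len(df)
total_sample_size = 100

# Use the same sampling fraction for every class, so each row is equally
# likely to be picked and the class proportions are roughly preserved.
frac = total_sample_size / nrows
proportional = df.groupby("classes").sample(frac=frac, random_state=0)

# Compare class shares before and after sampling.
print(df["classes"].value_counts(normalize=True).sort_index())
print(proportional["classes"].value_counts(normalize=True).sort_index())
```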