Random Sample of a subset of a dataframe in Pandas

问题

Say i have a dataframe with 100,000 entries and want to split it into 100 sections of 1000 entries.

How do i take a random sample of say size 50 of just one of the 100 sections. the data set is already ordered such that the first 1000 results are the first section the next section the next and so on.

many thanks

回答1:

You can use the sample method*:

In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])

In [12]: df.sample(2)
Out[12]:
   A  B
0  1  2
2  5  6

In [13]: df.sample(2)
Out[13]:
   A  B
3  7  8
0  1  2

*On one of the section DataFrames.

Note: If you have a larger sample size that the size of the DataFrame this will raise an error unless you sample with replacement.

In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'

In [15]: df.sample(5, replace=True)
Out[15]:
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
1  3  4

回答2:

One solution is to use the choice function from numpy.

Say you want 50 entries out of 100, you can use:

import numpy as np
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed = df.iloc[chosen_idx]

This is of course not considering your block structure. If you want a 50 item sample from block i for example, you can do:

import numpy as np
block_start_idx = 1000 * i
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]

回答3:

This is a nice place for recursion.

def main2():
    rows = 8  # say you have 8 rows, real data will need len(rows) for int
    rands = []
    for i in range(rows):
        gen = fun(rands)
        rands.append(gen)
    print(rands)  # now range through random values


def fun(rands):
    gen = np.random.randint(0, 8)
    if gen in rands:
        a = fun(rands)
        return a
    else: return gen


if __name__ == "__main__":
    main2()

output: [6, 0, 7, 1, 3, 5, 4, 2]

来源：https://stackoverflow.com/questions/38085547/random-sample-of-a-subset-of-a-dataframe-in-pandas

标签

python

pandas

sample

random-sample