Say i have a dataframe with 100,000 entries and want to split it into 100 sections of 1000 entries.
How do i take a random sample of say size 50 of just one of the 100 sections. the data set is already ordered such that the first 1000 results are the first section the next section the next and so on.
many thanks
You can use the sample
method*:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])
In [12]: df.sample(2)
Out[12]:
A B
0 1 2
2 5 6
In [13]: df.sample(2)
Out[13]:
A B
3 7 8
0 1 2
*On one of the section DataFrames.
Note: If you have a larger sample size that the size of the DataFrame this will raise an error unless you sample with replacement.
In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'
In [15]: df.sample(5, replace=True)
Out[15]:
A B
0 1 2
1 3 4
2 5 6
3 7 8
1 3 4
One solution is to use the choice
function from numpy.
Say you want 50 entries out of 100, you can use:
import numpy as np
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed = df.iloc[chosen_idx]
This is of course not considering your block structure. If you want a 50 item sample from block i
for example, you can do:
import numpy as np
block_start_idx = 1000 * i
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
This is a nice place for recursion.
def main2():
rows = 8 # say you have 8 rows, real data will need len(rows) for int
rands = []
for i in range(rows):
gen = fun(rands)
rands.append(gen)
print(rands) # now range through random values
def fun(rands):
gen = np.random.randint(0, 8)
if gen in rands:
a = fun(rands)
return a
else: return gen
if __name__ == "__main__":
main2()
output: [6, 0, 7, 1, 3, 5, 4, 2]
来源:https://stackoverflow.com/questions/38085547/random-sample-of-a-subset-of-a-dataframe-in-pandas