Question
I was applying .sample with random_state set to a constant, and after using set_index it started selecting different rows. A member that was previously included in the subset dropped out. I'm unsure how seeding selects rows. Does this make sense, or did something go wrong?
Here is what was done:
df.set_index('id',inplace=True, verify_integrity=True)
df_small_F = df.loc[df['gender']=='F'].apply(lambda x: x.sample(n=30000, random_state=47))
df_small_M = df.loc[df['gender']=='M'].apply(lambda x: x.sample(n=30000, random_state=46))
df_small=pd.concat([df_small_F,df_small_M],verify_integrity=True)
When I sort df_small by index and print, it produces different results.
Answer 1:
Applying .sort_index() after reading in the data and before calling .sample() corrected the issue. As long as the data stays the same, this produces the same sample every time.
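A minimal sketch of this fix, using small made-up data. The column names follow the question; the sizes, the shuffle seed, and the helper name sample_by_gender are assumptions for illustration only. The same rows arrive in two different orders, and sorting the index first makes the seeded sample pick the same ids in both cases:

```python
import numpy as np
import pandas as pd

def sample_by_gender(df, n, seed_f=47, seed_m=46):
    """Hypothetical helper: sort by index, then draw a seeded sample per gender."""
    df = df.sort_index()  # fix the row order before sampling
    females = df.loc[df['gender'] == 'F'].sample(n=n, random_state=seed_f)
    males = df.loc[df['gender'] == 'M'].sample(n=n, random_state=seed_m)
    return pd.concat([females, males], verify_integrity=True)

# Made-up data: 1000 ids with alternating gender
data = pd.DataFrame({'id': np.arange(1000),
                     'gender': np.tile(['F', 'M'], 500)})

# The same rows, read in two different orders
order_a = data.set_index('id', verify_integrity=True)
order_b = data.sample(frac=1, random_state=0).set_index('id', verify_integrity=True)

small_a = sample_by_gender(order_a, n=50)
small_b = sample_by_gender(order_b, n=50)

# Both runs select the same ids, in the same order
print(small_a.index.equals(small_b.index))
```

Without the sort_index() call inside the helper, the two runs would generally pick different ids, because the seed fixes row positions rather than index labels.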
Answer 2:
When sampling rows (without weights), the only things that matter are the seed, n, the number of rows, and whether or not you sample with replacement. These generate the .iloc indices to take, regardless of the data.
For rows, the sampling essentially happens as follows:
axis_length = self.shape[0] # DataFrame length
rs = pd.core.common.random_state(random_state)
locs = rs.choice(axis_length, size=n, replace=replace, p=weights) # np.random.choice
return self.take(locs, axis=axis, is_copy=False)
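The excerpt above can be mimicked outside pandas. The following is only a sketch, assuming an integer seed, no weights, and a pandas version whose sampler still draws positions with RandomState.choice as in the excerpt. The frame and its values are made up:

```python
import numpy as np
import pandas as pd

# Made-up frame with a non-default index, so labels and positions differ
df = pd.DataFrame({'value': np.arange(1000) * 10},
                  index=np.arange(1000)[::-1])

seed, n = 47, 5

# What the excerpt does by hand: draw integer positions, then take them
rs = np.random.RandomState(seed)
locs = rs.choice(df.shape[0], size=n, replace=False)
manual = df.take(locs)  # equivalent to df.iloc[locs]

sampled = df.sample(n=n, random_state=seed)

# Assumption: this matches only while pandas samples the way the excerpt shows
print(manual.equals(sampled))
```

Note that the data and the index labels never enter the draw; only df.shape[0], n, replace and the seed do.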
Just to illustrate the point
Sample Data
import pandas as pd
import numpy as np
n = 100000
np.random.seed(123)
df = pd.DataFrame({'id': list(range(n)), 'gender': np.random.choice(['M', 'F'], n)})
df1 = pd.DataFrame({'id': list(range(n)), 'gender': 'M'},  # scalar 'M' is broadcast to every row
                   index=np.random.choice(['foo', 'bar', np.nan], n)).assign(blah=1)
Sampling will always choose row 42083 (integer array index), i.e. df.iloc[42083], for this seed and length:
df.sample(n=1, random_state=123)
# id gender
#42083 42083 M
df1.sample(n=1, random_state=123)
# id gender blah
#foo 42083 M 1
df1.reset_index().shift(10).sample(n=1, random_state=123)
# index id gender blah
#42083 nan 42073.0 M 1.0
Even numpy:
np.random.seed(123)
np.random.choice(df.shape[0], size=1, replace=False)
#array([42083])
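Relating this back to the question: because only positions are drawn, the same seed picks different ids as soon as the rows sit in a different order. A small sketch with made-up data (the pos column is an assumption, added here only to record each row's integer position):

```python
import numpy as np
import pandas as pd

n = 1000
original = pd.DataFrame({'id': np.arange(n)})
reordered = original.iloc[::-1].reset_index(drop=True)  # same rows, reversed

# Record each row's integer position so the draw can be inspected
original['pos'] = np.arange(n)
reordered['pos'] = np.arange(n)

a = original.sample(n=10, random_state=47)
b = reordered.sample(n=10, random_state=47)

# The same positions are drawn from both frames ...
print((a['pos'].to_numpy() == b['pos'].to_numpy()).all())  # True
# ... but none of them holds the same id, because the rows were reversed
print((a['id'].to_numpy() == b['id'].to_numpy()).any())    # False
```

This is why fixing the row order (for example with sort_index()) before sampling restores reproducibility.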
Source: https://stackoverflow.com/questions/55360354/random-seed-chose-different-rows