Random Seed Chose Different Rows

Posted by 笑着哭i on 2019-12-13 23:57:21

Question


I was calling .sample with random_state set to a constant, and after I added a set_index call it started selecting different rows. A member that was previously included in the subset dropped out. I'm unsure how the seed determines which rows are selected. Is this expected, or did something go wrong?

Here is what was done:

df.set_index('id',inplace=True, verify_integrity=True)

df_small_F = df.loc[df['gender']=='F'].apply(lambda x: x.sample(n=30000, random_state=47))

df_small_M = df.loc[df['gender']=='M'].apply(lambda x: x.sample(n=30000, random_state=46))

df_small=pd.concat([df_small_F,df_small_M],verify_integrity=True)

When I sort df_small by index and print it, the result differs from what I got before.


Answer 1:


Calling .sort_index() after reading in the data and before calling .sample() fixed the issue. As long as the data stays the same, this produces the same sample every time.
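A minimal sketch of that fix, assuming the rows can arrive in a different order between runs (the data, the helper name sample_by_gender, and the sizes are made up for illustration and are not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: unique ids plus a gender column.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "id": np.arange(1000),
    "gender": rng.choice(["M", "F"], 1000),
}).set_index("id", verify_integrity=True)

# The same rows in a different order, e.g. because the source file or query
# returned them differently on a later run.
df_shuffled = df.sample(frac=1, random_state=1)

def sample_by_gender(frame, gender, n, seed):
    """Sort by index first so the positions drawn for a seed map to the same ids."""
    ordered = frame.sort_index()
    subset = ordered.loc[ordered["gender"] == gender]
    return subset.sample(n=n, random_state=seed)

a = sample_by_gender(df, "F", n=50, seed=47)
b = sample_by_gender(df_shuffled, "F", n=50, seed=47)

# Both inputs hold the same rows, so after sorting the same ids are drawn.
same_after_sort = a.index.equals(b.index)
print(same_after_sort)
```

Sorting only helps when the set of rows is unchanged. If rows are added or removed, the length of the subset changes and the seed will pick different positions.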




Answer 2:


When sampling rows without weights, the only things that matter are the seed, the length of the DataFrame, n (the number of rows to draw), and whether you sample with replacement. From these, pandas generates the integer positions (.iloc indices) to take, regardless of the data.

For rows, sampling inside pandas boils down to:

axis_length = self.shape[0]  # DataFrame length

rs = pd.core.common.random_state(random_state)  
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)  # np.random.RandomState.choice
return self.take(locs, axis=axis, is_copy=False)

Just to illustrate the point

Sample Data

import pandas as pd
import numpy as np

n = 100000
np.random.seed(123)
df = pd.DataFrame({'id': list(range(n)), 'gender': np.random.choice(['M', 'F'], n)})
# Same length as df, but with a non-unique string index ('foo', 'bar', 'nan')
df1 = pd.DataFrame({'id': list(range(n)), 'gender': ['M'] * n},
                   index=np.random.choice(['foo', 'bar', np.nan], n)).assign(blah=1)

Sampling will always choose the row at integer position 42083, i.e. df.iloc[42083], for this seed and length:

df.sample(n=1, random_state=123)
#          id gender
#42083  42083      M

df1.sample(n=1, random_state=123)
#        id gender  blah
#foo  42083      M     1

df1.reset_index().shift(10).sample(n=1, random_state=123)
#      index       id gender  blah
#42083   nan  42073.0      M   1.0

NumPy on its own picks the same position:

np.random.seed(123)
np.random.choice(df.shape[0], size=1, replace=False)
#array([42083])
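Applied to the question, this suggests that the same rows stored in a different order give back different ids for the same seed. The sketch below uses made-up data (the frames original and reordered are hypothetical) and a default RangeIndex, so the index labels of each sample double as the integer positions that were drawn:

```python
import numpy as np
import pandas as pd

n = 1000
original = pd.DataFrame({"id": np.arange(n)})
# Same ids, reversed row order, with a fresh 0..n-1 index.
reordered = original.iloc[::-1].reset_index(drop=True)

s1 = original.sample(n=5, random_state=47)
s2 = reordered.sample(n=5, random_state=47)

# Same seed and same length: the drawn positions are identical.
same_positions = s1.index.equals(s2.index)

# The ids found at those positions follow the row order of each frame.
ids_original = s1["id"].to_numpy()
ids_reordered = s2["id"].to_numpy()

print(same_positions)
print(ids_original)
print(ids_reordered)
```

Here each position p holds id p in original and id n - 1 - p in reordered, so the two samples share their positions but generally not their ids.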


Source: https://stackoverflow.com/questions/55360354/random-seed-chose-different-rows
