Random row selection in Pandas dataframe

深忆病人 2020-11-28 02:52

Is there a way to select random rows from a DataFrame in Pandas?

In R, using the car package, there is a useful function some(x, n) which is similar to h

6 answers
  • 2020-11-28 03:07

    With pandas version 0.16.1 and up, there is now a DataFrame.sample method built-in:

    import numpy as np
    import pandas
    
    # pandas.np was removed in pandas 2.0, so NumPy is imported directly
    df = pandas.DataFrame(np.random.random(100))
    
    # Randomly sample 70% of your dataframe
    df_percent = df.sample(frac=0.7)
    
    # Randomly sample 7 elements from your dataframe
    df_elements = df.sample(n=7)
    

    For either approach above, you can get the rest of the rows by doing:

    df_rest = df.loc[~df.index.isin(df_percent.index)]
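
    As a minimal sketch of the split described above (the column name `value` and the seed are illustrative assumptions, not part of the original answer), `DataFrame.drop` on the sampled index should give the same complement as the `isin` mask, provided the index labels are unique:

```python
import numpy as np
import pandas as pd

# Illustrative data: 100 rows with a default RangeIndex (unique labels).
df = pd.DataFrame({"value": np.random.random(100)})

# Draw 70% of the rows without replacement; random_state only makes the run repeatable.
df_percent = df.sample(frac=0.7, random_state=0)

# Complement via a boolean mask on the index ...
df_rest_mask = df.loc[~df.index.isin(df_percent.index)]
# ... or by dropping the sampled labels (assumes unique index labels).
df_rest_drop = df.drop(df_percent.index)
```

    With a non-unique index, both forms would also exclude every row that shares a label with a sampled row, so the split is only exact when labels are unique.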
    
  • 2020-11-28 03:11

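    Actually, this will give you repeated indices: np.random.random_integers(0, len(df), N), where N is a large number. Note that random_integers includes the upper bound, so len(df) itself can be drawn, and it has been deprecated since NumPy 1.11; np.random.randint(0, len(df), N) draws only valid positions.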
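
    Drawing integers independently is sampling with replacement, which is why repeats appear. A minimal sketch, assuming a small illustrative frame: `np.random.randint` is used because its upper bound is exclusive (a substitution, not the answerer's call), and `DataFrame.sample(replace=True)` is shown as the built-in counterpart:

```python
import numpy as np
import pandas as pd

# Illustrative frame with only 5 rows.
df = pd.DataFrame({"value": range(5)})
N = 20  # more draws than rows, so repeats are unavoidable

# Independent integer draws in [0, len(df)): positions may repeat.
positions = np.random.randint(0, len(df), N)
with_repeats = df.iloc[positions]

# pandas counterpart: sample rows with replacement.
also_repeats = df.sample(n=N, replace=True)
```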

  • 2020-11-28 03:12

    sample

    As of v0.20.0, you can use pd.DataFrame.sample, which returns a random sample of either a fixed number of rows or a fraction of the rows:

    df = df.sample(n=k)     # k rows
    df = df.sample(frac=k)  # round(len(df.index) * k) rows
    

    For reproducibility, you can specify an integer random_state, which seeds the sampling much as np.random.seed would. So, instead of calling, for example, np.random.seed(0), you can write:

    df = df.sample(n=k, random_state=0)
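
    A small sketch of what this buys you, assuming an illustrative 100-row frame with a hypothetical `value` column: two calls with the same integer `random_state` should return the same rows.

```python
import pandas as pd

# Illustrative frame; the column name is an assumption for the example.
df = pd.DataFrame({"value": range(100)})

# The same integer random_state yields the same sample on every call.
first = df.sample(n=5, random_state=0)
second = df.sample(n=5, random_state=0)

print(first.equals(second))
```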
    
  • 2020-11-28 03:19

    Something like this?

    import random
    
    def some(x, n):
        # random.sample needs a sequence, so convert the index to a list first
        return x.loc[random.sample(list(x.index), n)]
    

    Note: ix was deprecated in Pandas v0.20.0 and removed in v1.0, so loc is used here for label-based indexing.

  • 2020-11-28 03:20

    The line below randomly selects n rows from the DataFrame df, without replacement:

    df = df.take(np.random.permutation(len(df))[:n])
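
    The same idea as a self-contained sketch. It assumes a 10-row illustrative frame and swaps in NumPy's `default_rng` generator, seeded only for repeatability, in place of the global `np.random` state:

```python
import numpy as np
import pandas as pd

# Illustrative frame; the column name is an assumption for the example.
df = pd.DataFrame({"value": range(10)})
n = 3

# Seeded only so the sketch is repeatable.
rng = np.random.default_rng(0)

# A permutation contains each position exactly once, so a slice has no repeats.
shuffled_positions = rng.permutation(len(df))

# take() selects rows by position, not by label.
picked = df.take(shuffled_positions[:n])
```

    Because the first n entries of a permutation are distinct, no row can be selected twice.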

  • 2020-11-28 03:26

    One way to do this is with the sample function from Python's random module:

    import numpy as np
    import pandas as pd
    from random import sample
    
    # given data frame df
    
    # create random index
    rindex = np.array(sample(range(len(df)), 10))
    
    # get 10 random rows from df (rindex holds positions, so use iloc)
    dfr = df.iloc[rindex]
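
    Whether the random values are positions or labels decides which indexer applies. A minimal sketch, assuming an illustrative frame whose index labels deliberately differ from row positions:

```python
import random
import pandas as pd

# Illustrative frame: labels (10, 20, ...) differ from positions (0, 1, ...).
df = pd.DataFrame({"value": list("abcde")}, index=[10, 20, 30, 40, 50])

# Positions drawn from range(len(df)) must go through iloc.
positions = random.sample(range(len(df)), 3)
by_position = df.iloc[positions]

# Labels drawn from the index itself must go through loc.
labels = random.sample(list(df.index), 3)
by_label = df.loc[labels]
```

    With a default RangeIndex the two coincide, which is why position-based code can appear to work with label-based indexing until the index changes.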
    