How to select an exact number of random rows from a DataFrame

清酒与你 2021-01-25 02:53

How can I select an exact number of random rows from a DataFrame efficiently? The data contains an index column that can be used. If I have to use maximum size,

1 answer
  •  傲寒 (original poster)
    2021-01-25 03:30

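    One possible approach is to get the number of rows with .count(), then use sample() from Python's random module to draw the required number of distinct values from the range of index values. Finally, use the resulting list, vals, to filter the DataFrame on its index column. This assumes the index column holds the consecutive integers 1 to N with no gaps; if some values are missing, fewer rows than requested may come back.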

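    import random

    def sampler(df, col, records):
        """Return `records` random rows, assuming `col` holds the integers 1..N without gaps."""

        # Number of rows in the DataFrame
        n_rows = df.count()

        # Draw `records` distinct values from 1..n_rows; range() excludes its
        # upper bound, so add 1 to keep the last row eligible
        vals = random.sample(range(1, n_rows + 1), records)

        # Keep only the rows whose index value was drawn
        return df.filter(df[col].isin(vals))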
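
    The sampling step can be tried on its own, without Spark. This is a minimal sketch, assuming the index values are the consecutive integers 1 to N; pick_index_values and its seed parameter are hypothetical names used for illustration, not part of the answer's code:

```python
import random

def pick_index_values(n_rows, k, seed=None):
    """Return k distinct values drawn uniformly from 1..n_rows inclusive."""
    if k > n_rows:
        raise ValueError("cannot sample more rows than exist")
    # seed is a hypothetical convenience to make a draw reproducible
    rng = random.Random(seed)
    # range() excludes its upper bound, hence n_rows + 1
    return rng.sample(range(1, n_rows + 1), k)

vals = pick_index_values(10, 3, seed=0)
print(len(vals), len(set(vals)))  # both counts equal 3: exactly k distinct values
```

    Because sample() draws without replacement, the list always holds exactly k distinct values. That is what makes the row count exact, as long as every drawn value matches exactly one row.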
    

    Example (the selected rows differ between runs):

    df = sc.parallelize([(1,1),(2,1),
                         (3,1),(4,0),
                         (5,0),(6,1),
                         (7,1),(8,0),
                         (9,0),(10,1)]).toDF(["a","b"])
    
    sampler(df,"a",3).show()
    +---+---+
    |  a|  b|
    +---+---+
    |  3|  1|
    |  4|  0|
    |  6|  1|
    +---+---+
    
