DataFrame sample in Apache Spark | Scala

北海茫月 2020-12-05 07:20

I'm trying to take samples from two dataframes where I need the ratio of the counts maintained, e.g.

df1.count() = 10
df2.count() = 1000

noOfSamples = 10
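
Roughly, a minimal sketch of the kind of sampling I mean, using fraction-based DataFrame.sample() in Scala (df1, df2, and the seed value are illustrative):

val noOfSamples = 10

// DataFrame.sample() takes (withReplacement, fraction, seed) rather than a row
// count, so the fraction is derived from the desired number of samples; the
// resulting counts are only approximate
val sample1 = df1.sample(false, noOfSamples.toDouble / df1.count(), 42L)
val sample2 = df2.sample(false, noOfSamples.toDouble / df2.count(), 42L)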


        
7 Answers
  • 2020-12-05 08:05

    To answer your question: is there any way we can specify the number of rows to be sampled?

    I recently needed to sample a certain number of rows from a Spark data frame, and I followed the process below:

    1. Convert the Spark data frame to an RDD. Example: df_test.rdd

    2. The RDD has a method called takeSample that lets you specify the number of samples you need, along with a seed. Example: df_test.rdd.takeSample(withReplacement, numSamples, seed)

    3. Convert the RDD back to a Spark data frame using sqlContext.createDataFrame().

    The above process combined into a single step:

    The data frame (or population) I needed to sample from, df_grp_1, has around 8,000 records:

    # take 125 rows from df_grp_1 without replacement and rebuild a data frame
    test1 = sqlContext.createDataFrame(df_grp_1.rdd.takeSample(False, 125, seed=115))

    The test1 data frame will contain 125 sampled records.
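
    Since the question is tagged Scala, here is a rough, untested sketch of the same idea in Scala; sqlContext and df_grp_1 are assumed to already exist, as in the PySpark snippet above:

    import org.apache.spark.sql.{DataFrame, Row, SQLContext}

    // Sketch only: take an exact number of rows through the RDD API, then
    // rebuild a data frame with the original schema.
    def sampleExact(df: DataFrame, n: Int, seed: Long, sqlContext: SQLContext): DataFrame = {
      // takeSample returns a local Array[Row], so parallelize it again before
      // converting back to a DataFrame
      val rows: Array[Row] = df.rdd.takeSample(false, n, seed)
      sqlContext.createDataFrame(sqlContext.sparkContext.parallelize(rows), df.schema)
    }

    val test1 = sampleExact(df_grp_1, 125, 115L, sqlContext)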
