Parallelizing independent actions on the same DataFrame in Spark

伪装坚强ぢ 2020-12-21 23:03

Let's say I have a Spark DataFrame with the following schema:

root
 |-- prob: Double
 |-- word: String

I'd like to randoml

1 answer
  • 2020-12-21 23:55

    1) You can use one of these DataFrame methods:

    • randomSplit(weights: Array[Double], seed: Long)
    • randomSplitAsList(weights: Array[Double], seed: Long) or
    • sample(withReplacement: Boolean, fraction: Double)

    and then take the first two rows of the result.
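
A minimal sketch of option 1, assuming a local SparkSession and a small made-up DataFrame with the prob/word schema from the question. The object name RandomRowsSketch, the helper names viaSample and viaRandomSplit, the sample data, the 0.5/0.5 split weights, and the seed are all hypothetical choices for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RandomRowsSketch {
  // sample() keeps each row with probability `fraction`, so the result size is
  // only approximate; limit(n) caps it, but fewer than n rows may come back.
  def viaSample(df: DataFrame, n: Int, fraction: Double, seed: Long): DataFrame =
    df.sample(false, fraction, seed).limit(n)

  // randomSplit() returns one DataFrame per weight; keep the first piece
  // and cap it at n rows. The 0.5/0.5 weights are an arbitrary assumption.
  def viaRandomSplit(df: DataFrame, n: Int, seed: Long): DataFrame =
    df.randomSplit(Array(0.5, 0.5), seed)(0).limit(n)

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("random-rows-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up data matching the schema in the question.
    val df = Seq((0.1, "a"), (0.2, "b"), (0.3, "c"), (0.4, "d")).toDF("prob", "word")

    viaSample(df, 2, 0.5, 42L).show()
    viaRandomSplit(df, 2, 42L).show()

    spark.stop()
  }
}
```

Because both methods select rows probabilistically, a small fraction or weight on a small DataFrame can yield fewer rows than requested; raising the fraction makes that less likely at the cost of scanning more data.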

    2) Shuffle the rows and take the first n of them (n = 2 for two rows).

    import org.apache.spark.sql.functions.rand
    dataset.orderBy(rand()).limit(n)
    

    3) Or you can use the takeSample method of the RDD and then convert the result back to a DataFrame:

    def takeSample(
          withReplacement: Boolean,
          num: Int,
          seed: Long = Utils.random.nextLong): Array[T]
    

    For example:

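    val rows = dataframe.rdd.takeSample(true, 1000) // Array[Row], which has no toDF()
    spark.createDataFrame(spark.sparkContext.parallelize(rows), dataframe.schema) // assumes a SparkSession named spark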
    