Let's say I have a Spark DataFrame with the following schema:
root
 |-- prob: Double
 |-- word: String
I'd like to randomly select two rows from this DataFrame.
1) You can use one of these DataFrame methods:
randomSplit(weights: Array[Double], seed: Long)
randomSplitAsList(weights: Array[Double], seed: Long)
sample(withReplacement: Boolean, fraction: Double)
and then take the first two rows.
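A minimal sketch of the sample-then-limit approach, assuming a local SparkSession named `spark` and the column names (`prob`, `word`) and toy values are illustrative placeholders. Note that sample's fraction is approximate, so oversample and then use limit(2) to get exactly two rows:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("random-two-rows")
  .getOrCreate()
import spark.implicits._

// Toy DataFrame matching the schema above (values are made up).
val df = Seq((0.3, "foo"), (0.7, "bar"), (0.1, "baz"), (0.9, "qux"))
  .toDF("prob", "word")

// sample() keeps each row with the given probability, so the result size
// varies; oversample with a generous fraction, then cap it at exactly two.
val twoRows = df.sample(withReplacement = false, fraction = 0.9, seed = 42L)
  .limit(2)

twoRows.show()
```

If you need a fixed seed for reproducibility, pass it to both sample and randomSplit; otherwise Spark picks a random one per run.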
2) Shuffle the rows and take the first two of them:
import org.apache.spark.sql.functions.rand
dataset.orderBy(rand()).limit(n)
3) Or you can use the takeSample method of the underlying RDD and then convert the result back to a DataFrame:
def takeSample(
withReplacement: Boolean,
num: Int,
seed: Long = Utils.random.nextLong): Array[T]
For example (note that takeSample returns an Array[Row], which has no toDF(), so rebuild the DataFrame with createDataFrame and the original schema):
val rows = dataframe.rdd.takeSample(withReplacement = false, num = 2)
val sampled = spark.createDataFrame(spark.sparkContext.parallelize(rows), dataframe.schema)