Spark 1.1 added the stratified sampling routines sampleByKey
and sampleByKeyExact
to Spark Core, so since then they have been available without any MLlib dependency.
These two functions are defined in PairRDDFunctions
and therefore require a key-value RDD[(K, V)]
. DataFrames, however, do not have keys, so you have to drop down to the underlying RDD - something like this:
val df = ... // your dataframe
val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
val sample = df.rdd.keyBy(x=>x(0)).sampleByKey(false, fractions)
Note that sample
is now an RDD of (key, Row) pairs, not a DataFrame, but you can easily convert it back to a DataFrame since you already have the schema defined for df
.
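A minimal end-to-end sketch of the whole round trip, assuming Spark 2.x+ (SparkSession) and made-up column names and fractions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("stratified-sample").getOrCreate()
import spark.implicits._

// Hypothetical example data: a "key" column to stratify on, plus a value.
val df = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)).toDF("key", "value")

// Desired fraction per key; Map[Any, Double] because keyBy(row => row(0))
// produces keys of type Any.
val fractions: Map[Any, Double] = Map("a" -> 0.5, "b" -> 0.5)

// Key by the first column, sample per key, then drop the keys again.
val sampledRows = df.rdd
  .keyBy(row => row(0))
  .sampleByKey(withReplacement = false, fractions)
  .values // back to RDD[Row]

// Rebuild a DataFrame by reusing the original schema.
val sampledDf = spark.createDataFrame(sampledRows, df.schema)
sampledDf.show()
```

Use sampleByKeyExact in place of sampleByKey if you need the fractions honored exactly; it makes extra passes over the data, so it costs more.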