Spark 1.1 added the stratified sampling routines sampleByKey
and sampleByKeyExact
to Spark Core, so since then they have been available without any MLlib dependency.
These two functions are defined in PairRDDFunctions
and therefore require a key-value RDD[(K, V)]
. DataFrames, however, do not have keys, so you have to drop down to the underlying RDD - something like this:
val df = ... // your dataframe
val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
val sample = df.rdd.keyBy(x=>x(0)).sampleByKey(false, fractions)
Note that sample
is now an RDD of (key, Row) pairs, not a DataFrame, but you can easily convert it back to a DataFrame since you already have the schema defined for df
.
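A minimal end-to-end sketch of the whole round trip, assuming Spark 2.x+ (SparkSession) and made-up column names and fractions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("stratified-sample").getOrCreate()
import spark.implicits._

// Hypothetical example data: a "key" column to stratify on, plus a value.
val df = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)).toDF("key", "value")

// Desired fraction per key; Map[Any, Double] because keyBy(row => row(0))
// produces keys of type Any.
val fractions: Map[Any, Double] = Map("a" -> 0.5, "b" -> 0.5)

// Key by the first column, sample per key, then drop the keys again.
val sampledRows = df.rdd
  .keyBy(row => row(0))
  .sampleByKey(withReplacement = false, fractions)
  .values // back to RDD[Row]

// Rebuild a DataFrame by reusing the original schema.
val sampledDf = spark.createDataFrame(sampledRows, df.schema)
sampledDf.show()
```

Use sampleByKeyExact in place of sampleByKey if you need the fractions honored exactly; it makes extra passes over the data, so it costs more.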