I have written a method that must consider a random number to simulate a Bernoulli distribution. I am using random.nextDouble to generate a number between 0 and 1, but the same sequence of random values is repeated on every partition of my data.
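For reference, a minimal sketch of what such a method could look like, assuming a success probability passed in alongside the uniform draw (the name bernoulli and the parameter p are illustrative, not from the original post):
// Hypothetical Bernoulli trial: returns 1 with probability p, 0 otherwise,
// given a uniform value u drawn from [0, 1) such as random.nextDouble()
def bernoulli(p: Double, u: Double): Int =
  if (u < p) 1 else 0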
The reason why the same sequence is repeated is that the random generator is created and initialized with a seed before the data is partitioned. Each partition then starts from the same random seed. Maybe not the most efficient way to do it, but the following should work:
val myClass = new MyClass()
val M = 3
for (m <- 1 to M) {
  val newDF = sqlContext.createDataFrame(
    myDF.map { row =>
      val rand = scala.util.Random  // resolved on the executors, not seeded once in the driver
      RowFactory.create(
        row.getString(0),
        myClass.myMethod(row.getString(2), rand.nextDouble()))
    },
    myDF.schema)
}
Just use the SQL function rand:
import org.apache.spark.sql.functions._
//df: org.apache.spark.sql.DataFrame = [key: int]
df.select($"key", rand() as "rand").show
+---+-------------------+
|key| rand|
+---+-------------------+
| 1| 0.8635073400704648|
| 2| 0.6870153659986652|
| 3|0.18998048357873532|
+---+-------------------+
df.select($"key", rand() as "rand").show
+---+------------------+
|key| rand|
+---+------------------+
| 1|0.3422484248879837|
| 2|0.2301384925817671|
| 3|0.6959421970071372|
+---+------------------+
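Since the goal is a Bernoulli draw, the rand column can feed it directly; a minimal sketch, assuming a success probability p and the example df above (the column name "bernoulli" is made up for illustration). If you need reproducible values across runs, rand also accepts a seed, e.g. rand(91234L), although the values still depend on how the data is partitioned:
import org.apache.spark.sql.functions._

val p = 0.3  // assumed success probability
// 1 with probability p, 0 otherwise, drawn independently per row
df.withColumn("bernoulli", (rand() < lit(p)).cast("int")).show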
Using the Spark Dataset API, perhaps for use in an accumulator:
import org.apache.spark.sql.functions._

// substring(rand(), 3, 4) keeps four digits after the leading "0." of the
// random value's string form, yielding an integer between 0 and 9999
df.withColumn("_n", substring(rand(), 3, 4).cast("bigint"))
According to this post, the best solution is not to put the new scala.util.Random inside the map, nor completely outside (i.e. in the driver code), but in an intermediate mapPartitionsWithIndex:
import scala.util.Random

val myAppSeed = 91234
val newRDD = myRDD.mapPartitionsWithIndex { (indx, iter) =>
  // one generator per partition, each with a distinct, reproducible seed
  val rand = new Random(indx + myAppSeed)
  iter.map(x => (x, Array.fill(10)(rand.nextDouble)))
}
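To tie this back to the question, the same pattern can be applied to the original DataFrame; a sketch assuming the myDF, myClass.myMethod and column positions from the question above:
import org.apache.spark.sql.RowFactory
import scala.util.Random

val newDF = sqlContext.createDataFrame(
  myDF.rdd.mapPartitionsWithIndex { (indx, iter) =>
    val rand = new Random(indx + myAppSeed)  // distinct seed per partition
    iter.map(row => RowFactory.create(
      row.getString(0),
      myClass.myMethod(row.getString(2), rand.nextDouble())))
  },
  myDF.schema)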