Randomly shuffle column in Spark RDD or dataframe

隐瞒了意图╮ 2020-12-31 16:19

Is there any way I can shuffle a column of an RDD or dataframe such that the entries in that column appear in random order? I'm not sure which APIs I could use to accomplish this.

4 answers
  •  孤城傲影
    2020-12-31 16:39

    What about selecting the column to shuffle, ordering it randomly with orderBy(rand()), and joining it back to the existing dataframe by row index?

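    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.functions.{col, rand}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}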
    
    def addIndex(df: DataFrame) = spark.createDataFrame(
      // Add index
      df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
      // Create schema
      StructType(df.schema.fields :+ StructField("_index", LongType, false))
    )
    
    case class Entry(name: String, salary: Double)
    
    val r1 = Entry("Max", 2001.21)
    val r2 = Entry("Zhang", 3111.32)
    val r3 = Entry("Bob", 1919.21)
    val r4 = Entry("Paul", 3001.5)
    
    val df = addIndex(spark.createDataFrame(Seq(r1, r2, r3, r4)))
    val df_shuffled = addIndex(df
      .select(col("salary").as("salary_shuffled"))
      .orderBy(rand()))
    
    df.join(df_shuffled, Seq("_index"))
      .drop("_index")
      .show(false) 
    
    +-----+-------+---------------+
    |name |salary |salary_shuffled|
    +-----+-------+---------------+
    |Max  |2001.21|3001.5         |
    |Zhang|3111.32|3111.32        |
    |Paul |3001.5 |2001.21        |
    |Bob  |1919.21|1919.21        |
    +-----+-------+---------------+
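
    The same idea can probably be written without dropping to the RDD API. The following is only a sketch, not part of the original answer: shuffleColumn is a hypothetical helper name, and the temporary column names _order and _index are assumptions chosen for illustration. It numbers the original rows with row_number, numbers the values of the target column in random order, and joins the two on that number.

    ```scala
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, monotonically_increasing_id, rand, row_number}

    // Hypothetical helper (not a Spark API): returns `df` with the values of
    // `column` randomly permuted across rows. The shuffled column ends up last.
    def shuffleColumn(df: DataFrame, column: String): DataFrame = {
      // Number the original rows 1..n, following their current order.
      val indexed = df
        .withColumn("_order", monotonically_increasing_id())
        .withColumn("_index", row_number().over(Window.orderBy(col("_order"))))
        .drop("_order", column)

      // Number the values of the target column 1..n in random order.
      val shuffled = df
        .select(col(column))
        .withColumn("_index", row_number().over(Window.orderBy(rand())))

      // Pair each original row with the value that received the same number.
      indexed.join(shuffled, Seq("_index")).drop("_index")
    }
    ```

    With the df from the answer above, shuffleColumn(df.drop("_index"), "salary").show(false) should print the same names with the salaries permuted. Note that a window without partitionBy moves all rows into a single partition, so this variant is only reasonable for data that fits comfortably on one executor.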
    