Replicate Spark Row N-times

南旧 2020-12-09 07:02

I want to duplicate a Row in a DataFrame, how can I do that?

For example, I have a DataFrame consisting of 1 Row, and I want to make a DataFrame with 100 identical Rows.

3 Answers
  • 2020-12-09 07:05

    You could use a flatMap, or a for-comprehension, as described here.

    I encourage you to use Datasets whenever you can, but if that's not possible, the last example in the link works with DataFrames as well:

    val df = Seq(
      (0, "Lorem ipsum dolor", 1.0, List("prp1", "prp2", "prp3"))
    ).toDF("id", "text", "value", "properties")
    
    val df2 = for {
      row <- df
      p <- row.getAs[Seq[String]]("properties")
    } yield (row.getAs[Int]("id"), row.getAs[String]("text"), row.getAs[Double]("value"), p)
    

    Also keep in mind that explode is deprecated, see here.
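    If the goal is literally to replicate a row N times (rather than exploding a collection column), the same flatMap idea can be sketched as follows. This is a sketch, not from the answer: the column names and the local-mode session are illustrative assumptions.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    // illustrative single-row DataFrame
    val df = Seq((0, "Lorem ipsum dolor", 1.0)).toDF("id", "text", "value")

    // flatMap each row into 100 identical copies via Seq.fill
    val replicated = df.as[(Int, String, Double)]
      .flatMap(row => Seq.fill(100)(row))
      .toDF("id", "text", "value")
    ```

    The typed `.as[...]` view gives the flatMap an encoder, which is why the Dataset route is the natural fit here.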

  • 2020-12-09 07:07

    You could pick out the single row, build a list of a hundred copies of it, and convert that back into a DataFrame.

    import org.apache.spark.sql.DataFrame
    
    val testDf = sc.parallelize(Seq(
        (1,2,3), (4,5,6)
    )).toDF("one", "two", "three")
    
    def replicateDf(n: Int, df: DataFrame) = sqlContext.createDataFrame(
        sc.parallelize(List.fill(n)(df.head)), // n copies of the first row
        df.schema)
    
    val replicatedDf = replicateDf(100, testDf)
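    The same helper can be written against the newer SparkSession entry point instead of the old `sqlContext`/`sc` pair. A sketch, assuming a session named `spark` as in a modern spark-shell:

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}

    // sketch: SparkSession-based variant of replicateDf
    def replicateDf(spark: SparkSession, n: Int, df: DataFrame): DataFrame =
      spark.createDataFrame(
        spark.sparkContext.parallelize(List.fill(n)(df.head)), // n copies of the first row
        df.schema)
    ```

    Passing `df.schema` through unchanged is what keeps the replicated DataFrame's column names and types identical to the source.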
    
  • 2020-12-09 07:08

    You can add a column containing an Array literal of size 100, then use explode to turn each of its elements into its own row, and finally drop the "dummy" column:

    import org.apache.spark.sql.functions._
    
    val result = singleRowDF
      .withColumn("dummy", explode(array((1 to 100).map(lit): _*))) // "1 to 100" gives 100 elements; "1 until 100" would give only 99
      .selectExpr(singleRowDF.columns: _*)
    
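    On Spark 2.4+, the hand-built array of literals can be replaced with `array_repeat`, which produces the 100-element array column in one call. A sketch under that version assumption, reusing the `singleRowDF` name from above:

    ```scala
    import org.apache.spark.sql.functions._

    // array_repeat(lit(1), 100) builds a 100-element array column;
    // explode then emits one row per element
    val result = singleRowDF
      .withColumn("dummy", explode(array_repeat(lit(1), 100)))
      .drop("dummy")
    ```

    `drop("dummy")` plays the same role as the `selectExpr` in the original: it removes the helper column so only the original schema remains.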