How to convert an RDD object to a DataFrame in Spark

慢半拍i 2020-11-22 14:59

How can I convert an RDD (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]) to a DataFrame (org.apache.spark.sql.DataFrame)? I converted a datafram

11 answers
  •  情歌与酒
    2020-11-22 15:24

    This code works on Spark 2.x with Scala 2.11.

    Import the necessary classes:

    import org.apache.spark.rdd.RDD // needed for the RDD[Row] type annotation in Method 3
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
    

    Create a SparkSession object, here named spark:

    val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
    val sc = spark.sparkContext // Only used to create the test RDDs
    

    Let's create an RDD to convert into a DataFrame:

    val rdd = sc.parallelize(
      Seq(
        ("first", Array(2.0, 1.0, 2.1, 5.4)),
        ("test", Array(1.5, 0.5, 0.9, 3.7)),
        ("choose", Array(8.0, 2.9, 9.1, 2.5))
      )
    )
    

    Method 1

    Using SparkSession.createDataFrame(RDD obj).

    val dfWithoutSchema = spark.createDataFrame(rdd)
    
    dfWithoutSchema.show()
    +------+--------------------+
    |    _1|                  _2|
    +------+--------------------+
    | first|[2.0, 1.0, 2.1, 5.4]|
    |  test|[1.5, 0.5, 0.9, 3.7]|
    |choose|[8.0, 2.9, 9.1, 2.5]|
    +------+--------------------+
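
    The _1 and _2 column names come from the tuple's field names. As a minimal sketch, assuming an RDD of case class instances instead of tuples (Measurement and its fields are hypothetical names chosen for illustration), the same createDataFrame call picks up the case class field names as column names:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class used only for this illustration
case class Measurement(id: String, vals: Seq[Double])

val spark = SparkSession.builder.master("local").appName("case-class-sketch").getOrCreate()

val measurements = spark.sparkContext.parallelize(
  Seq(
    Measurement("first", Seq(2.0, 1.0, 2.1, 5.4)),
    Measurement("test", Seq(1.5, 0.5, 0.9, 3.7))
  )
)

// Column names are derived from the case class fields by reflection
val dfFromCaseClass = spark.createDataFrame(measurements)
dfFromCaseClass.printSchema()
```

    This is written in the same script/REPL style as the rest of the answer; in a compiled application the case class would need to be defined at the top level, outside any method.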
    

    Method 2

    Using SparkSession.createDataFrame(RDD obj), then naming the columns with toDF.

    val dfWithSchema = spark.createDataFrame(rdd).toDF("id", "vals")
    
    dfWithSchema.show()
    +------+--------------------+
    |    id|                vals|
    +------+--------------------+
    | first|[2.0, 1.0, 2.1, 5.4]|
    |  test|[1.5, 0.5, 0.9, 3.7]|
    |choose|[8.0, 2.9, 9.1, 2.5]|
    +------+--------------------+
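
    A shorter variant of the same idea, sketched here under the assumption that the RDD holds tuples (or case classes): importing the implicits from the SparkSession instance adds a toDF method directly to the RDD, so the createDataFrame call can be skipped. The variable names below are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local").appName("toDF-sketch").getOrCreate()
// The implicits are imported from the SparkSession instance (a val), not from the class
import spark.implicits._

val tupleRdd = spark.sparkContext.parallelize(
  Seq(
    ("first", Array(2.0, 1.0, 2.1, 5.4)),
    ("test", Array(1.5, 0.5, 0.9, 3.7)),
    ("choose", Array(8.0, 2.9, 9.1, 2.5))
  )
)

// toDF is provided by the implicit conversion; the arguments are the column names
val dfViaImplicits = tupleRdd.toDF("id", "vals")
dfViaImplicits.show()
```

    Note that this route does not apply to an RDD[Row], which carries no static type information; for that case an explicit schema is needed, as in Method 3.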
    

    Method 3 (Actual answer to the question)

    This approach requires the input RDD to be of type RDD[Row].

    val rowsRdd: RDD[Row] = sc.parallelize(
      Seq(
        Row("first", 2.0, 7.0),
        Row("second", 3.5, 2.5),
        Row("third", 7.0, 5.9)
      )
    )
    

    Create the schema:

    val schema = new StructType()
      .add(StructField("id", StringType, true))
      .add(StructField("val1", DoubleType, true))
      .add(StructField("val2", DoubleType, true))
    

    Now pass both rowsRdd and schema to createDataFrame():

    val df = spark.createDataFrame(rowsRdd, schema)
    
    df.show()
    +------+----+----+
    |    id|val1|val2|
    +------+----+----+
    | first| 2.0| 7.0|
    |second| 3.5| 2.5|
    | third| 7.0| 5.9|
    +------+----+----+
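
    If the RDD[Row] was itself obtained from an existing DataFrame, the schema does not have to be rebuilt by hand. The following is a sketch under that assumption (the names original, filteredRows and restored are illustrative): it reuses the schema of the source DataFrame, provided the row-level transformation leaves the row structure unchanged.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

val spark = SparkSession.builder.master("local").appName("roundtrip-sketch").getOrCreate()
import spark.implicits._

// A small source DataFrame standing in for whatever produced the RDD[Row]
val original: DataFrame =
  Seq(("first", 2.0, 7.0), ("second", 3.5, 2.5), ("third", 7.0, 5.9)).toDF("id", "val1", "val2")

// DataFrame -> RDD[Row], followed by an RDD-level operation that keeps the row layout
val filteredRows: RDD[Row] = original.rdd.filter(row => row.getDouble(1) > 3.0)

// RDD[Row] -> DataFrame, reusing the schema of the source DataFrame
val restored: DataFrame = spark.createDataFrame(filteredRows, original.schema)
restored.show()
```

    If the transformation adds, removes, or retypes fields, the reused schema no longer matches the rows and a new StructType has to be built as shown above.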
    
