How to convert an RDD object to a DataFrame in Spark

慢半拍i 2020-11-22 14:59

How can I convert an RDD (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]) to a DataFrame (org.apache.spark.sql.DataFrame)? I converted a DataFrame to an RDD, and after processing it I want to convert it back to a DataFrame.

11 answers
  • 2020-11-22 15:24

    This code works perfectly from Spark 2.x with Scala 2.11

    Import necessary classes

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
    

    Create a SparkSession object; here it's called spark

    val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
    val sc = spark.sparkContext // Just used to create test RDDs
    

    Let's create an RDD to turn into a DataFrame

    val rdd = sc.parallelize(
      Seq(
        ("first", Array(2.0, 1.0, 2.1, 5.4)),
        ("test", Array(1.5, 0.5, 0.9, 3.7)),
        ("choose", Array(8.0, 2.9, 9.1, 2.5))
      )
    )
    

    Method 1

    Using SparkSession.createDataFrame(RDD obj).

    val dfWithoutSchema = spark.createDataFrame(rdd)
    
    dfWithoutSchema.show()
    +------+--------------------+
    |    _1|                  _2|
    +------+--------------------+
    | first|[2.0, 1.0, 2.1, 5.4]|
    |  test|[1.5, 0.5, 0.9, 3.7]|
    |choose|[8.0, 2.9, 9.1, 2.5]|
    +------+--------------------+
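
    To check the schema Spark inferred for the tuple RDD, printSchema can be called on the result; the output is roughly the following (exact nullability flags can vary by Spark version):

    dfWithoutSchema.printSchema()
    root
     |-- _1: string (nullable = true)
     |-- _2: array (nullable = true)
     |    |-- element: double (containsNull = false)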
    

    Method 2

    Using SparkSession.createDataFrame(RDD obj) and specifying column names.

    val dfWithSchema = spark.createDataFrame(rdd).toDF("id", "vals")
    
    dfWithSchema.show()
    +------+--------------------+
    |    id|                vals|
    +------+--------------------+
    | first|[2.0, 1.0, 2.1, 5.4]|
    |  test|[1.5, 0.5, 0.9, 3.7]|
    |choose|[8.0, 2.9, 9.1, 2.5]|
    +------+--------------------+
    

    Method 3 (Actual answer to the question)

    This approach requires the input rdd to be of type RDD[Row].

    val rowsRdd: RDD[Row] = sc.parallelize(
      Seq(
        Row("first", 2.0, 7.0),
        Row("second", 3.5, 2.5),
        Row("third", 7.0, 5.9)
      )
    )
    

    Create the schema

    val schema = new StructType()
      .add(StructField("id", StringType, true))
      .add(StructField("val1", DoubleType, true))
      .add(StructField("val2", DoubleType, true))
    

    Now apply both rowsRdd and schema to createDataFrame()

    val df = spark.createDataFrame(rowsRdd, schema)
    
    df.show()
    +------+----+----+
    |    id|val1|val2|
    +------+----+----+
    | first| 2.0| 7.0|
    |second| 3.5| 2.5|
    | third| 7.0| 5.9|
    +------+----+----+
    
  • 2020-11-22 15:27

    Assuming your RDD[Row] is called rdd, you can use:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    rdd.toDF()
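
    On Spark 2.x the implicits come from the SparkSession rather than a SQLContext. A minimal sketch, assuming a SparkSession named spark is already in scope; note that toDF via implicits needs an element type with an Encoder (tuples or case classes), so for a plain RDD[Row] use createDataFrame with a schema, as shown in the other answers:

    // assumes a SparkSession named `spark` is already available
    import spark.implicits._

    // toDF works for RDDs of tuples or case classes, which have implicit Encoders
    val tupleRdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))
    val df = tupleRdd.toDF("key", "value")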
    
  • 2020-11-22 15:30

    Suppose you have a DataFrame and you want to modify some of its field data by converting it to an RDD[Row].

    val aRdd = aDF.rdd.map(x => Row(x.getAs[Long]("id"), x.getAs[Seq[String]]("role").head))
    

    To convert back from the RDD to a DataFrame, we need to define the structure type (schema) of the RDD.

    If the datatype is Long, it becomes LongType in the structure.

    If String, then StringType.

    import org.apache.spark.sql.types._
    val aStruct = new StructType(Array(StructField("id", LongType, nullable = true), StructField("role", StringType, nullable = true)))
    

    Now you can convert the RDD to a DataFrame using the createDataFrame method.

    val aNamedDF = sqlContext.createDataFrame(aRdd,aStruct)
    
  • 2020-11-22 15:30

    Here is a simple example of converting a List into a Spark RDD and then converting that RDD into a DataFrame.

    Please note that I used the Scala REPL of spark-shell to execute the following code; here sc is an instance of SparkContext, which is implicitly available in spark-shell. Hope it answers your question.

    scala> val numList = List(1,2,3,4,5)
    numList: List[Int] = List(1, 2, 3, 4, 5)
    
    scala> val numRDD = sc.parallelize(numList)
    numRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[80] at parallelize at <console>:28
    
    scala> val numDF = numRDD.toDF
    numDF: org.apache.spark.sql.DataFrame = [_1: int]
    
    scala> numDF.show
    +---+
    | _1|
    +---+
    |  1|
    |  2|
    |  3|
    |  4|
    |  5|
    +---+
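
    To give the column a friendlier name than the default _1, toDF also accepts column names; a small follow-up in the same spark-shell session (the name "number" is just an example):

    scala> val namedDF = numRDD.toDF("number")
    namedDF: org.apache.spark.sql.DataFrame = [number: int]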
    
  • 2020-11-22 15:33
    One needs to create a schema and attach it to the RDD.
    

    Assuming val spark is a product of a SparkSession.builder...

        import org.apache.spark._
        import org.apache.spark.sql._       
        import org.apache.spark.sql.types._
    
        /* Let's gin up some sample data:
         * As RDDs and DataFrames can have columns of differing types, let's make our
         * sample data a three-wide, two-tall rectangle of mixed types:
         * a column of Strings, a column of Longs, and a column of Doubles.
         */
        val arrayOfArrayOfAnys = Array.ofDim[Any](2,3)
        arrayOfArrayOfAnys(0)(0)="aString"
        arrayOfArrayOfAnys(0)(1)=0L
        arrayOfArrayOfAnys(0)(2)=3.14159
        arrayOfArrayOfAnys(1)(0)="bString"
        arrayOfArrayOfAnys(1)(1)=9876543210L
        arrayOfArrayOfAnys(1)(2)=2.71828
    
        /* The way to convert anything that looks rectangular,
         * (Array[Array[String]] or Array[Array[Any]] or Array[Row], ... ) into an RDD is to 
         * throw it into sparkContext.parallelize.
         * http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext shows
         * the parallelize definition as 
         *     def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)
         * so in our case our ArrayOfArrayOfAnys is treated as a sequence of ArraysOfAnys.
         * Will leave the numSlices as the defaultParallelism, as I have no particular cause to change it. 
         */
        val rddOfArrayOfArrayOfAnys=spark.sparkContext.parallelize(arrayOfArrayOfAnys)
    
        /* We'll be using sqlContext.createDataFrame to add a schema to our RDD.
         * The RDD which goes into createDataFrame is an RDD[Row] which is not what we happen to have.
         * To convert anything one tall and several wide into a Row, one can use Row.fromSeq(thatThing.toSeq)
         * As we have an RDD[somethingWeDontWant], we can map each of the RDD rows into the desired Row type. 
         */     
        val rddOfRows=rddOfArrayOfArrayOfAnys.map(f=>
            Row.fromSeq(f.toSeq)
        )
    
        /* Now to construct our schema. This needs to be a StructType of 1 StructField per column in our dataframe.
         * https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructField shows the definition as
         *   case class StructField(name: String, dataType: DataType, nullable: Boolean = true, metadata: Metadata = Metadata.empty)
         * Will leave the two default values in place for each of the columns:
         *        nullability as true, 
         *        metadata as an empty Map[String,Any]
         *   
         */
    
        val schema = StructType(
            StructField("colOfStrings", StringType) ::
            StructField("colOfLongs"  , LongType  ) ::
            StructField("colOfDoubles", DoubleType) ::
            Nil
        )
    
        val df=spark.sqlContext.createDataFrame(rddOfRows,schema)
        /*
         *      +------------+----------+------------+
         *      |colOfStrings|colOfLongs|colOfDoubles|
         *      +------------+----------+------------+
         *      |     aString|         0|     3.14159|
         *      |     bString|9876543210|     2.71828|
         *      +------------+----------+------------+
        */ 
        df.show 
    

    Same steps, but with fewer val declarations:

        val arrayOfArrayOfAnys=Array(
            Array("aString",0L         ,3.14159),
            Array("bString",9876543210L,2.71828)
        )
    
        val rddOfRows=spark.sparkContext.parallelize(arrayOfArrayOfAnys).map(f=>Row.fromSeq(f.toSeq))
    
        /* If one knows the datatypes, for instance from JDBC queries and their RDBMS column metadata,
         * consider constructing the schema from an Array[StructField]. This would allow looping over
         * the columns, with a match statement applying the appropriate SQL datatype as the second
         * StructField argument (a sketch of this follows after the block).
         */
        val sf=new Array[StructField](3)
        sf(0)=StructField("colOfStrings",StringType)
        sf(1)=StructField("colOfLongs"  ,LongType  )
        sf(2)=StructField("colOfDoubles",DoubleType)        
        val df=spark.sqlContext.createDataFrame(rddOfRows,StructType(sf.toList))
        df.show
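
    As a concrete illustration of the comment above about driving the schema from column metadata, here is a hedged sketch (the jdbcColumns pairs are hypothetical; the match maps type-name strings to Spark SQL DataTypes):

        // hypothetical (name, typeName) pairs, e.g. as read from JDBC ResultSetMetaData
        val jdbcColumns = Seq(("colOfStrings", "VARCHAR"), ("colOfLongs", "BIGINT"), ("colOfDoubles", "DOUBLE"))

        // map each pair to a StructField, choosing the DataType with a match on the type name
        val fieldsFromMetadata = jdbcColumns.map { case (name, typeName) =>
            val dataType = typeName match {
                case "VARCHAR" | "CHAR" => StringType
                case "BIGINT"           => LongType
                case "INTEGER"          => IntegerType
                case "DOUBLE"           => DoubleType
                case _                  => StringType // fallback; extend as needed
            }
            StructField(name, dataType)
        }
        val schemaFromMetadata = StructType(fieldsFromMetadata.toArray)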
    
  • 2020-11-22 15:35

    I tried to explain the solution using the word count problem:

    1. Read the file using sc
    2. Produce the word count
    3. Methods to create the DataFrame:

      • rdd.toDF
      • rdd.toDF("word","count")
      • spark.createDataFrame(rdd,schema)

      Read the file using sc

      val rdd=sc.textFile("D://cca175/data/")  
      

      RDD to DataFrame

      val df=sc.textFile("D://cca175/data/").toDF("t1")
      df.show

      Method 1

      Create the word count RDD (wordRdd, reused in the methods below) and convert it to a DataFrame

      val wordRdd=rdd.flatMap(x=>x.split(" ")).map(x=>(x,1)).reduceByKey((x,y)=>x+y)
      val df=wordRdd.toDF("word","count")
      

      Method 2

      Create a DataFrame from the RDD

      val df=spark.createDataFrame(wordRdd)
      // with column names
      val df=spark.createDataFrame(wordRdd).toDF("word","count")
      df.show
      

      Method 3

      Define Schema

      import org.apache.spark.sql.types._

      val schema=new StructType().
        add(StructField("word",StringType,true)).
        add(StructField("count",IntegerType,true))

      Create RowRDD

      import org.apache.spark.sql.Row
      val rowRdd=wordRdd.map(x => Row(x._1, x._2))
      

      Create DataFrame from RDD with schema

      val df=spark.createDataFrame(rowRdd,schema)
      df.show
