How to convert an RDD object to a DataFrame in Spark

Asked 2020-11-22 14:59 by 慢半拍i · 11 answers · 2113 views

How can I convert an RDD (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]) to a DataFrame (org.apache.spark.sql.DataFrame)? I converted a DataFrame to an RDD using .rdd; after processing it, I want it back as a DataFrame.

11 Answers
  • 2020-11-22 15:40

    SparkSession has a number of createDataFrame methods that create a DataFrame given an RDD. I imagine one of these will work for your context.

    For example:

    def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
    

    Creates a DataFrame from an RDD containing Rows using the given schema.
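
    For instance, a minimal sketch of that overload, assuming an existing SparkSession named spark (the rows, column names, and types below are invented for illustration):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical input: an RDD[Row] with a name and an age per row.
val rowRDD = spark.sparkContext.parallelize(Seq(Row("Alice", 30), Row("Bob", 25)))

// The schema must match the Row contents field by field.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

val df = spark.createDataFrame(rowRDD, schema)
```

    Because the rows are plain Row objects, Spark cannot infer the column types on its own; the StructType supplies them.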

  • 2020-11-22 15:41


    I am posting this answer because I would like to share additional details about the available options that I did not find in the other answers.


    To create a DataFrame from an RDD of Rows, there are two main options:

    1) As already pointed out, you could use toDF() which can be imported by import sqlContext.implicits._. However, this approach only works for the following types of RDDs:

    • RDD[Int]
    • RDD[Long]
    • RDD[String]
    • RDD[T <: scala.Product]

    (source: Scaladoc of the SQLContext.implicits object)

    The last signature actually means that it can work for an RDD of tuples or an RDD of case classes (because tuples and case classes are subclasses of scala.Product).

    So, to use this approach for an RDD[Row], you have to map it to an RDD[T <: scala.Product]. This can be done by mapping each row to a custom case class or to a tuple, as in the following code snippets:

    val df = rdd.map({ 
      case Row(val1: String, ..., valN: Long) => (val1, ..., valN)
    }).toDF("col1_name", ..., "colN_name")
    

    or

    case class MyClass(val1: String, ..., valN: Long = 0L)
    val df = rdd.map({ 
      case Row(val1: String, ..., valN: Long) => MyClass(val1, ..., valN)
    }).toDF("col1_name", ..., "colN_name")
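
    For concreteness, a filled-in version of the case-class variant could look like this (a sketch: the Person class, the two-column rows, and the SparkSession named spark are all assumed for illustration):

```scala
import org.apache.spark.sql.Row

// Hypothetical case class standing in for MyClass above.
case class Person(name: String, age: Long)

// Hypothetical RDD[Row] standing in for the generic rdd above.
val rowRdd = spark.sparkContext.parallelize(Seq(Row("Alice", 30L), Row("Bob", 25L)))

import spark.implicits._
val df = rowRdd.map { case Row(name: String, age: Long) => Person(name, age) }.toDF()
```

    Since Person is a case class (and therefore a scala.Product), toDF() can derive both the column names and the types from it.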
    

    The main drawback of this approach (in my opinion) is that you have to explicitly set the schema of the resulting DataFrame in the map function, column by column. This might be done programmatically if you don't know the schema in advance, but things can get a little messy there. So, alternatively, there is another option:
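
    One way to handle a schema that is only known at runtime is to assemble the StructType from a list of column names and feed it to createDataFrame (a sketch that assumes all columns are strings and a SparkSession named spark; the names and rows are invented for illustration):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Column names only discovered at runtime (invented for illustration).
val colNames = Seq("col1", "col2")
val schema = StructType(colNames.map(n => StructField(n, StringType, nullable = true)))

val rowRdd = spark.sparkContext.parallelize(Seq(Row("a", "b"), Row("c", "d")))
val df = spark.createDataFrame(rowRdd, schema)
```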


    2) You can use createDataFrame(rowRDD: RDD[Row], schema: StructType) as in the accepted answer, which is available in the SQLContext object. Example for converting an RDD of an old DataFrame:

    val rdd = oldDF.rdd
    val newDF = oldDF.sqlContext.createDataFrame(rdd, oldDF.schema)
    

    Note that there is no need to set any schema column explicitly: we reuse the old DataFrame's schema, which is of class StructType and can easily be extended. However, this approach is sometimes not possible, and in some cases can be less efficient than the first one.

  • 2020-11-22 15:43

    On newer versions of Spark (2.0+), where rdd is assumed to be, for example, an RDD of tuples or case classes:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql._
    import org.apache.spark.sql.types._
    
    val spark = SparkSession
      .builder()
      .getOrCreate()
    import spark.implicits._
    
    val dfSchema = Seq("col1", "col2", "col3")
    rdd.toDF(dfSchema: _*)
    
  • 2020-11-22 15:44

    Method 1: (Scala)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    val df_2 = sc.parallelize(Seq((1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c"))).toDF("x", "y", "z")
    

    Method 2: (Scala)

    case class temp(val1: String, val3: Double)

    val rdd = sc.parallelize(Seq(
      Row("foo", 0.5), Row("bar", 0.0)
    ))
    val rows = rdd.map { case Row(val1: String, val3: Double) => temp(val1, val3) }.toDF()
    rows.show()
    

    Method 1: (Python)

    from pyspark.sql import Row

    l = [('Alice', 2)]
    Person = Row('name', 'age')
    rdd = sc.parallelize(l)
    person = rdd.map(lambda r: Person(*r))
    df2 = sqlContext.createDataFrame(person)
    df2.show()
    

    Method 2: (Python)

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    l = [('Alice', 2)]
    rdd = sc.parallelize(l)
    schema = StructType([StructField("name", StringType(), True),
                         StructField("age", IntegerType(), True)])
    df3 = sqlContext.createDataFrame(rdd, schema)
    df3.show()
    

    Alternatively, extract the values from the Row objects and then apply a case class to convert the RDD to a DataFrame (note that Row's apply method returns Any, so the values need an explicit conversion such as toString):

    val temp1 = attrib1.map { case Row(key: Int) => s"$key" }
    val temp2 = attrib2.map { case Row(key: Int) => s"$key" }

    case class RLT(id: String, attrib_1: String, attrib_2: String)
    import hiveContext.implicits._

    val df = result.map { s => RLT(s(0).toString, s(1).toString, s(2).toString) }.toDF()
    
  • 2020-11-22 15:44

    To convert an Array[Row] to a DataFrame or Dataset, the following works elegantly. Say schema is the StructType for the Row; then:

    val rows: Array[Row] = ...
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    implicit val encoder = RowEncoder.apply(schema)
    import spark.implicits._
    rows.toDS
    