Difference between DataFrame, Dataset, and RDD in Spark

慢半拍i 2020-11-22 15:53

I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]).

15 answers
  • 2020-11-22 16:22

    All the answers here are great, and using each API involves some trade-offs. The Dataset API was built as a super API to solve a lot of problems, but the RDD often still works best if you understand your data well and if your processing algorithm is optimized to do a lot of work in a single pass over large data; in that case the RDD seems to be the best option.

    Aggregation using the Dataset API still consumes memory and will get better over time.
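
    As a rough sketch of the single-pass point (the numeric values here are hypothetical), one RDD aggregate call can compute several statistics in a single pass over the data:

    import org.apache.spark.sql.SparkSession

    // Hypothetical data; count, sum and max are computed in one pass with RDD.aggregate
    val spark = SparkSession.builder().appName("single-pass-sketch").master("local[*]").getOrCreate()
    val values = spark.sparkContext.parallelize(Seq(3.0, 7.5, 1.2, 9.9))

    val (count, sum, max) = values.aggregate((0L, 0.0, Double.MinValue))(
      (acc, v) => (acc._1 + 1, acc._2 + v, math.max(acc._3, v)),    // fold each element into the accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2, math.max(a._3, b._3))  // merge per-partition accumulators
    )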

  • 2020-11-22 16:26

    First of all, DataFrame evolved from SchemaRDD.

    Yes, conversion between a DataFrame and an RDD is absolutely possible.

    Below are some sample code snippets.

    • df.rdd is an RDD[Row]; a minimal sketch follows.
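
    A minimal sketch of that conversion, assuming a small hypothetical DataFrame built from a local collection:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SparkSession}

    val spark = SparkSession.builder().appName("df-to-rdd-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical DataFrame with two columns
    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // Every DataFrame exposes its underlying data as an RDD[Row]
    val rowRdd: RDD[Row] = df.rdd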

    Below are some options to create a DataFrame (an end-to-end sketch follows the schema options below).

    • 1) yourRddOfCaseClasses.toDF converts an RDD of case classes (or tuples) to a DataFrame, after import spark.implicits._.

    • 2) Using createDataFrame of the SparkSession (or SQLContext):

      val df = spark.createDataFrame(rddOfRow, schema)

    where the schema can come from one of the options below, as described in another nice SO post.
    From a Scala case class and the Scala reflection API:

    import org.apache.spark.sql.catalyst.ScalaReflection
    import org.apache.spark.sql.types.StructType

    val schema = ScalaReflection.schemaFor[YourScalaCaseClass].dataType.asInstanceOf[StructType]
    

    Or, using Encoders:

    import org.apache.spark.sql.Encoders
    val mySchema = Encoders.product[MyCaseClass].schema
    

    The schema can also be created using StructType and StructField:

    import org.apache.spark.sql.types._

    val schema = new StructType()
      .add(StructField("id", StringType, true))
      .add(StructField("col1", DoubleType, true))
      .add(StructField("col2", DoubleType, true))
      // ...and so on for further fields
    

    In fact, there are now three Apache Spark APIs:

    1. RDD API:

    The RDD (Resilient Distributed Dataset) API has been in Spark since the 1.0 release.

    The RDD API provides many transformation methods, such as map(), filter(), and reduce() for performing computations on the data. Each of these methods results in a new RDD representing the transformed data. However, these methods are just defining the operations to be performed and the transformations are not performed until an action method is called. Examples of action methods are collect() and saveAsObjectFile().

    RDD Example:

    rdd.filter(_.age > 21)               // transformation
       .map(_.last)                      // transformation
       .saveAsObjectFile("under21.bin")  // action
    

    Example: Filter by attribute with RDD

    rdd.filter(_.age > 21)
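
    A self-contained variant of the example above, assuming a hypothetical Person case class with last and age fields:

    import org.apache.spark.sql.SparkSession

    // Hypothetical Person type; in a compiled app define it at top level
    case class Person(first: String, last: String, age: Int)

    val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", "Lee", 19), Person("Bob", "Roy", 35)))

    rdd.filter(_.age > 21)                  // transformation: nothing runs yet
       .map(_.last)                         // transformation: still lazy
       .saveAsObjectFile("/tmp/adults.bin") // action: triggers the actual computation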
    
    2. DataFrame API:

    Spark 1.3 introduced a new DataFrame API as part of the Project Tungsten initiative, which seeks to improve the performance and scalability of Spark. The DataFrame API introduces the concept of a schema to describe the data, allowing Spark to manage the schema and only pass data between nodes in a much more efficient way than using Java serialization.

    The DataFrame API is radically different from the RDD API because it is an API for building a relational query plan that Spark's Catalyst optimizer can then execute. The API is natural for developers who are familiar with building query plans.

    Example, SQL style:

    df.filter("age > 21");

    Limitations: Because the code refers to data attributes by name, it is not possible for the compiler to catch errors. If attribute names are incorrect, the error will only be detected at runtime, when the query plan is created.

    Another downside of the DataFrame API is that it is very Scala-centric and, while it does support Java, the support is limited.

    For example, when creating a DataFrame from an existing RDD of Java objects, Spark's Catalyst optimizer cannot infer the schema and assumes that any objects in the DataFrame implement the scala.Product interface. Scala case classes work out of the box because they implement this interface; see the sketch below.
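
    A minimal sketch of that behaviour, assuming a hypothetical Person case class (a Product, so the schema can be inferred) together with the name-based filter from above:

    import org.apache.spark.sql.SparkSession

    // Hypothetical case class; as a Product, its schema can be inferred
    case class Person(name: String, age: Int)

    val spark = SparkSession.builder().appName("df-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = spark.sparkContext.parallelize(Seq(Person("Ann", 19), Person("Bob", 35))).toDF()

    df.filter("age > 21").show()   // column referenced by name; a typo here only fails at runtime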

    3. Dataset API:

    The Dataset API, released as an API preview in Spark 1.6, aims to provide the best of both worlds: the familiar object-oriented programming style and compile-time type safety of the RDD API, but with the performance benefits of the Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the DataFrame API.

    When it comes to serializing data, the Dataset API has the concept of encoders which translate between JVM representations (objects) and Spark’s internal binary format. Spark has built-in encoders which are very advanced in that they generate byte code to interact with off-heap data and provide on-demand access to individual attributes without having to de-serialize an entire object. Spark does not yet provide an API for implementing custom encoders, but that is planned for a future release.

    Additionally, the Dataset API is designed to work equally well with both Java and Scala. When working with Java objects, it is important that they are fully bean-compliant.
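
    A minimal Scala sketch of the encoder idea, assuming a hypothetical Person case class; importing spark.implicits._ brings the built-in product encoder into scope:

    import org.apache.spark.sql.{Dataset, SparkSession}

    // Hypothetical case class
    case class Person(name: String, age: Int)

    val spark = SparkSession.builder().appName("dataset-sketch").master("local[*]").getOrCreate()
    import spark.implicits._   // brings Encoder[Person] (a product encoder) into scope

    // Build a typed Dataset directly, or convert an untyped DataFrame with .as[Person]
    val ds: Dataset[Person] = Seq(Person("Ann", 19), Person("Bob", 35)).toDS()
    val fromDf: Dataset[Person] = ds.toDF().as[Person]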

    Example, Dataset API (object-oriented, typed) style:

    dataset.filter(_.age < 21);
    

    [Images in the original answer: evaluation differences between DataFrame and Dataset, and the Catalyst-level flow, from the "Demystifying DataFrame and Dataset" Spark Summit presentation.]

    Further reading: the Databricks article "A Tale of Three Apache Spark APIs: RDDs vs DataFrames and Datasets".

  • 2020-11-22 16:30

    Spark RDD (Resilient Distributed Dataset):

    RDD is the core data abstraction API and has been available since the very first release of Spark (Spark 1.0). It is a lower-level API for manipulating a distributed collection of data. The RDD API exposes some extremely useful methods which can be used to get very tight control over the underlying physical data structure. It is an immutable (read-only) collection of partitioned data distributed across different machines. RDDs enable in-memory computation on large clusters to speed up big-data processing in a fault-tolerant manner. To enable fault tolerance, an RDD uses a DAG (Directed Acyclic Graph), which consists of a set of vertices and edges. The vertices and edges in the DAG represent the RDDs and the operations to be applied on those RDDs, respectively. The transformations defined on an RDD are lazy and execute only when an action is called.

    Spark DataFrame:

    Spark 1.3 introduced a new data abstraction API, the DataFrame (the DataSet API followed in Spark 1.6). The DataFrame API organizes the data into named columns, like a table in a relational database. It enables programmers to define a schema on a distributed collection of data. Each row in a DataFrame is of type Row. As in an SQL table, each column in a DataFrame has the same number of rows. In short, a DataFrame is a lazily evaluated plan which specifies the operations that need to be performed on the distributed collection of data. A DataFrame is also an immutable collection.

    Spark DataSet:

    As an extension of the DataFrame API, Spark 1.6 introduced the DataSet API, which provides a strongly typed, object-oriented programming interface in Spark. It is an immutable, type-safe collection of distributed data. Like the DataFrame API, the DataSet API also uses the Catalyst engine to enable execution optimization.
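
    As a compact, hypothetical sketch contrasting the three abstractions on the same data (a made-up Person case class): the RDD holds plain objects, the DataFrame holds rows with named columns, and the Dataset adds compile-time types:

    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

    // Hypothetical case class shared by all three examples
    case class Person(name: String, age: Int)

    val spark = SparkSession.builder().appName("three-apis-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 19), Person("Bob", 35))

    val rdd = spark.sparkContext.parallelize(people)   // RDD[Person]: plain objects, structure unknown to Spark
    val df: DataFrame = people.toDF()                  // DataFrame: named columns, untyped Row values
    val ds: Dataset[Person] = people.toDS()            // Dataset: named columns plus compile-time types

    rdd.filter(_.age > 21)    // lambda on objects
    df.filter("age > 21")     // column name resolved at runtime
    ds.filter(_.age > 21)     // typed lambda, checked at compile time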

    Other differences were presented as a comparison image in the original answer.
