Difference between DataFrame, Dataset, and RDD in Spark

慢半拍i 2020-11-22 15:53

I'm just wondering, what is the difference between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row])?

15 Answers
  • 2020-11-22 16:09

    All three (RDD, DataFrame, and DataSet) in one picture:

    [Image: RDD vs DataFrame vs DataSet]

    RDD

    RDD is a fault-tolerant collection of elements that can be operated on in parallel.

    DataFrame

    DataFrame is a Dataset organised into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimisations under the hood.

    Dataset

    Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.


    Note:

    A Dataset of Rows (Dataset[Row]) in Scala/Java is often referred to as a DataFrame.
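
    As a minimal sketch of the three abstractions side by side (assuming a SparkSession named spark with SparkContext sc; the Sale case class is made up for illustration):

    import spark.implicits._ // enables .toDF / .toDS and the implicit Encoders
    
    case class Sale(id: String, amount: Double)
    
    // RDD: a distributed collection of JVM objects, with no schema known to Spark
    val rdd = sc.parallelize(Seq(Sale("a", 1.0), Sale("b", 2.0)))
    
    // DataFrame: named columns, optimized by Catalyst (Dataset[Row] in Scala)
    val df = rdd.toDF()
    
    // Dataset: named columns plus the compile-time element type Sale
    val ds = rdd.toDS()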


    A nice comparison of all three, with a code snippet:

    [Image: RDD vs DataFrame vs DataSet with code]


    Q: Can you convert one to the other, like RDD to DataFrame or vice versa?

    Yes, both are possible.

    1. RDD to DataFrame with .toDF()

    import spark.implicits._ // brings in the .toDF() conversion for RDDs of tuples
    
    val rowsRdd = sc.parallelize(
      Seq(
        ("first", 2.0, 7.0),
        ("second", 3.5, 2.5),
        ("third", 7.0, 5.9)
      )
    )
    
    val df = rowsRdd.toDF("id", "val1", "val2") // column names are applied here
    
    df.show()
    +------+----+----+
    |    id|val1|val2|
    +------+----+----+
    | first| 2.0| 7.0|
    |second| 3.5| 2.5|
    | third| 7.0| 5.9|
    +------+----+----+
    

    More ways: Convert an RDD object to Dataframe in Spark

    2. DataFrame/Dataset to RDD with the .rdd field

    val rowsRdd: RDD[Row] = df.rdd // DataFrame to RDD (.rdd is a field, not a method)
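
    The same field works on a typed Dataset and yields an RDD of the element type (sketch; the Person case class and personDs Dataset are hypothetical):

    val personRdd: RDD[Person] = personDs.rdd // Dataset[Person] to RDD[Person]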
    
  • 2020-11-22 16:10

    Apache Spark – RDD, DataFrame, and DataSet

    Spark RDD

    RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records. RDD is the fundamental data structure of Spark. It allows a programmer to perform in-memory computations on large clusters in a fault-tolerant manner, thus speeding up tasks.
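
    A hedged sketch of such an in-memory, parallel computation (assuming a SparkContext sc):

    // Each partition is processed in parallel; the recorded lineage makes
    // the result fault-tolerant if an executor is lost
    val nums = sc.parallelize(1 to 1000000)
    val sumOfSquares = nums.map(n => n.toLong * n).reduce(_ + _)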

    Spark Dataframe

    Unlike an RDD, data is organized into named columns, like a table in a relational database. It is an immutable distributed collection of data. A DataFrame in Spark allows developers to impose a structure onto a distributed collection of data, enabling a higher-level abstraction; a sketch of that follows below.
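
    A sketch of imposing structure, assuming sc and import spark.implicits._ (the column names are made up):

    // Naming the columns turns an untyped pair-RDD into a structured DataFrame
    val raw = sc.parallelize(Seq(("u1", 3), ("u2", 5)))
    val clicksDf = raw.toDF("user", "clicks")
    clicksDf.show()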

    Spark Dataset

    Datasets in Apache Spark are an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. A Dataset takes advantage of Spark’s Catalyst optimizer by exposing expressions and data fields to the query planner.
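
    A small sketch of that interface (assuming spark and import spark.implicits._; the Order case class is made up):

    case class Order(id: Long, amount: Double)
    
    val orders = Seq(Order(1L, 10.0), Order(2L, 25.0)).toDS()
    
    // Typed lambda: checked at compile time, but opaque to the optimizer
    val typed = orders.filter(o => o.amount > 15.0)
    
    // Column expression: fully visible to the Catalyst query planner
    val planned = orders.filter($"amount" > 15.0)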

  • 2020-11-22 16:11

    A Dataframe is an RDD of Row objects, each representing a record. A Dataframe also knows the schema (i.e., data fields) of its rows. While Dataframes look like regular RDDs, internally they store data in a more efficient manner, taking advantage of their schema. In addition, they provide new operations not available on RDDs, such as the ability to run SQL queries. Dataframes can be created from external data sources, from the results of queries, or from regular RDDs.

    Reference: Zaharia M., et al. Learning Spark (O'Reilly, 2015)
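
    For instance, the SQL capability mentioned above might look like this (sketch; people.json is a hypothetical file with name and age fields):

    // Register the DataFrame as a temporary view, then query it with SQL
    val people = spark.read.json("people.json")
    people.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
    adults.show()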

  • 2020-11-22 16:12

    A DataFrame is equivalent to a table in RDBMS and can also be manipulated in similar ways to the "native" distributed collections in RDDs. Unlike RDDs, Dataframes keep track of the schema and support various relational operations that lead to more optimized execution. Each DataFrame object represents a logical plan but because of their "lazy" nature no execution occurs until the user calls a specific "output operation".
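
    To make the lazy behaviour concrete (a sketch, reusing the df with columns id and val1 from the conversion example above):

    // Transformations only build up a logical plan; no job runs here
    val plan = df.filter($"val1" > 2.0).select("id")
    
    plan.explain() // prints the optimized plan without executing the job
    plan.show()    // an "output operation": only now does execution occur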

  • 2020-11-22 16:12

    A DataFrame is an RDD that has a schema. You can think of it as a relational database table, in that each column has a name and a known type. The power of DataFrames comes from the fact that, when you create a DataFrame from a structured dataset (JSON, Parquet, ...), Spark is able to infer a schema by making a pass over the entire dataset that's being loaded. Then, when calculating the execution plan, Spark can use the schema and apply substantially better computation optimizations. Note that DataFrame was called SchemaRDD before Spark v1.3.0.
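
    A sketch of that inference step (logs.json is a hypothetical semi-structured file):

    // Spark scans the JSON input to infer column names and types
    val logs = spark.read.json("logs.json")
    logs.printSchema() // shows the schema Spark inferred for the optimizer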

  • 2020-11-22 16:14

    Most of the answers here are correct; I only want to add one point.

    In Spark 2.0, the two APIs (DataFrame and Dataset) were unified into a single API.

    "Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface."

    Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or transmission over the network.
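
    A sketch of an explicit Encoder (the Event case class is made up; in practice the implicit Encoders from spark.implicits._ are usually used instead):

    import org.apache.spark.sql.{Encoder, Encoders}
    
    case class Event(id: Long, tag: String)
    
    // The Encoder maps Event to Spark's internal binary (Tungsten) format
    val eventEncoder: Encoder[Event] = Encoders.product[Event]
    val events = spark.createDataset(Seq(Event(1L, "a")))(eventEncoder)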

    Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
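
    A sketch of the reflection-based method (people.txt with "name,age" lines is hypothetical; assumes sc and import spark.implicits._):

    case class Person(name: String, age: Int)
    
    // The case class's fields, discovered via reflection, become the schema
    val peopleDs = sc.textFile("people.txt")
      .map(_.split(","))
      .map(a => Person(a(0), a(1).trim.toInt))
      .toDS()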

    The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until runtime.
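
    And a sketch of the programmatic method, for when the columns are only known at runtime (same hypothetical people.txt):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
    
    // Build the schema at runtime, e.g. from configuration...
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)
    ))
    
    // ...and apply it to an RDD[Row]
    val rowRdd = sc.textFile("people.txt")
      .map(_.split(","))
      .map(a => Row(a(0), a(1).trim.toInt))
    
    val peopleDf = spark.createDataFrame(rowRdd, schema)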

    Here you can find an answer on RDD to DataFrame conversion:

    How to convert rdd object to dataframe in spark
