I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]).
RDD
RDD is a fault-tolerant collection of elements that can be operated on in parallel.
DataFrame
DataFrame is a Dataset organised into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimisations under the hood.
Dataset
Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
Note: a Dataset of Rows (Dataset[Row]) in Scala/Java is often referred to as a DataFrame.
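As a quick illustration of how the three relate, here is a minimal sketch (assuming a SparkSession named spark with SparkContext sc, e.g. in spark-shell; the Person case class is made up for the example):

import spark.implicits._

case class Person(name: String, age: Int)   // illustrative type, not from the docs

val rdd = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))  // RDD[Person]
val df  = rdd.toDF()                                                 // DataFrame = Dataset[Row]
val ds  = rdd.toDS()                                                 // Dataset[Person], strongly typed

df.filter($"age" > 26).show()   // untyped: refers to columns by name
ds.filter(_.age > 26).show()    // typed: compile-time checked lambda on Person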
Q: Can you convert one to the other, like RDD to DataFrame or vice versa?
1. RDD to DataFrame with .toDF()
import org.apache.spark.rdd.RDD
import spark.implicits._  // enables .toDF on RDDs of tuples and case classes

// Note: .toDF works on an RDD of tuples or case classes; a plain RDD[Row]
// would instead need spark.createDataFrame(rowRdd, schema) with an explicit schema.
val rowsRdd: RDD[(String, Double, Double)] = sc.parallelize(
  Seq(
    ("first", 2.0, 7.0),
    ("second", 3.5, 2.5),
    ("third", 7.0, 5.9)
  )
)

val df = rowsRdd.toDF("id", "val1", "val2")
df.show()
+------+----+----+
| id|val1|val2|
+------+----+----+
| first| 2.0| 7.0|
|second| 3.5| 2.5|
| third| 7.0| 5.9|
+------+----+----+
More ways: Convert an RDD object to DataFrame in Spark
2. DataFrame/Dataset to RDD with .rdd
val rowsRdd: RDD[Row] = df.rdd // DataFrame to RDD[Row]; note that .rdd takes no parentheses
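Since df.rdd gives you an RDD[Row], you work with Row objects again, e.g. by pattern matching. A small sketch, assuming the df with columns id, val1, val2 from example 1:

import org.apache.spark.sql.Row

val ids = df.rdd.map { case Row(id: String, val1: Double, val2: Double) => id }
ids.collect().foreach(println)   // first, second, third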
Spark RDD –
RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records and the fundamental data structure of Spark. It allows a programmer to perform in-memory computations on large clusters in a fault-tolerant manner, which speeds up tasks.
Spark Dataframe –
Unlike an RDD, data is organized into named columns, like a table in a relational database. It is an immutable distributed collection of data. DataFrame in Spark allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction.
Spark Dataset –
Datasets in Apache Spark are an extension of the DataFrame API which provides a type-safe, object-oriented programming interface. A Dataset takes advantage of Spark’s Catalyst optimizer by exposing expressions and data fields to the query planner.
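A minimal sketch of that last point, assuming spark and sc are in scope (the Order type is made up): filtering on a column expression keeps the predicate visible to Catalyst, and explain() prints the optimized plan.

import spark.implicits._

case class Order(id: Long, total: Double)   // illustrative type

val orders = Seq(Order(1L, 10.0), Order(2L, 99.0)).toDS()
orders.filter($"total" > 50.0).explain()    // Catalyst-optimized physical plan
orders.filter($"total" > 50.0).show()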
A Dataframe is an RDD of Row objects, each representing a record. A Dataframe also knows the schema (i.e., data fields) of its rows. While Dataframes look like regular RDDs, internally they store data in a more efficient manner, taking advantage of their schema. In addition, they provide new operations not available on RDDs, such as the ability to run SQL queries. Dataframes can be created from external data sources, from the results of queries, or from regular RDDs.
Reference: Zaharia M., et al. Learning Spark (O'Reilly, 2015)
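To make the "run SQL queries" point concrete, here is a minimal sketch assuming the df built in example 1 above (the view name people is arbitrary):

df.createOrReplaceTempView("people")   // register the DataFrame for SQL access
spark.sql("SELECT id, val1 FROM people WHERE val1 > 3.0").show()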
A DataFrame is equivalent to a table in RDBMS and can also be manipulated in similar ways to the "native" distributed collections in RDDs. Unlike RDDs, Dataframes keep track of the schema and support various relational operations that lead to more optimized execution. Each DataFrame object represents a logical plan but because of their "lazy" nature no execution occurs until the user calls a specific "output operation".
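A small sketch of that lazy behaviour, again assuming df from example 1 and import spark.implicits._ for the $ column syntax: building the query only records a plan, and nothing runs until an output operation such as show() or count().

val query = df.filter($"val1" > 3.0).select($"id")  // no Spark job launched yet
query.explain()                                     // inspect the logical/physical plan
query.show()                                        // this output operation triggers execution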
A DataFrame is an RDD that has a schema. You can think of it as a relational database table, in that each column has a name and a known type. The power of DataFrames comes from the fact that, when you create a DataFrame from a structured dataset (JSON, Parquet, ...), Spark is able to infer a schema by making a pass over the entire dataset that is being loaded. Then, when calculating the execution plan, Spark can use the schema and do substantially better computation optimizations. Note that DataFrame was called SchemaRDD before Spark v1.3.0.
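A minimal sketch of that schema inference (the file path is illustrative, not from the answer):

val peopleDf = spark.read.json("/path/to/people.json")  // hypothetical JSON file
peopleDf.printSchema()   // column names and types inferred by scanning the data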
Most of the answers are correct; I only want to add one point here.
In Spark 2.0, the two APIs (DataFrame + Dataset) are unified into a single API.
"Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface."
Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or transmitting over the network.
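A short sketch of the Encoder point (the Sale type is made up for illustration):

import org.apache.spark.sql.{Encoder, Encoders}

case class Sale(item: String, amount: Double)           // illustrative type

val saleEncoder: Encoder[Sale] = Encoders.product[Sale] // code-generated encoder, not Java/Kryo serialization
val sales = spark.createDataset(Seq(Sale("book", 12.5), Sale("pen", 1.2)))(saleEncoder)
sales.show()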
Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
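A minimal sketch of the reflection-based method, where a case class supplies the column names and types (the Record type is made up):

import spark.implicits._

case class Record(id: String, value: Double)   // schema comes from these fields

val typedRdd = sc.parallelize(Seq(Record("a", 1.0), Record("b", 2.0)))
val recordsDs = typedRdd.toDS()   // reflection on Record infers the schema
recordsDs.printSchema()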
The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until runtime.
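And a minimal sketch of the programmatic method, where the schema is built at runtime and applied to an RDD[Row] (column names here are illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val columns = Seq("name", "city")   // e.g. only known at runtime
val schema = StructType(columns.map(name => StructField(name, StringType, nullable = true)))

val rows = sc.parallelize(Seq(Row("Ann", "Oslo"), Row("Bob", "Lima")))
val dynamicDf = spark.createDataFrame(rows, schema)
dynamicDf.show()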
Here you can find the RDD to DataFrame conversion answer:
How to convert rdd object to dataframe in spark