What's the difference between RDD and Dataframe in Spark? [duplicate]

Submitted by 我是研究僧i on 2020-05-17 06:09:38

Question


Hi, I am relatively new to Apache Spark. I want to understand the difference between RDDs, DataFrames, and Datasets.

For example, I am pulling data from an S3 bucket.

df=spark.read.parquet("s3://output/unattributedunattributed*")

In this case, when I load data from S3, what would be the RDD? Also, since an RDD is immutable but I can change values in df, df couldn't be an RDD.

I would appreciate it if someone could explain the difference between RDDs, DataFrames, and Datasets.


Answer 1:


df=spark.read.parquet("s3://output/unattributedunattributed*")

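With this statement, you are creating a DataFrame.

To create an RDD, go through the SparkContext instead (textFile is a method of SparkContext, not of the SparkSession, and it reads the files as lines of text):

rdd = spark.sparkContext.textFile("s3://output/unattributedunattributed*")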

RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records. The RDD is the fundamental data structure of Spark, and it allows a programmer to perform in-memory computations.

In a DataFrame, data is organized into named columns, like a table in a relational database. It is also an immutable distributed collection of data. A DataFrame lets developers impose a structure onto a distributed collection of data, which allows a higher-level abstraction.

  1. If you want to apply a map or filter to the whole dataset, use an RDD.
  2. If you want to work on an individual column, or perform operations or calculations on a column, use a DataFrame.

For example, if you want to replace 'A' with 'B' throughout the data, an RDD is useful.

rdd = rdd.map(lambda x: x.replace('A', 'B'))

If you want to change the data type of a column, use a DataFrame.

from pyspark.sql.functions import col

dff = dff.withColumn("LastmodifiedTime_timestamp", col('LastmodifiedTime_time').cast('timestamp'))

An RDD can be converted into a DataFrame and vice versa.



Source: https://stackoverflow.com/questions/57566876/whats-the-difference-between-rdd-and-dataframe-in-spark