Question
I am new to PySpark. I am wondering what `rdd` means in a PySpark DataFrame.
weatherData = spark.read.csv('weather.csv', header=True, inferSchema=True)
These two lines of code have the same output. I am wondering what the effect of adding `.rdd` is:
weatherData.collect()
weatherData.rdd.collect()
Answer 1:
A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
So a DataFrame carries additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.

An RDD, on the other hand, is a Resilient Distributed Dataset: more of a black box of data that Spark cannot optimize, because the operations that can be performed against it are not as constrained.
However, you can go from a DataFrame to an RDD via its `.rdd` attribute, and you can go from an RDD back to a DataFrame (if the RDD is in a tabular format) via the `.toDF()` method.
In general, it is recommended to use a DataFrame where possible, due to the built-in query optimization.
Source: https://stackoverflow.com/questions/58367567/what-does-rdd-mean-in-pyspark-dataframe