Question
I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this.
In Python (pandas) I can do
data.shape
Is there a similar function in PySpark? This is my current solution, but I am looking for a more elegant one:
row_number = data.count()
column_number = len(data.dtypes)
The computation of the number of columns is not ideal...
Answer 1:
print((df.count(), len(df.columns)))
Answer 2:
Use df.count() to get the number of rows.
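For completeness, a minimal sketch, assuming an active SparkSession named spark and a small DataFrame created purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative DataFrame; replace with your own data.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])

rows = df.count()        # triggers a Spark job that scans the data
cols = len(df.columns)   # column names are available without running a job
print((rows, cols))      # (3, 2)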
Answer 3:
Add this to your code:
import pyspark.sql.dataframe

def spark_shape(self):
    return (self.count(), len(self.columns))

pyspark.sql.dataframe.DataFrame.shape = spark_shape
Then you can do
>>> df.shape()
(10000, 10)
But keep in mind that .count() can be very slow for very large datasets.
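If you need the shape more than once, one possible workaround is to cache the DataFrame before counting so later actions reuse the materialized data; whether this actually helps depends on your data size and cluster memory, so treat it as a sketch:

# Sketch: cache so repeated actions do not rescan the source.
df.cache()
n_rows = df.count()                # first action fills the cache
print((n_rows, len(df.columns)))
df.unpersist()                     # free the cached blocks when done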
Answer 4:
print((df.count(), len(df.columns)))
is easier for smaller datasets.
However, if the dataset is huge, an alternative approach would be to use pandas and Arrow to convert the DataFrame to a pandas DataFrame and call shape:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.crossJoin.enabled", "true")
print(df.toPandas().shape)
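As a side note, on Spark 3.x the Arrow flag has been renamed (the old name above still works but is deprecated), and toPandas() collects every row onto the driver, so this sketch assumes the data fits in driver memory:

# Spark 3.x spelling of the Arrow configuration key.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
print(df.toPandas().shape)   # collects all rows to the driver as a pandas DataFrame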
Answer 5:
I don't think there is a function like data.shape in Spark. But I would use len(data.columns) rather than len(data.dtypes).
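A quick sketch of the difference, assuming a DataFrame named data: both expressions return the same number, but dtypes is a list of (name, type) pairs rather than just names:

print(len(data.columns))   # number of columns, from the list of column names
print(len(data.dtypes))    # same number, computed from (name, type) pairs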
Source: https://stackoverflow.com/questions/39652767/pyspark-2-0-the-size-or-shape-of-a-dataframe