Question
I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this.
In Python (pandas) I can do
data.shape
Is there a similar function in PySpark? This is my current solution, but I am looking for a more elegant one:
row_number = data.count()
column_number = len(data.dtypes)
The computation of the number of columns is not ideal...
Answer 1:
print((df.count(), len(df.columns)))
Answer 2:
Use df.count() to get the number of rows.
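For completeness, a minimal sketch, assuming an active SparkSession named spark and a small DataFrame created purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative DataFrame; replace with your own data.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])

rows = df.count()        # triggers a Spark job that scans the data
cols = len(df.columns)   # column names are available without running a job
print((rows, cols))      # (3, 2)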
Answer 3:
Add this to your code:
import pyspark.sql.dataframe

def spark_shape(self):
    return (self.count(), len(self.columns))

pyspark.sql.dataframe.DataFrame.shape = spark_shape
Then you can do
>>> df.shape()
(10000, 10)
But keep in mind that .count() can be very slow for very large datasets.
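If you need the shape more than once, one possible workaround is to cache the DataFrame before counting so later actions reuse the materialized data; whether this actually helps depends on your data size and cluster memory, so treat it as a sketch:

# Sketch: cache so repeated actions do not rescan the source.
df.cache()
n_rows = df.count()                # first action fills the cache
print((n_rows, len(df.columns)))
df.unpersist()                     # free the cached blocks when done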
Answer 4:
print((df.count(), len(df.columns)))
is easier for smaller datasets.
However, if the dataset is huge, an alternative approach would be to use pandas and Arrow to convert the DataFrame to a pandas DataFrame and call shape:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.crossJoin.enabled", "true")
print(df.toPandas().shape)
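As a side note, on Spark 3.x the Arrow flag has been renamed (the old name above still works but is deprecated), and toPandas() collects every row onto the driver, so this sketch assumes the data fits in driver memory:

# Spark 3.x spelling of the Arrow configuration key.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
print(df.toPandas().shape)   # collects all rows to the driver as a pandas DataFrame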
Answer 5:
I don't think there is a function like data.shape in Spark. But I would use len(data.columns) rather than len(data.dtypes).
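A quick sketch of the difference, assuming a DataFrame named data: both expressions return the same number, but dtypes is a list of (name, type) pairs rather than just names:

print(len(data.columns))   # number of columns, from the list of column names
print(len(data.dtypes))    # same number, computed from (name, type) pairs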
Source: https://stackoverflow.com/questions/39652767/pyspark-2-0-the-size-or-shape-of-a-dataframe