Question
Is it possible to iterate through a PySpark groupBy DataFrame without aggregation or count?
For example, in pandas:

# df2 is a pandas GroupBy object, e.g. df.groupby('some_col')
for i, d in df2:
    mycode ....
Is there a different way to iterate a groupBy in PySpark, or do I have to use aggregation and count?
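To make that concrete, here is a minimal runnable version of the pandas pattern being asked about (the data and column names are made up purely for illustration):

import pandas as pd

# toy data, invented for illustration
df = pd.DataFrame({"cut": ["Ideal", "Premium", "Ideal"],
                   "price": [326, 334, 340]})

df2 = df.groupby("cut")   # a pandas GroupBy object
for i, d in df2:          # i is the group key, d is the sub-DataFrame for that group
    print(i, len(d))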
Answer 1:
At best you can use .first or .last to get the respective values from the groupBy, but you cannot get everything the way you can in pandas.
For example:
from pyspark.sql import functions as f
df.groupBy(df['some_col']).agg(f.first(df['col1']), f.first(df['col2'])).show()
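For the .last side of that, a similar sketch (using the same placeholder names df, some_col, col1, col2 as above, so it is illustrative rather than tied to a real dataset):

from pyspark.sql import functions as f

df.groupBy(df['some_col']).agg(
    f.first(df['col1']).alias('first_col1'),   # first value of col1 within each group
    f.last(df['col2']).alias('last_col2'),     # last value of col2 within each group
).show()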
Since there is a basic difference between the way data is handled in pandas and in Spark, not all functionality can be used in the same way.
There are a few workarounds to get what you want, for example:
For the following diamonds DataFrame:
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21| Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 4| 0.29| Premium| I| VS2| 62.4| 58.0| 334| 4.2|4.23|2.63|
| 5| 0.31| Good| J| SI2| 63.3| 58.0| 335|4.34|4.35|2.75|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
You can use:
l = [x.cut for x in diamonds.select("cut").distinct().rdd.collect()]

def groups(df, i):
    import pyspark.sql.functions as f
    return df.filter(f.col("cut") == i)

# for multi grouping
def groups_multi(df, i):
    import pyspark.sql.functions as f
    return df.filter((f.col("cut") == i) & (f.col("color") == 'E'))  # use | for "or"

for i in l:
    groups(diamonds, i).show(2)
Output:
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 2| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 4| 0.29|Premium| I| VS2| 62.4| 58.0| 334| 4.2|4.23|2.63|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 2 rows
+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23|Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 12| 0.23|Ideal| J| VS1| 62.8| 56.0| 340|3.93| 3.9|2.46|
+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
...
In the groups function you can decide what kind of grouping you want for the data. It is a simple filter condition, but it gets you each group separately; you can also wrap it into a loop that reads like the pandas version, as sketched below.
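A sketch of that wrapper, built on the same filter idea (the helper name iter_groups is made up, and note that it still runs one Spark job per group):

from pyspark.sql import functions as f

def iter_groups(df, col):
    """Yield (key, sub-DataFrame) pairs, one per distinct value of `col`."""
    keys = [row[col] for row in df.select(col).distinct().collect()]
    for key in keys:
        yield key, df.filter(f.col(col) == key)

for cut_value, sub_df in iter_groups(diamonds, "cut"):
    sub_df.show(2)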
Answer 2:
When we do a groupBy we end up with a RelationalGroupedDataset (exposed as GroupedData in PySpark), which is a fancy name for a DataFrame that has a grouping specified but needs the user to specify an aggregation before it can be queried further.
When you try to call any DataFrame method (such as show) on the grouped DataFrame, it throws an error:
AttributeError: 'GroupedData' object has no attribute 'show'
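A small sketch to illustrate the point (reusing the diamonds DataFrame from Answer 1):

from pyspark.sql import functions as f

grouped = diamonds.groupBy("cut")   # a GroupedData object, not a DataFrame
# grouped.show()                    # raises AttributeError: 'GroupedData' object has no attribute 'show'
grouped.agg(f.count("*").alias("n")).show()   # an aggregation turns it back into a DataFrame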
Source: https://stackoverflow.com/questions/59622573/pyspark-groupby-dataframe-without-aggregation-or-count