I have a dataframe like this:
df = spark.createDataFrame(
    [(0, ["B", "C", "D", "E"]), (1, ["E", "A", "C"]), (2, ["F", "A", "E", "B"]),
     (3, ["E", "G", "A"]), (4, ["A", "C", "E", "B", "D"])],
    ["id", "items"])
Observation:

`explode` won't change the overall amount of data in your pipeline. The total amount of required space is the same in both the wide (array) and the long (exploded) format. Moreover, the latter distributes better in Spark, which is better suited for long and narrow than for short and wide data. So

df.select(explode("items").alias("item")).groupBy("item").count()

is the way to go.
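For completeness, a minimal end-to-end sketch of that approach, assuming the `df` built above; `explode` comes from `pyspark.sql.functions`, the `orderBy` is only there to make the output deterministic, and the counts shown match the `Counter` result further down.

from pyspark.sql.functions import explode

counts = (df
    .select(explode("items").alias("item"))  # one row per array element
    .groupBy("item")                         # group identical elements together
    .count())                                # count rows per element

counts.orderBy("item").show()
# +----+-----+
# |item|count|
# +----+-----+
# |   A|    4|
# |   B|    3|
# |   C|    3|
# |   D|    2|
# |   E|    5|
# |   F|    1|
# |   G|    1|
# +----+-----+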
However, if you really want to avoid that (for whatever reason) you can use `RDD` and `aggregate`:
from collections import Counter

df.rdd.aggregate(
    Counter(),                                  # zero value: an empty Counter
    lambda acc, row: acc + Counter(row.items),  # seqOp: add each row's items within a partition
    lambda acc1, acc2: acc1 + acc2              # combOp: merge the per-partition Counters
)
# Counter({'B': 3, 'C': 3, 'D': 2, 'E': 5, 'A': 4, 'F': 1, 'G': 1})
Note that, unlike the DataFrame `explode`, it stores all data in memory and is eager.
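If that is a concern, one possible alternative (a rough sketch, assuming the same `items` column as above) is to pair each element with a count and let `reduceByKey` do the summation across partitions; the order of the collected pairs is not guaranteed:

from operator import add

(df.rdd
    .flatMap(lambda row: row.items)  # emit every array element as its own record
    .map(lambda item: (item, 1))     # pair each element with an initial count of 1
    .reduceByKey(add)                # sum the counts per element across partitions
    .collect())
# [('A', 4), ('B', 3), ('C', 3), ('D', 2), ('E', 5), ('F', 1), ('G', 1)]  (order may vary)

This still brings one pair per distinct item back to the driver, but the per-record counting stays on the executors.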