How to count occurrences of each distinct value for every column in a dataframe?


edf.select("x").distinct.show() shows the distinct values present in the x column of the edf DataFrame.

Is there an efficient


6 Answers

2021-02-01 04:00

countDistinct is probably the first choice:

import org.apache.spark.sql.functions.countDistinct

df.agg(countDistinct("some_column"))

If speed matters more than accuracy, consider approx_count_distinct (approxCountDistinct in Spark 1.x):

import org.apache.spark.sql.functions.approx_count_distinct

df.agg(approx_count_distinct("some_column"))
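Since the question asks about every column, the two aggregates above can be applied to all columns in a single pass. The following is only a sketch: the object name, helper names, and sample data are made up for illustration, and it assumes Spark 2.1+ and column names without dots (a dotted name would need backticks inside col).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{approx_count_distinct, col, countDistinct}

// Hypothetical helper object, not part of Spark.
object DistinctCountsAllColumns {

  // Exact number of distinct values per column, returned as a one-row DataFrame.
  // Assumes df has at least one column.
  def exact(df: DataFrame): DataFrame = {
    val exprs = df.columns.map(c => countDistinct(col(c)).alias(c))
    df.agg(exprs.head, exprs.tail: _*)
  }

  // Approximate variant; rsd is the maximum relative standard deviation allowed.
  def approximate(df: DataFrame, rsd: Double = 0.05): DataFrame = {
    val exprs = df.columns.map(c => approx_count_distinct(col(c), rsd).alias(c))
    df.agg(exprs.head, exprs.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-counts-all-columns")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data for illustration only.
    val df = Seq(("a", 1), ("b", 1), ("a", 2)).toDF("x", "y")

    exact(df).show()
    approximate(df).show()

    spark.stop()
  }
}
```

Both helpers run one aggregation job regardless of the number of columns, which is usually cheaper than aggregating column by column.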

To get values and counts:

df.groupBy("some_column").count()
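To get values and counts for every column rather than a single one, the same groupBy/count call can be repeated per column. A minimal sketch, assuming Spark 2.x+, column names without dots, and a hypothetical helper named valueCounts (the sample data is invented):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc}

// Hypothetical helper object, not part of Spark.
object ValueCountsPerColumn {

  // Builds one (value, count) DataFrame per column, most frequent values first.
  def valueCounts(df: DataFrame): Map[String, DataFrame] =
    df.columns.map { name =>
      name -> df.groupBy(col(name)).count().orderBy(desc("count"))
    }.toMap

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("value-counts-per-column")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data for illustration only.
    val df = Seq(("a", 1), ("b", 1), ("a", 2)).toDF("x", "y")

    valueCounts(df).foreach { case (name, counts) =>
      println(s"Column: $name")
      counts.show()
    }

    spark.stop()
  }
}
```

Each column triggers its own job when its result is shown or collected, so for wide DataFrames it may help to cache df first.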

In SQL (spark-sql):

SELECT COUNT(DISTINCT some_column) FROM df

and

SELECT approx_count_distinct(some_column) FROM df
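The SQL statements above only work if the name df resolves to a table or view. From Scala, one way to arrange that is to register the DataFrame as a temporary view first. This is a sketch assuming Spark 2.1+; the object name, the exactDistinct helper, and the sample data are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of Spark.
object DistinctCountSql {

  // Runs COUNT(DISTINCT ...) against a registered view and returns the result.
  // The view and column names are interpolated as-is, so pass trusted values only.
  def exactDistinct(spark: SparkSession, view: String, column: String): Long =
    spark.sql(s"SELECT COUNT(DISTINCT $column) FROM $view").first().getLong(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-count-sql")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data for illustration only.
    val df = Seq("a", "b", "a").toDF("some_column")

    // Make the DataFrame visible to SQL under the name "df".
    df.createOrReplaceTempView("df")

    println(exactDistinct(spark, "df", "some_column"))
    spark.sql("SELECT approx_count_distinct(some_column) AS approx FROM df").show()
    // Values together with their counts, the SQL form of groupBy(...).count().
    spark.sql("SELECT some_column, COUNT(*) AS n FROM df GROUP BY some_column").show()

    spark.stop()
  }
}
```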


2021-02-01 04:06

import org.apache.spark.sql.functions.countDistinct

df.groupBy("a").agg(countDistinct("s")).collect()


Roughly speaking, how it works: