How to count occurrences of each distinct value for every column in a dataframe?

感动是毒 2021-02-01 03:48

edf.select("x").distinct.show() shows the distinct values present in the x column of the edf DataFrame.

Is there an efficient

6 Answers
  • 2021-02-01 03:51
    df.select("some_column").distinct.count
    
  • 2021-02-01 03:54

    Another option that does not rely on SQL functions:

    df.groupBy('your_column_name').count().show()
    

    show() prints each distinct value together with its number of occurrences. Without show(), the result is a DataFrame.
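
    The question title asks about every column, and the groupBy/count call above extends to that case by repeating it per column. Below is a minimal Scala sketch, assuming Spark 2.x or later; the object name, the valueCounts helper, and the sample data are hypothetical and only illustrate the idea.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object ValueCountsPerColumn {
      // Hypothetical helper: one groupBy/count per column, keyed by column name.
      // Each resulting DataFrame pairs a distinct value with its frequency.
      def valueCounts(df: DataFrame): Map[String, DataFrame] =
        df.columns.map(c => c -> df.groupBy(c).count()).toMap

      def main(args: Array[String]): Unit = {
        // Local session for illustration only; in spark-shell, `spark` already exists.
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("value-counts")
          .getOrCreate()
        import spark.implicits._

        // Made-up sample data standing in for the asker's DataFrame.
        val df = Seq(("a", 1), ("b", 1), ("a", 2)).toDF("x", "y")

        valueCounts(df).foreach { case (name, counts) =>
          println(s"Column: $name")
          counts.show()
        }

        spark.stop()
      }
    }
    ```

    Each column triggers its own Spark job here, so for a wide DataFrame it may be worth caching df first.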

  • 2021-02-01 03:57

    Roughly speaking, how it works:

  • 2021-02-01 04:00

    countDistinct is probably the first choice:

    import org.apache.spark.sql.functions.countDistinct
    
    df.agg(countDistinct("some_column"))
    

    If speed matters more than accuracy, consider approx_count_distinct (approxCountDistinct in Spark 1.x):

    import org.apache.spark.sql.functions.approx_count_distinct
    
    df.agg(approx_count_distinct("some_column"))
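
    The speed/accuracy trade-off can be tuned: approx_count_distinct also accepts a second argument, the maximum relative standard deviation allowed (0.05 by default). A minimal sketch, assuming Spark 2.1 or later; the object name, the exactAndApprox helper, and the generated data are hypothetical.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{approx_count_distinct, countDistinct}

    object ApproxVsExact {
      // Hypothetical helper: returns (exact, approximate) distinct counts for one column.
      def exactAndApprox(df: DataFrame, column: String, rsd: Double): (Long, Long) = {
        val row = df.agg(
          countDistinct(column).alias("exact"),
          // rsd = maximum relative standard deviation allowed for the estimate
          approx_count_distinct(column, rsd).alias("approx")
        ).head()
        (row.getLong(0), row.getLong(1))
      }

      def main(args: Array[String]): Unit = {
        // Local session for illustration only.
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("approx-distinct")
          .getOrCreate()

        // Made-up data: 100000 rows with 1000 distinct values in some_column.
        val df = spark.range(100000).selectExpr("id % 1000 AS some_column")

        val (exact, approx) = exactAndApprox(df, "some_column", 0.01)
        println(s"exact = $exact, approx = $approx")

        spark.stop()
      }
    }
    ```

    A smaller rsd gives a tighter estimate at the cost of more memory per aggregation.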
    

    To get values and counts:

    df.groupBy("some_column").count()
    

    In SQL (spark-sql):

    SELECT COUNT(DISTINCT some_column) FROM df
    

    and

    SELECT approx_count_distinct(some_column) FROM df
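
    For the SQL queries above to run from Scala code, the DataFrame first has to be visible to the SQL engine under the name df. A minimal sketch, assuming Spark 2.x or later and registering a temporary view; the object name, the distinctCount helper, and the sample data are hypothetical.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object DistinctCountSql {
      // Hypothetical helper: registers the DataFrame as the view "df" and counts
      // the distinct values of some_column through SQL.
      def distinctCount(spark: SparkSession, df: DataFrame): Long = {
        df.createOrReplaceTempView("df")
        spark.sql("SELECT COUNT(DISTINCT some_column) AS n FROM df").head().getLong(0)
      }

      def main(args: Array[String]): Unit = {
        // Local session for illustration only.
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("distinct-sql")
          .getOrCreate()
        import spark.implicits._

        // Made-up sample data.
        val df = Seq("a", "b", "a").toDF("some_column")

        println(distinctCount(spark, df))

        // Values and their counts, the SQL counterpart of groupBy(...).count()
        spark.sql("SELECT some_column, COUNT(*) AS cnt FROM df GROUP BY some_column").show()

        spark.stop()
      }
    }
    ```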
    
  • 2021-02-01 04:02

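    If you are using Java, the statement import org.apache.spark.sql.functions.countDistinct; fails with the error: The import org.apache.spark.sql.functions.countDistinct cannot be resolved. countDistinct is a static method of the functions class rather than a type, so a regular import cannot resolve it.

    To use countDistinct in Java, use the format below: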

    import org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.*;        // brings the functions class into scope
    import org.apache.spark.sql.types.*;
    
    // Call the static method through its class
    df.agg(functions.countDistinct("some_column"));
    
  • 2021-02-01 04:06
    import org.apache.spark.sql.functions.countDistinct
    
    df.groupBy("a").agg(countDistinct("s")).collect()
    