Best way to get null counts, min and max values of multiple (100+) columns from a pyspark dataframe


Question


Say I have a list of column names, all of which exist in the dataframe:

cols = ['A', 'B', 'C', 'D']

I am looking for a quick way to get a table/dataframe like

     NA_counts min     max
A        5      0      100
B        10     0      120
C        8      1      99
D        2      0      500

TIA
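
For reference, a minimal sketch of a dataframe matching this setup (the data values here are invented for illustration; only the column names come from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: A, B, D are numeric, C is a string, some values null
df = spark.createDataFrame(
    [(1, 0, "x", 2010), (9, 5, None, 2017), (None, 3, "y", None)],
    ["A", "B", "C", "D"],
)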


Answer 1:


You can calculate each metric separately and then union them all, like this:

# sum/min/max here are the pyspark.sql.functions versions (they shadow the builtins)
from pyspark.sql.functions import col, lit, when, sum, max, min

# One aggregate expression per column: null count, maximum, minimum
nulls_cols = [sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c) for c in cols]
max_cols = [max(col(c)).alias(c) for c in cols]
min_cols = [min(col(c)).alias(c) for c in cols]

# Each select produces a single-row dataframe labelled with the metric name
nulls_df = df.select(lit("NA_counts").alias("count"), *nulls_cols)
max_df = df.select(lit("Max").alias("count"), *max_cols)
min_df = df.select(lit("Min").alias("count"), *min_cols)

# Stack the three one-row dataframes; union matches columns by position
# (unionAll is a deprecated alias for union)
nulls_df.union(max_df).union(min_df).show()

Output example:

+---------+---+---+----+----+
|    count|  A|  B|   C|   D|
+---------+---+---+----+----+
|NA_counts|  1|  0|   3|   1|
|      Max|  9|  5|Test|2017|
|      Min|  1|  0|Test|2010|
+---------+---+---+----+----+
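
As a side note (a sketch that is not part of the original answer, assuming the same df and cols as above): the three selects each scan the data, so with 100+ columns it can be cheaper to compute every metric in a single aggregation and reshape the one-row result on the driver. Min and max are cast to string inside the aggregation so that differently-typed columns fit into one result column:

from pyspark.sql import functions as F

# Build one aggregate expression per (column, metric) pair
agg_exprs = []
for c in cols:
    agg_exprs += [
        F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(c + "_na"),
        F.min(F.col(c)).cast("string").alias(c + "_min"),
        F.max(F.col(c)).cast("string").alias(c + "_max"),
    ]

# One Spark job: a single row holding every metric for every column
row = df.agg(*agg_exprs).first()

# Reshape to one output row per column, matching the table the question asks for
stats = [(c, row[c + "_na"], row[c + "_min"], row[c + "_max"]) for c in cols]
spark.createDataFrame(stats, ["column", "NA_counts", "min", "max"]).show()

This variant also gives the transposed layout from the question (one row per column) rather than one row per metric.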



Source: https://stackoverflow.com/questions/59312759/best-way-to-get-null-counts-min-and-max-values-of-multiple-100-columns-from
