How to optimize percentage check and cols drop in large pyspark dataframe?

Submitted by 删除回忆录丶 on 2020-01-15 09:48:08

Question


I have a sample pandas dataframe like the one shown below. My real data, however, has 40 million rows and 5,200 columns.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    'readings': ['READ_1', 'READ_2', 'READ_1', 'READ_3', np.nan, 'READ_5', np.nan, 'READ_8', 'READ_10', 'READ_12', 'READ_11', 'READ_14', 'READ_09', 'READ_08', 'READ_07'],
    'val': [5, 6, 7, np.nan, np.nan, 7, np.nan, 12, 13, 56, 32, 13, 45, 43, 46],
})

from pyspark.sql.types import *
from pyspark.sql.functions import isnan, when, count, col

mySchema = StructType([
    StructField("subject_id", LongType(), True),
    StructField("readings", StringType(), True),
    StructField("val", FloatType(), True)
])

spark_df = spark.createDataFrame(df,schema=mySchema)

spark_df.select([
    ((count(when(isnan(c) | col(c).isNull(), c)) / spark_df.count()) * 100).alias(c)
    for c in spark_df.columns
]).show()

The above code gives me the percentage of nulls/NaNs in each column. But when I run the same code on my real data, it keeps running for a long time with no output. How do I optimize this check and drop the columns that have 80% (or more) nulls/NaNs? Below is my server config.

(Updated screenshot of the server configuration in the original post.)
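One likely bottleneck in the snippet above is that spark_df.count() sits inside the list comprehension, so it is evaluated once per column (roughly 5,200 separate count jobs) before the percentage aggregation even starts. Below is a minimal sketch, assuming the same isnan/isNull test used above is valid for every column type in the real data, that computes the row count once, collects all per-column null fractions in a single aggregation, and then keeps only the columns under the 80% threshold:

from pyspark.sql.functions import col, count, isnan, when

# Trigger the full row count exactly once instead of once per column.
total_rows = spark_df.count()

# Single aggregation job over all columns: fraction of null/NaN values.
# Keeps the same isnan/isNull test as the original snippet; assumes it
# applies cleanly to every column type in the real data.
null_fractions = spark_df.select([
    (count(when(isnan(c) | col(c).isNull(), c)) / total_rows).alias(c)
    for c in spark_df.columns
]).first().asDict()

# Keep only the columns with less than 80% missing values.
cols_to_keep = [c for c, frac in null_fractions.items() if frac < 0.8]
spark_df_reduced = spark_df.select(*cols_to_keep)

This is only a sketch; whether it is fast enough at 40 million rows by 5,200 columns still depends on the cluster resources and input format, and caching the DataFrame before the aggregation (if it fits) can help when it is read more than once.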

Source: https://stackoverflow.com/questions/58533373/how-to-optimize-percentage-check-and-cols-drop-in-large-pyspark-dataframe
