Check for empty row within spark dataframe?

Question

I am looping over several CSV files and running some checks on them. For one file I get a NullPointerException, and I suspect that it contains some empty rows.

So I am running the following, and for some reason it gives me an OK output:

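import pyspark.sql.functions as sf
from pyspark.sql.types import BooleanType
# True only when every field in the row is None
check_empty = lambda row: not any([False if k is None else True for k in row])
check_empty_udf = sf.udf(check_empty, BooleanType())
df.filter(check_empty_udf(sf.struct([col for col in df.columns]))).show()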

Am I missing something in the filter function, or is it not possible to extract empty rows from a DataFrame?

Answer 1:

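You could use df.dropna(how="all") to drop the rows in which every column is null, and then compare the counts. Note that a plain df.dropna() defaults to how="any" and drops every row that contains at least one null.

Something like

# how="all" drops only the rows where every column is null
df_clean = df.dropna(how="all")
num_empty_rows = df.count() - df_clean.count()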
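
To look at the empty rows themselves rather than only count them, the same check can probably be written with built-in column expressions instead of a UDF. The following is a minimal sketch, assuming "empty" means every column is null. The helper names all_null_condition and empty_rows are hypothetical and not part of PySpark.

```python
from functools import reduce
from operator import and_


def all_null_condition(columns, is_null):
    """AND together one null test per column.

    `is_null` maps a column name to a boolean (or a boolean Column);
    with Spark it would be `lambda c: F.col(c).isNull()`.
    Expects at least one column.
    """
    return reduce(and_, [is_null(c) for c in columns])


def empty_rows(df):
    """Return the rows of `df` in which every column is null."""
    # Imported lazily so all_null_condition can be used without pyspark installed.
    from pyspark.sql import functions as F

    cond = all_null_condition(df.columns, lambda c: F.col(c).isNull())
    return df.filter(cond)
```

With a DataFrame at hand, empty_rows(df).show() would display the matching rows. If it shows nothing, no row is entirely null, and the exception may come from rows that are only partially null.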

Answer 2:

You could use a built-in reader option for dealing with such scenarios.

val df = spark.read
     .format("csv")
     .option("header", "true")
     .option("mode", "DROPMALFORMED") // Drop empty/malformed rows
     .load("hdfs:///path/file.csv")

See this reference: https://docs.databricks.com/spark/latest/data-sources/read-csv.html#reading-files
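
Since the question is tagged pyspark, here is a rough Python equivalent of the Scala snippet above. It is only a sketch: it assumes an existing SparkSession is passed in, and the names CSV_OPTIONS and read_csv_dropping_malformed are illustrative. Whether a row of empty fields counts as malformed depends on the schema, so the result is worth verifying on the problem file.

```python
# Reader options mirroring the Scala snippet; the name CSV_OPTIONS is illustrative.
CSV_OPTIONS = {
    "header": "true",
    "mode": "DROPMALFORMED",  # drop rows that Spark cannot parse
}


def read_csv_dropping_malformed(spark, path):
    """Read a CSV through an existing SparkSession, skipping malformed rows."""
    return spark.read.format("csv").options(**CSV_OPTIONS).load(path)
```

Usage would look like df = read_csv_dropping_malformed(spark, "hdfs:///path/file.csv").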

Source: https://stackoverflow.com/questions/53376449/check-for-empty-row-within-spark-dataframe

Tags

apache-spark

pyspark
