Question
I am iterating over several CSV files and running some checks on them. For one file I am getting a NullPointerException,
and I suspect that it contains some empty rows.
So I am running the following, and for some reason it gives me an OK
output:
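import pyspark.sql.functions as sf
from pyspark.sql.types import BooleanType
# True only when every field of the row is None
check_empty = lambda row : not any([False if k is None else True for k in row])
check_empty_udf = sf.udf(check_empty, BooleanType())
df.filter(check_empty_udf(sf.struct([col for col in df.columns]))).show()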
Am I missing something in the filter function, or is it not possible to extract empty rows from a DataFrame?
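To see what the predicate in the question actually tests, here is a minimal plain-Python sketch with no Spark involved. It assumes the UDF receives each struct as a tuple-like Row, so ordinary tuples stand in for rows; the sample rows are hypothetical and not taken from the original files. It suggests the predicate flags only rows in which every field is None, so a row of empty strings would pass unnoticed.

```python
# Plain-Python sketch: what the predicate from the question evaluates to.
# Tuples stand in for Spark Row objects here (an assumption for illustration).

def check_empty(row):
    """Return True only when every field in the row is None."""
    return all(k is None for k in row)

# Hypothetical sample rows (not taken from the original files).
sample_rows = [
    (None, None, None),  # every field missing
    (None, "a", None),   # partially filled
    ("", "", ""),        # empty strings are values, not None
    (1, 2, 3),           # fully filled
]

flags = [check_empty(r) for r in sample_rows]
print(flags)  # [True, False, False, False]
```

If the filter shows no rows, one possible explanation is that the file has no row that is null in every column, in which case the NullPointerException may come from a partially empty row instead.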
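Answer 1:
You could use df.dropna() to drop empty rows and then compare the counts.
Note that dropna() defaults to how='any', which drops a row if any column is null; pass how='all' to drop only rows in which every column is null.
Something like
df_clean = df.dropna(how='all')
num_empty_rows = df.count() - df_clean.count()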
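The count difference depends on which dropna mode is used. The following is a plain-Python model of the 'any' and 'all' semantics, not Spark code; the function name model_dropna and the sample data are made up for illustration.

```python
# Plain-Python model of the two dropna modes; this is not Spark's implementation.

def model_dropna(rows, how="any"):
    """Keep rows roughly the way DataFrame.dropna(how=...) would, modelled on tuples."""
    if how == "any":
        # Drop a row as soon as one field is None.
        return [r for r in rows if all(v is not None for v in r)]
    if how == "all":
        # Drop a row only when every field is None.
        return [r for r in rows if any(v is not None for v in r)]
    raise ValueError("how must be 'any' or 'all'")

# Hypothetical data: one partially empty row and one fully empty row.
rows = [(1, "a"), (None, "b"), (None, None), (2, "c")]

dropped_any = len(rows) - len(model_dropna(rows, how="any"))
dropped_all = len(rows) - len(model_dropna(rows, how="all"))
print(dropped_any, dropped_all)  # 2 1
```

With 'any', the partially empty row is counted as well, so the difference overstates the number of fully empty rows.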
Answer 2:
You could use an inbuilt option for dealing with such scenarios.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED") // Drop empty/malformed rows
  .load("hdfs:///path/file.csv")
Check this reference - https://docs.databricks.com/spark/latest/data-sources/read-csv.html#reading-files
Source: https://stackoverflow.com/questions/53376449/check-for-empty-row-within-spark-dataframe