Question
I have a CSV file that is separated by the pipe delimiter "|", and the dataset has 2 columns, like below:
Column1|Column2
1|Name_a
2|Name_b
But sometimes we receive only one column value and the other is missing, like below:
Column1|Column2
1|Name_a
2|Name_b
3
4
5|Name_c
6
7|Name_f
So any row with a mismatched number of columns is a garbage value for us; in the example above, these are the rows with Column1 values 3, 4, and 6, and we want to discard them.
Is there a direct way to discard those rows, without getting an exception while reading the data from spark-shell like below?
val readFile = spark.read.option("delimiter", "|").csv("File.csv").toDF(Seq("Column1", "Column2"): _*)
When we try to read the file, we get the exception below.
java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (1): _c0
New column names (2): Column1, Column2
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.Dataset.toDF(Dataset.scala:435)
... 49 elided
Answer 1:
You can specify the schema of your data file and allow some columns to be nullable. In Scala it may look like:
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Declare both columns as nullable strings so that short rows do not fail the read.
val schm = StructType(
  StructField("Column1", StringType, nullable = true) ::
  StructField("Column2", StringType, nullable = true) :: Nil)

val readFile = spark.read
  .option("delimiter", "|")
  .schema(schm)
  .csv("File.csv")
Then you can filter your dataset on the column that may be missing being not null.
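For example, a minimal sketch of that filtering step (the Column2 name matches the question; adjust to your actual schema):

// Keep only rows where the second column was actually present.
val cleaned = readFile.filter(readFile("Column2").isNotNull)
cleaned.show()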
Answer 2:
Just add the DROPMALFORMED mode to the options while reading, as below. Setting this makes Spark drop the corrupted records.
val readFile = spark.read
.option("delimiter", "|")
.option("mode", "DROPMALFORMED") // Option to drop invalid rows.
.csv("File.csv")
.toDF(Seq("Column1", "Column2"): _*)
This is documented in the Spark DataFrameReader CSV options.
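If you also want named columns while dropping the bad rows, here is a sketch that combines an explicit schema with DROPMALFORMED (this combination is an illustration rather than either answer verbatim; column names are taken from the question):

import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(
  StructField("Column1", StringType, nullable = true) ::
  StructField("Column2", StringType, nullable = true) :: Nil)

// With an explicit two-column schema, rows such as "3" or "6" have the wrong
// number of tokens and are treated as malformed, so DROPMALFORMED discards
// them instead of raising an error.
val readFile = spark.read
  .option("delimiter", "|")
  .option("mode", "DROPMALFORMED")
  .schema(schema)
  .csv("File.csv")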
Source: https://stackoverflow.com/questions/54281899/spark-shell-the-number-of-columns-doesnt-match