Spark-shell : The number of columns doesn't match


Question


I have a CSV file whose fields are separated by the pipe delimiter "|". The dataset has 2 columns, like below.

Column1|Column2
1|Name_a
2|Name_b

But sometimes we receive only one column value and the other is missing, like below:

Column1|Column2
1|Name_a
2|Name_b
3
4
5|Name_c
6
7|Name_f

So any row with a mismatched number of columns is a garbage value for us; in the above example those are the rows with column values 3, 4, and 6, and we want to discard them. Is there any direct way I can discard those rows, without getting an exception while reading the data from spark-shell, like below?

val readFile = spark.read.option("delimiter", "|").csv("File.csv").toDF(Seq("Column1", "Column2"): _*)

When we try to read the file, we get the exception below.

java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (1): _c0
New column names (2): Column1, Column2
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.sql.Dataset.toDF(Dataset.scala:435)
  ... 49 elided 

Answer 1:


You can specify the schema of your data file and allow some columns to be nullable. In Scala it may look like:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Declare both columns up front; nullable = true lets rows with a
// missing second field parse with null instead of failing.
val schm = StructType(
  StructField("Column1", StringType, nullable = true) ::
  StructField("Column2", StringType, nullable = true) :: Nil)

val readFile = spark.read
  .option("delimiter", "|")
  .schema(schm)
  .csv("File.csv")

Then you can filter your dataset on the column not being null.
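
A minimal sketch of that filter, assuming the schema-based read above; the col helper comes from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.col

// Rows such as "3" or "6" parse with Column2 = null under the explicit
// schema, so dropping nulls in Column2 discards exactly those rows.
val cleaned = readFile.filter(col("Column2").isNotNull)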




Answer 2:


Just add the DROPMALFORMED mode as an option while reading, as below. Setting this makes Spark drop the corrupted records.

val readFile = spark.read
  .option("delimiter", "|")
  .option("mode", "DROPMALFORMED") // Option to drop invalid rows.
  .csv("File.csv")
  .toDF(Seq("Column1", "Column2"): _*)

This is documented in the Spark DataFrameReader CSV documentation.
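
A quick sanity check, using the readFile value from the snippet above:

// The single-column rows (3, 4 and 6) are silently dropped during
// parsing, so only complete two-column rows appear in the output.
readFile.show()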



Source: https://stackoverflow.com/questions/54281899/spark-shell-the-number-of-columns-doesnt-match
