I am using Spark version 2.3 and working on a POC where I have to load a bunch of CSV files into a Spark DataFrame.
Considering the below CSV as a sample which I need t
You can try something like this, without UDFs:
import org.apache.spark.sql.functions.concat
import spark.implicits._

val data = spark.read.option("header", "true").csv("data/yourdata.csv")

// Build one boolean validity flag per column:
// - isINT: round-tripping age through double -> int -> string must reproduce the
//   original text, so "3.2" (truncated to "3") and "XX" (casts to null) both fail
// - isDATE: cast("date") yields null for invalid values such as "2019-30-10"
//   (bad month), "2019/09/11" (wrong format) and "2019-02-29" (2019 is not a leap year)
// - isCHAR: sex must be non-null and must not parse as an int, so "9" and null fail
val data2 = data.select('id,
  ('age.cast("double").cast("int").cast("string").equalTo('age) &&
    'age.cast("int").isNotNull).as("isINT"),
  'loaded_date.cast("date").isNotNull.as("isDATE"),
  ('sex.cast("int").isNotNull || 'sex.isNull).notEqual("true").as("isCHAR"))

data2.show()
+---+-----+------+------+
| id|isINT|isDATE|isCHAR|
+---+-----+------+------+
| 1| true| true| true|
| 2| true| true| true|
| 3| true| true| true|
| 4| true| false| true|
| 5| true| false| false|
| 6| true| false| true|
| 7| true| true| false|
| 8|false| true| true|
| 9|false| true| true|
| 10| true| false| true|
+---+-----+------+------+
// Concatenate all flag columns (everything after id) as strings; a row is
// corrupted as soon as any flag rendered as "false"
val corrupted = data2.select('id,
  concat(data2.columns.map(data2(_)).drop(1): _*).contains("false").as("isCorrupted"))

corrupted.show()
+---+-----------+
| id|isCorrupted|
+---+-----------+
| 1| false|
| 2| false|
| 3| false|
| 4| true|
| 5| true|
| 6| true|
| 7| true|
| 8| true|
| 9| true|
| 10| true|
+---+-----------+
// Join the flag back onto the original data to see each row with its verdict
data.join(corrupted, "id").show()
+---+----+---+-----------+----+-----------+
| id|name|age|loaded_date| sex|isCorrupted|
+---+----+---+-----------+----+-----------+
| 1| ABC| 32| 2019-09-11| M| false|
| 2|null| 33| 2019-09-11| M| false|
| 3| XYZ| 35| 2019-08-11| M| false|
| 4| PQR| 32| 2019-30-10| M| true|
| 5| EFG| 32| null|null| true|
| 6| DEF| 32| 2019/09/11| M| true|
| 7| XYZ| 32| 2017-01-01| 9| true|
| 8| KLM| XX| 2017-01-01| F| true|
| 9| ABC|3.2| 2019-10-10| M| true|
| 10| ABC| 32| 2019-02-29| M| true|
+---+----+---+-----------+----+-----------+
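If you only need the valid rows downstream, one possible follow-up (a sketch, reusing the `data` and `corrupted` DataFrames built above) is to filter on the flag and drop the helper column:

```scala
import org.apache.spark.sql.functions.col

// Keep only rows that were not flagged as corrupted, then drop the helper
// column so the result has the original schema again.
val clean = data.join(corrupted, "id")
  .filter(!col("isCorrupted"))
  .drop("isCorrupted")

clean.show()  // only ids 1, 2 and 3 remain, per the table above
```

The inverse (`filter(col("isCorrupted"))`) would give you just the bad rows, e.g. to write them to a quarantine location for inspection.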