Parsing a CSV file in PySpark using Spark built-in functions or methods

滥情空心 2021-01-20 22:50

I am using Spark 2.3 and working on a proof of concept in which I have to load a set of CSV files into a Spark DataFrame.

Considering below csv as a sample which I need t

2 answers
  •  无人共我
    2021-01-20 23:20

    You can try something like this, without UDFs:

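    import org.apache.spark.sql.functions.concat  // used for the isCorrupted flag further down
    import spark.implicits._                      // enables the 'column symbol syntax

    // No schema is given, so every column is read as a string and the checks see the raw text.
    val data = spark.read.option("header", "true").csv("data/yourdata.csv")
    val data2 = data.select('id,
    // isINT: the double -> int -> string round trip must reproduce the original text.
    ('age.cast("double")
      .cast("int")
      .cast("string")
      .equalTo('age) && 'age.cast("int").isNotNull )
      .equalTo("true")
      .as("isINT"),
    // isDATE: the cast yields null for anything that is not a valid date.
    'loaded_date.cast("date").isNotNull.as("isDATE"),
    // isCHAR: the value must be non-null and must not be castable to an integer.
    ('sex.cast("int").isNotNull || 'sex.isNull).notEqual("true").as("isCHAR"))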
    
    data2.show()
    +---+-----+------+------+
    | id|isINT|isDATE|isCHAR|
    +---+-----+------+------+
    |  1| true|  true|  true|
    |  2| true|  true|  true|
    |  3| true|  true|  true|
    |  4| true| false|  true|
    |  5| true| false| false|
    |  6| true| false|  true|
    |  7| true|  true| false|
    |  8|false|  true|  true|
    |  9|false|  true|  true|
    | 10| true| false|  true|
    +---+-----+------+------+
    
    // A row is corrupted if any flag column (every column after id) is false.
    val corrupted = data2.select('id,
      concat(data2.columns.map(data2(_)).drop(1): _*).contains("false").as("isCorrupted"))
    corrupted.show()
    
    +---+-----------+
    | id|isCorrupted|
    +---+-----------+
    |  1|      false|
    |  2|      false|
    |  3|      false|
    |  4|       true|
    |  5|       true|
    |  6|       true|
    |  7|       true|
    |  8|       true|
    |  9|       true|
    | 10|       true|
    +---+-----------+
    
    data.join(corrupted,"id").show()
    
    +---+----+---+-----------+----+-----------+
    | id|name|age|loaded_date| sex|isCorrupted|
    +---+----+---+-----------+----+-----------+
    |  1| ABC| 32| 2019-09-11|   M|      false|
    |  2|null| 33| 2019-09-11|   M|      false|
    |  3| XYZ| 35| 2019-08-11|   M|      false|
    |  4| PQR| 32| 2019-30-10|   M|       true|
    |  5| EFG| 32|       null|null|       true|
    |  6| DEF| 32| 2019/09/11|   M|       true|
    |  7| XYZ| 32| 2017-01-01|   9|       true|
    |  8| KLM| XX| 2017-01-01|   F|       true|
    |  9| ABC|3.2| 2019-10-10|   M|       true|
    | 10| ABC| 32| 2019-02-29|   M|       true|
    +---+----+---+-----------+----+-----------+
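
To make the three checks concrete outside Spark, here is a plain-Python sketch of the same per-row rules, applied to the sample rows from the output above. The helper names (`is_int`, `is_date`, `is_char`, `is_corrupted`) are made up for illustration, and Python's `int()` and `datetime.strptime` only approximate Spark's cast semantics, so treat it as a way to reason about the rules rather than a replacement for the Spark code.

```python
from datetime import datetime

# Sample rows copied from the joined output above: (id, name, age, loaded_date, sex).
# None stands in for Spark's null.
ROWS = [
    (1, "ABC", "32", "2019-09-11", "M"),
    (2, None, "33", "2019-09-11", "M"),
    (3, "XYZ", "35", "2019-08-11", "M"),
    (4, "PQR", "32", "2019-30-10", "M"),
    (5, "EFG", "32", None, None),
    (6, "DEF", "32", "2019/09/11", "M"),
    (7, "XYZ", "32", "2017-01-01", "9"),
    (8, "KLM", "XX", "2017-01-01", "F"),
    (9, "ABC", "3.2", "2019-10-10", "M"),
    (10, "ABC", "32", "2019-02-29", "M"),
]


def is_int(value):
    """True if the text round-trips through int() unchanged (mirrors the isINT check)."""
    if value is None:
        return False
    try:
        return str(int(value)) == value
    except ValueError:
        return False


def is_date(value):
    """True if the text is a valid yyyy-MM-dd calendar date (approximates the isDATE check)."""
    if value is None:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def is_char(value):
    """True if the text is non-null and does not parse as an integer (mirrors isCHAR)."""
    if value is None:
        return False
    try:
        int(value)
    except ValueError:
        return True
    return False


def is_corrupted(row):
    """A row is corrupted when any of the three column checks fails."""
    _id, _name, age, loaded_date, sex = row
    return not (is_int(age) and is_date(loaded_date) and is_char(sex))


if __name__ == "__main__":
    for row in ROWS:
        print(row[0], is_corrupted(row))
```

On these sample rows the sketch marks ids 1 to 3 as clean and ids 4 to 10 as corrupted, in line with the isCorrupted column shown above.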
    
