Parsing a CSV file in PySpark using Spark's built-in functions or methods

滥情空心 2021-01-20 22:50

I am using Spark 2.3 and working on a proof of concept in which I have to load a batch of CSV files into a Spark DataFrame.

Considering below csv as a sample which I need t

2 Answers
  • 2021-01-20 23:20

    As requested by the OP, here is the answer in PySpark.

    First, load the data without specifying a schema, as @AndrzejS also does:

    df = spark.read.option("header", "true").csv("data/yourdata.csv")
    df.show()
    +---+----+---+-----------+----+
    | id|name|age|loaded_date| sex|
    +---+----+---+-----------+----+
    |  1| ABC| 32| 2019-09-11|   M|
    |  2|null| 33| 2019-09-11|   M|
    |  3| XYZ| 35| 2019-08-11|   M|
    |  4| PQR| 32| 2019-30-10|   M|
    |  5| EFG| 32|       null|null|
    |  6| DEF| 32| 2019/09/11|   M|
    |  7| XYZ| 32| 2017-01-01|   9|
    |  8| KLM| XX| 2017-01-01|   F|
    |  9| ABC|3.2| 2019-10-10|   M|
    | 10| ABC| 32| 2019-02-29|   M|
    +---+----+---+-----------+----+
    

    Next, we need to determine which values do not fit their column. For example, XX or 3.2 cannot be an age, so these values need to be marked as null. We test whether age is an integer, whether loaded_date is a valid date, and finally whether sex is either F or M. Please refer to my previous post on these tests.

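    from pyspark.sql.functions import col

    # ageInt is true only when age casts to an int without losing a fraction;
    # loaded_dateDate is true only when loaded_date casts to a valid date.
    df = df.select('id','name',
                   'age', (col('age').cast('int').isNotNull() & (col('age').cast('int') - col('age') == 0)).alias('ageInt'),
                   'loaded_date',(col('loaded_date').cast('date').isNotNull()).alias('loaded_dateDate'),
                   'sex'
                  )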
    df.show()
    +---+----+---+------+-----------+---------------+----+
    | id|name|age|ageInt|loaded_date|loaded_dateDate| sex|
    +---+----+---+------+-----------+---------------+----+
    |  1| ABC| 32|  true| 2019-09-11|           true|   M|
    |  2|null| 33|  true| 2019-09-11|           true|   M|
    |  3| XYZ| 35|  true| 2019-08-11|           true|   M|
    |  4| PQR| 32|  true| 2019-30-10|          false|   M|
    |  5| EFG| 32|  true|       null|          false|null|
    |  6| DEF| 32|  true| 2019/09/11|          false|   M|
    |  7| XYZ| 32|  true| 2017-01-01|           true|   9|
    |  8| KLM| XX| false| 2017-01-01|           true|   F|
    |  9| ABC|3.2| false| 2019-10-10|           true|   M|
    | 10| ABC| 32|  true| 2019-02-29|          false|   M|
    +---+----+---+------+-----------+---------------+----+
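
To see why each flag comes out the way it does, the three rules can be mirrored in plain Python. This is only an illustrative sketch: the helper names `is_int_like`, `is_iso_date`, and `is_valid_sex` are hypothetical, and the functions approximate the intent of the checks rather than reproduce Spark's cast semantics, which may treat some edge-case inputs differently.

```python
from datetime import datetime


def is_int_like(value):
    """True when the text parses as a number with no fractional part."""
    try:
        number = float(value)
        return number == int(number)
    except (TypeError, ValueError, OverflowError):
        # None, non-numeric text such as 'XX', NaN and infinity all end up here.
        return False


def is_iso_date(value):
    """True when the text is a real calendar date written as year-month-day."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        # None, wrong separators, month 30, and 29 February in a non-leap year.
        return False


def is_valid_sex(value):
    """True only for the two accepted codes."""
    return value in ("M", "F")


# The sample rows from the question, every field kept as text.
rows = [
    ("1", "ABC", "32", "2019-09-11", "M"),
    ("2", None, "33", "2019-09-11", "M"),
    ("3", "XYZ", "35", "2019-08-11", "M"),
    ("4", "PQR", "32", "2019-30-10", "M"),
    ("5", "EFG", "32", None, None),
    ("6", "DEF", "32", "2019/09/11", "M"),
    ("7", "XYZ", "32", "2017-01-01", "9"),
    ("8", "KLM", "XX", "2017-01-01", "F"),
    ("9", "ABC", "3.2", "2019-10-10", "M"),
    ("10", "ABC", "32", "2019-02-29", "M"),
]

# Map each id to (age ok, loaded_date ok, sex ok).
flags = {
    row_id: (is_int_like(age), is_iso_date(loaded_date), is_valid_sex(sex))
    for row_id, _name, age, loaded_date, sex in rows
}

for row_id, checks in flags.items():
    print(row_id, checks)
```

Row 10 is the least obvious case: 2019 is not a leap year, so 2019-02-29 is well-formed text but not a real date.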
    

    Finally, use if/else, which in PySpark is when/otherwise, to replace the invalid values with null.

    from pyspark.sql.functions import col, when

    # Keep a value only when its check passed; otherwise(None) writes null.
    df = df.withColumn('age',when(col('ageInt'),col('age')).otherwise(None))\
           .withColumn('loaded_date',when(col('loaded_dateDate'),col('loaded_date')).otherwise(None))\
           .withColumn('sex',when(col('sex').isin('M','F'),col('sex')).otherwise(None))\
           .drop('ageInt','loaded_dateDate')
    df.show()
    +---+----+----+-----------+----+
    | id|name| age|loaded_date| sex|
    +---+----+----+-----------+----+
    |  1| ABC|  32| 2019-09-11|   M|
    |  2|null|  33| 2019-09-11|   M|
    |  3| XYZ|  35| 2019-08-11|   M|
    |  4| PQR|  32|       null|   M|
    |  5| EFG|  32|       null|null|
    |  6| DEF|  32|       null|   M|
    |  7| XYZ|  32| 2017-01-01|null|
    |  8| KLM|null| 2017-01-01|   F|
    |  9| ABC|null| 2019-10-10|   M|
    | 10| ABC|  32|       null|   M|
    +---+----+----+-----------+----+
    
  • 2021-01-20 23:20

    You can try something like this in Scala, without UDFs:

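    import spark.implicits._  // enables the 'columnName symbol syntax

    val data = spark.read.option("header", "true").csv("data/yourdata.csv")
    val data2 = data.select('id,
    // isINT: age survives a double -> int -> string round trip unchanged
    ('age.cast("double")
      .cast("int")
      .cast("string")
      .equalTo('age) && 'age.cast("int").isNotNull )
      .equalTo("true")
      .as("isINT"),
    // isDATE: loaded_date casts to a valid date
    'loaded_date.cast("date").isNotNull.as("isDATE"),
    // isCHAR: sex is neither null nor numeric
    ('sex.cast("int").isNotNull || 'sex.isNull).notEqual("true").as("isCHAR"))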
    
    data2.show()
    +---+-----+------+------+
    | id|isINT|isDATE|isCHAR|
    +---+-----+------+------+
    |  1| true|  true|  true|
    |  2| true|  true|  true|
    |  3| true|  true|  true|
    |  4| true| false|  true|
    |  5| true| false| false|
    |  6| true| false|  true|
    |  7| true|  true| false|
    |  8|false|  true|  true|
    |  9|false|  true|  true|
    | 10| true| false|  true|
    +---+-----+------+------+
    
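    import org.apache.spark.sql.functions.concat

    // A row is corrupted if any of its check columns (everything after id) is false.
    val corrupted = data2.select('id,
      concat(data2.columns.map(data2(_)).drop(1):_*).contains("false").as("isCorrupted")
    )
    corrupted.show()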
    
    +---+-----------+
    | id|isCorrupted|
    +---+-----------+
    |  1|      false|
    |  2|      false|
    |  3|      false|
    |  4|       true|
    |  5|       true|
    |  6|       true|
    |  7|       true|
    |  8|       true|
    |  9|       true|
    | 10|       true|
    +---+-----------+
    
    data.join(corrupted,"id").show()
    
    +---+----+---+-----------+----+-----------+
    | id|name|age|loaded_date| sex|isCorrupted|
    +---+----+---+-----------+----+-----------+
    |  1| ABC| 32| 2019-09-11|   M|      false|
    |  2|null| 33| 2019-09-11|   M|      false|
    |  3| XYZ| 35| 2019-08-11|   M|      false|
    |  4| PQR| 32| 2019-30-10|   M|       true|
    |  5| EFG| 32|       null|null|       true|
    |  6| DEF| 32| 2019/09/11|   M|       true|
    |  7| XYZ| 32| 2017-01-01|   9|       true|
    |  8| KLM| XX| 2017-01-01|   F|       true|
    |  9| ABC|3.2| 2019-10-10|   M|       true|
    | 10| ABC| 32| 2019-02-29|   M|       true|
    +---+----+---+-----------+----+-----------+
    