Cast column containing multiple string date formats to DateTime in Spark

前端 未结 2 1887
野趣味
野趣味 2020-11-28 15:26

I have a date column in my Spark DataDrame that contains multiple string formats. I would like to cast these to DateTime.

The two formats in my column a

相关标签:
2条回答
  • 2020-11-28 15:49

    You can do this in 100% sql with something like this:

    create database delete_me;
    use delete_me;
    create table test (enc_date string);
    
    insert into test values ('10/28/2019');
    insert into test values ('2020-03-31 00:00:00.000');
    insert into test values ('2019-10-18');
    insert into test values ('gobledie-gook');
    insert into test values ('');
    insert into test values (null);
    insert into test values ('NULL');
    
    -- you might need the following line depending on your version of spark
    -- set spark.sql.legacy.timeParserPolicy = LEGACY;
    select enc_date, coalesce(to_date(enc_date, "yyyy-MM-dd"), to_date(enc_date, "MM/dd/yyyy")) as date from test;
    
    
    enc_date                    date
    --------                    ----
    2020-03-31 00:00:00.000     2020-03-31
    2019-10-18                  2019-10-18
    null                        null
    10/28/2019                  2019-10-28
    gobledie-gook               null
    NULL                        null
                                null
    
    0 讨论(0)
  • 2020-11-28 15:53

    Personally I would recommend using SQL functions directly without expensive and inefficient reformatting:

    from pyspark.sql.functions import coalesce, to_date
    
    def to_date_(col, formats=("MM/dd/yyyy", "yyyy-MM-dd")):
        # Spark 2.2 or later syntax, for < 2.2 use unix_timestamp and cast
        return coalesce(*[to_date(col, f) for f in formats])
    

    This will choose the first format, which can successfully parse input string.

    Usage:

    df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-01")], ("id", "dt"))
    df.withColumn("pdt", to_date_("dt")).show()
    
    +---+----------+----------+
    | id|        dt|       pdt|
    +---+----------+----------+
    |  1|01/22/2010|2010-01-22|
    |  2|2018-12-01|2018-12-01|
    +---+----------+----------+
    

    It will be faster than udf, and adding new formats is just a matter of adjusting formats parameter.

    However it won't help you with format ambiguities. In general case it might not be possible to do it without manual intervention and cross referencing with external data.

    The same thing can be of course done in Scala:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{coalesce, to_date}
    
    def to_date_(col: Column, 
                 formats: Seq[String] = Seq("MM/dd/yyyy", "yyyy-MM-dd")) = {
      coalesce(formats.map(f => to_date(col, f)): _*)
    }
    
    0 讨论(0)
提交回复
热议问题