Can unix_timestamp() return unix time in milliseconds in Apache Spark?

情深已故 2021-02-12 20:17

I'm trying to get the unix time from a timestamp field in milliseconds (13 digits), but currently it returns in seconds (10 digits).

scala> var df = Seq("20

4 Answers
  • 2021-02-12 21:00

    Implementing the approach suggested in Dao Thi's answer:

    import pyspark.sql.functions as F
    df = spark.createDataFrame([('22-Jul-2018 04:21:18.792 UTC', ),('23-Jul-2018 04:21:25.888 UTC',)], ['TIME'])
    df.show(2,False)
    df.printSchema()
    

    Output:

    +----------------------------+
    |TIME                        |
    +----------------------------+
    |22-Jul-2018 04:21:18.792 UTC|
    |23-Jul-2018 04:21:25.888 UTC|
    +----------------------------+
    root
     |-- TIME: string (nullable = true)
    

    Converting the string time format (including milliseconds) to unix_timestamp (double): extract the milliseconds from the string using substring (start_position = -7, length_of_substring = 3) and add them separately to the unix_timestamp (cast the substring to float for the addition).

    df1 = df.withColumn("unix_timestamp",F.unix_timestamp(df.TIME,'dd-MMM-yyyy HH:mm:ss.SSS z') + F.substring(df.TIME,-7,3).cast('float')/1000)
    

    Converting the unix_timestamp (double) to the timestamp data type in Spark:

    df2 = df1.withColumn("TimestampType",F.to_timestamp(df1["unix_timestamp"]))
    df2.show(n=2,truncate=False)
    

    This will give you the following output:

    +----------------------------+----------------+-----------------------+
    |TIME                        |unix_timestamp  |TimestampType          |
    +----------------------------+----------------+-----------------------+
    |22-Jul-2018 04:21:18.792 UTC|1.532233278792E9|2018-07-22 04:21:18.792|
    |23-Jul-2018 04:21:25.888 UTC|1.532319685888E9|2018-07-23 04:21:25.888|
    +----------------------------+----------------+-----------------------+
    

    Checking the Schema:

    df2.printSchema()
    
    
    root
     |-- TIME: string (nullable = true)
     |-- unix_timestamp: double (nullable = true)
     |-- TimestampType: timestamp (nullable = true)
    
  • 2021-02-12 21:06

    Up to Spark version 3.0.1 it is not possible to convert a timestamp into unix time in milliseconds using the SQL built-in function unix_timestamp.

    According to the code comments in Spark's DateTimeUtils:

    "Timestamps are exposed externally as java.sql.Timestamp and are stored internally as longs, which are capable of storing timestamps with microsecond precision."

    Therefore, if you define a UDF that takes a java.sql.Timestamp as input, you can call getTime on it to get a Long in milliseconds. If you apply unix_timestamp you will only get unix time with second precision.

    val tsConversionToLongUdf = udf((ts: java.sql.Timestamp) => ts.getTime)
    

    Applying this to a variety of Timestamps:

    val df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.111", "2017-01-18 11:00:00.110", "2017-01-18 11:00:00.100")
      .toDF("timestampString")
      .withColumn("timestamp", to_timestamp(col("timestampString")))
      .withColumn("timestampConversionToLong", tsConversionToLongUdf(col("timestamp")))
      .withColumn("timestampUnixTimestamp", unix_timestamp(col("timestamp")))
    
    df.printSchema()
    df.show(false)
    
    // returns
    root
     |-- timestampString: string (nullable = true)
     |-- timestamp: timestamp (nullable = true)
     |-- timestampConversionToLong: long (nullable = false)
     |-- timestampUnixTimestamp: long (nullable = true)
    
    +-----------------------+-----------------------+-------------------------+----------------------+
    |timestampString        |timestamp              |timestampConversionToLong|timestampUnixTimestamp|
    +-----------------------+-----------------------+-------------------------+----------------------+
    |2017-01-18 11:00:00.000|2017-01-18 11:00:00    |1484733600000            |1484733600            |
    |2017-01-18 11:00:00.111|2017-01-18 11:00:00.111|1484733600111            |1484733600            |
    |2017-01-18 11:00:00.110|2017-01-18 11:00:00.11 |1484733600110            |1484733600            |
    |2017-01-18 11:00:00.100|2017-01-18 11:00:00.1  |1484733600100            |1484733600            |
    +-----------------------+-----------------------+-------------------------+----------------------+
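
    As a side note, newer Spark releases add this as a built-in. A minimal sketch, assuming Spark 3.1.0+ where the SQL function unix_millis is available:

    // Sketch, assuming Spark 3.1.0+: unix_millis truncates a timestamp to
    // milliseconds since the epoch, with no UDF needed.
    import org.apache.spark.sql.functions.{col, expr, to_timestamp}
    import spark.implicits._  // assumes a SparkSession named spark

    val dfMillis = Seq("2017-01-18 11:00:00.111")
      .toDF("timestampString")
      .withColumn("timestamp", to_timestamp(col("timestampString")))
      .withColumn("timestampMillis", expr("unix_millis(timestamp)"))
    // timestampMillis -> 1484733600111 (with the same session time zone as above)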
    
  • 2021-02-12 21:20

    unix_timestamp() returns the unix timestamp in seconds.

    The last 3 digits of the timestamp string are the milliseconds (1.999 sec = 1999 milliseconds), so just take the last 3 digits of the timestamp string and append them to the end of the seconds value from unix_timestamp, as sketched below.
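
    A minimal sketch of this concatenation approach; the column name ts and the format string are assumptions for illustration:

    // Sketch: build a 13-digit millisecond value by appending the last 3
    // digits of the timestamp string (the milliseconds) to the seconds
    // returned by unix_timestamp. Column "ts" and its format are assumed.
    import org.apache.spark.sql.functions.{col, concat, substring, unix_timestamp}

    val withMillis = df.withColumn(
      "unix_millis",
      concat(
        unix_timestamp(col("ts"), "yyyy-MM-dd HH:mm:ss.SSS").cast("string"),
        substring(col("ts"), -3, 3)
      ).cast("long")
    )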

  • 2021-02-12 21:21

    Milliseconds hide in the fractional part of the timestamp format.

    Try this:

    df = df.withColumn("time_in_milliseconds", col("time").cast("double"))
    

    You'll get something like 1484758800.792, where 792 is the milliseconds part.

    At least it works for me (Scala, Spark, Hive). If you need the 13-digit integer form, see the sketch below.
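
    Building on that cast, a minimal sketch for getting a 13-digit long (the column name time follows the snippet above; round guards against floating-point truncation):

    // Sketch: cast the timestamp to fractional seconds, scale to milliseconds,
    // and round before casting to long (plain truncation can lose 1 ms to
    // floating-point error, e.g. 1484758800.792 * 1000 = 1484758800791.9998).
    import org.apache.spark.sql.functions.{col, round}

    val dfMs = df.withColumn(
      "time_in_milliseconds",
      round(col("time").cast("double") * 1000).cast("long")
    )
    // e.g. 1484758800.792 -> 1484758800792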
