I'm trying to get the unix time from a timestamp field in milliseconds (13 digits), but currently it returns in seconds (10 digits).
Implementing the approach suggested in Dao Thi's answer:
import pyspark.sql.functions as F
df = spark.createDataFrame([('22-Jul-2018 04:21:18.792 UTC', ),('23-Jul-2018 04:21:25.888 UTC',)], ['TIME'])
df.show(2,False)
df.printSchema()
Output:
+----------------------------+
|TIME                        |
+----------------------------+
|22-Jul-2018 04:21:18.792 UTC|
|23-Jul-2018 04:21:25.888 UTC|
+----------------------------+
root
|-- TIME: string (nullable = true)
Converting the string time format (including milliseconds) to unix_timestamp (double). Since unix_timestamp() drops the milliseconds, they are extracted from the string using the substring method (start_position = -7, length_of_substring = 3) and added to the unix timestamp separately (the substring is cast to float and divided by 1000 before adding):
df1 = df.withColumn("unix_timestamp",F.unix_timestamp(df.TIME,'dd-MMM-yyyy HH:mm:ss.SSS z') + F.substring(df.TIME,-7,3).cast('float')/1000)
Converting the unix_timestamp (double) to the timestamp data type in Spark:
df2 = df1.withColumn("TimestampType",F.to_timestamp(df1["unix_timestamp"]))
df2.show(n=2,truncate=False)
This will give you the following output:
+----------------------------+----------------+-----------------------+
|TIME                        |unix_timestamp  |TimestampType          |
+----------------------------+----------------+-----------------------+
|22-Jul-2018 04:21:18.792 UTC|1.532233278792E9|2018-07-22 04:21:18.792|
|23-Jul-2018 04:21:25.888 UTC|1.532319685888E9|2018-07-23 04:21:25.888|
+----------------------------+----------------+-----------------------+
Checking the schema:
df2.printSchema()
root
|-- TIME: string (nullable = true)
|-- unix_timestamp: double (nullable = true)
|-- TimestampType: timestamp (nullable = true)
Up to Spark version 3.0.1 it is not possible to convert a timestamp into unix time in milliseconds using the SQL built-in function unix_timestamp.
According to the code on Spark's DateTimeUtils:
"Timestamps are exposed externally as java.sql.Timestamp and are stored internally as longs, which are capable of storing timestamps with microsecond precision."
Therefore, if you define a UDF that takes a java.sql.Timestamp as input, you can call getTime on it to get a Long in milliseconds. If you apply unix_timestamp, you will only get unix time with second precision.
val tsConversionToLongUdf = udf((ts: java.sql.Timestamp) => ts.getTime)
Applying this to a variety of Timestamps:
val df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.111", "2017-01-18 11:00:00.110", "2017-01-18 11:00:00.100")
.toDF("timestampString")
.withColumn("timestamp", to_timestamp(col("timestampString")))
.withColumn("timestampConversionToLong", tsConversionToLongUdf(col("timestamp")))
.withColumn("timestampUnixTimestamp", unix_timestamp(col("timestamp")))
df.printSchema()
df.show(false)
// returns
root
|-- timestampString: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampConversionToLong: long (nullable = false)
|-- timestampUnixTimestamp: long (nullable = true)
+-----------------------+-----------------------+-------------------------+----------------------+
|timestampString        |timestamp              |timestampConversionToLong|timestampUnixTimestamp|
+-----------------------+-----------------------+-------------------------+----------------------+
|2017-01-18 11:00:00.000|2017-01-18 11:00:00    |1484733600000            |1484733600            |
|2017-01-18 11:00:00.111|2017-01-18 11:00:00.111|1484733600111            |1484733600            |
|2017-01-18 11:00:00.110|2017-01-18 11:00:00.11 |1484733600110            |1484733600            |
|2017-01-18 11:00:00.100|2017-01-18 11:00:00.1  |1484733600100            |1484733600            |
+-----------------------+-----------------------+-------------------------+----------------------+
unix_timestamp() returns the unix timestamp in seconds.
The last 3 digits of the millisecond value are the same as the last 3 digits of the timestamp string (1.999 sec = 1999 milliseconds), so just take the last 3 digits of the timestamp string and append them to the end of the seconds string.
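A minimal sketch of this idea in Scala, assuming a string column named ts whose values always end in exactly three fractional digits, with a matching format pattern (both the column name and the pattern are assumptions for illustration):
import org.apache.spark.sql.functions._

// Seconds since the epoch (10 digits) from the built-in function.
val withSeconds = df.withColumn("epoch_seconds",
  unix_timestamp(col("ts"), "yyyy-MM-dd HH:mm:ss.SSS"))

// Append the last 3 characters of the string (the milliseconds)
// to the seconds value to get a 13-digit millisecond long.
val withMillis = withSeconds.withColumn("epoch_millis",
  concat(col("epoch_seconds").cast("string"), substring(col("ts"), -3, 3)).cast("long"))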
Milliseconds hide in the fractional part of the timestamp format.
Try this:
df = df.withColumn("time_in_milliseconds", col("time").cast("double"))
You'll get something like 1484758800.792, where 792 is the milliseconds part.
At least it works for me (Scala, Spark, Hive).
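If you need the 13-digit long the question asks for, here is a small follow-up sketch (the time column name comes from the snippet above; the epoch_millis name and the rounding are assumptions of mine):
import org.apache.spark.sql.functions._

// Multiply the fractional seconds by 1000 and round before casting to long,
// so floating-point representation error cannot truncate the last digit.
val dfMillis = df.withColumn("epoch_millis",
  round(col("time").cast("double") * 1000).cast("long"))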