How to validate a date format in a DataFrame column in Spark Scala

栀梦 2021-01-20 05:56

I have a dataframe with one DateTime column and many other columns.

All I want to do is parse this DateTime column value and check whether the format is "yyyy-MM-dd HH:mm:ss".

3 Answers
  • 2021-01-20 06:23

    Here we define a function that checks whether a String is compatible with your format requirement, and we partition the list into compatible and incompatible pieces. The types are written with fully qualified package names, but in real code you should use import statements, of course.

    val fmt = "yyyy-MM-dd HH:mm:ss"
    val formatter = java.time.format.DateTimeFormatter.ofPattern(fmt)

    // True if the string parses as a LocalDateTime in the expected format
    def isCompatible(s: String): Boolean = try {
      java.time.LocalDateTime.parse(s, formatter)
      true
    } catch {
      case _: java.time.format.DateTimeParseException => false
    }

    val dts = Seq("2016-11-07 15:16:17", "2016-11-07 24:25:26")
    val (valid, invalid) = dts.partition(isCompatible)
    println((valid, invalid))
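
    Since the question is about a DataFrame column, a minimal sketch of lifting isCompatible into a UDF and filtering on it follows. The DataFrame name data and the column name "ts" are hypothetical; adjust them to your schema.

    import org.apache.spark.sql.functions.{col, udf}

    // Hypothetical DataFrame `data` with a string column "ts"
    val isCompatibleUdf = udf((s: String) => s != null && isCompatible(s))
    val validRows   = data.filter(isCompatibleUdf(col("ts")))
    val invalidRows = data.filter(!isCompatibleUdf(col("ts")))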
    
  • 2021-01-20 06:29

    You can use filter() to get the valid and invalid records in the DataFrame. This code could be made more idiomatic from a Scala point of view.

    import org.apache.spark.sql.{Row, SparkSession}

    val DATE_TIME_FORMAT = "yyyy-MM-dd HH:mm:ss"
    // Build the formatter once instead of once per row
    val formatter = java.time.format.DateTimeFormatter.ofPattern(DATE_TIME_FORMAT)

    def validateDf(row: Row): Boolean = try {
      // Assumes row.getString(1) gives the DateTime string
      java.time.LocalDateTime.parse(row.getString(1), formatter)
      true
    } catch {
      case ex: java.time.format.DateTimeParseException =>
        // Handle the exception if you want
        false
    }

    val session = SparkSession.builder
      .appName("Validate Dataframe")
      .getOrCreate()

    val df = session. .... // read from any data source

    val validDf   = df.filter(validateDf(_)) // typed filter on Dataset[Row]
    val invalidDf = df.except(validDf)       // except() is a built-in Dataset method
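
    Alternatively, Spark's built-in to_timestamp returns null for strings that do not match the pattern, which avoids throwing an exception per row. A sketch of that approach, assuming a hypothetical DateTime column named "event_time":

    import org.apache.spark.sql.functions.{col, to_timestamp}

    // to_timestamp yields null where the string does not match DATE_TIME_FORMAT
    val parsed    = df.withColumn("parsed", to_timestamp(col("event_time"), DATE_TIME_FORMAT))
    val validDf   = parsed.filter(col("parsed").isNotNull).drop("parsed")
    val invalidDf = parsed.filter(col("parsed").isNull).drop("parsed")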
    
  • 2021-01-20 06:33

    Use option("dateFormat", "MM/dd/yyyy") to validate the date field while reading the DataFrame. On its own, the default PERMISSIVE mode sets unparseable fields to null; add option("mode", "DROPMALFORMED") to discard the invalid rows.

    // `schema` is your user-defined StructType containing a DateType column
    val df = spark.read.format("csv")
      .option("header", "false")
      .option("dateFormat", "MM/dd/yyyy")
      .option("mode", "DROPMALFORMED") // drop rows that fail to parse
      .schema(schema)
      .load("D:/cca175/data/emp.csv")
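
    For completeness, here is a hypothetical schema for emp.csv with a DateType column that the dateFormat option would apply to; the column names are made up:

    import org.apache.spark.sql.types._

    // Hypothetical layout; adjust to your actual file
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = true),
      StructField("name", StringType, nullable = true),
      StructField("hire_date", DateType, nullable = true)
    ))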
    