How to validate date format in a dataframe column in spark scala

栀梦 2021-01-20 05:56

I have a dataframe with one DateTime column and many other columns.

All I want to do is parse the value of this DateTime column and check whether the format is "yyyy-MM

3 Answers
  • 2021-01-20 06:23

    Here we define a function that checks whether a String matches your format requirements, and we use it to partition the list into compatible and incompatible values. The types are shown with full package names for clarity; in real code you would use import statements instead.

    val fmt = "yyyy-MM-dd HH:mm:ss"
    val df = java.time.format.DateTimeFormatter.ofPattern(fmt)
    def isCompatible(s: String) = try {
      java.time.LocalDateTime.parse(s, df)
      true
    } catch {
      case e: java.time.format.DateTimeParseException => false
    }
    val dts = Seq("2016-11-07 15:16:17", "2016-11-07 24:25:26")
    val yesNo = dts.partition { s => isCompatible(s) }
    println(yesNo)
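
    One caveat with the check above: a formatter built with ofPattern uses the default SMART resolver, which quietly adjusts some impossible dates (for example, February 30 is clamped to the last valid day of the month) instead of rejecting them. If that matters for your data, a stricter variant might look like the sketch below. The names StrictDateCheck and isStrictlyCompatible are made up for illustration, and the explicit null check is an extra assumption about the input.

    ```scala
    import java.time.LocalDateTime
    import java.time.format.{DateTimeFormatter, ResolverStyle}
    import scala.util.Try

    // Hypothetical helper, not part of the answer above.
    object StrictDateCheck {
      // In STRICT mode "yyyy" (year-of-era) cannot be resolved without an era,
      // so the pattern uses "uuuu" (proleptic year) instead.
      private val strictFormatter: DateTimeFormatter =
        DateTimeFormatter
          .ofPattern("uuuu-MM-dd HH:mm:ss")
          .withResolverStyle(ResolverStyle.STRICT)

      // True only when the whole string parses to a real calendar date and time.
      def isStrictlyCompatible(s: String): Boolean =
        s != null && Try(LocalDateTime.parse(s, strictFormatter)).isSuccess
    }
    ```

    It can be dropped into the partition call in place of isCompatible, e.g. dts.partition(StrictDateCheck.isStrictlyCompatible).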
    
  • 2021-01-20 06:29

    You can use filter() to get the valid and invalid records in the dataframe. This code could probably be improved from a Scala point of view.

      import org.apache.spark.sql.{Row, SparkSession}

      val DATE_TIME_FORMAT = "yyyy-MM-dd HH:mm:ss"
    
      def validateDf(row: Row): Boolean = try {
        // Assumes the datetime string is the column at index 1, i.e. row.getString(1)
        java.time.LocalDateTime.parse(row.getString(1), java.time.format.DateTimeFormatter.ofPattern(DATE_TIME_FORMAT))
        true
      } catch {
        case ex: java.time.format.DateTimeParseException => {
          // Handle exception if you want
          false
        }
      }
    
    
    
    val session = SparkSession.builder
      .appName("Validate Dataframe")
      .getOrCreate
    
    val df = session. .... //Read from any datasource
    
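    import session.implicits._ // not needed for except(), which is a Dataset method; only for encoders and $"col" syntax
    
    val validDf = df.filter(validateDf(_))
    val inValidDf = df.except(validDf) // except() is EXCEPT DISTINCT, so duplicate invalid rows are collapsed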
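
    Two things to watch in validateDf: it builds a new formatter for every row, and a null in the column makes LocalDateTime.parse throw a NullPointerException, which the catch clause does not handle. A minimal sketch of a null-safe check that shares one formatter is shown below. DateTimeValidator, check and isValid are hypothetical names, and reporting the failure reason through Either is just one possible design.

    ```scala
    import java.time.LocalDateTime
    import java.time.format.{DateTimeFormatter, DateTimeParseException}

    // Hypothetical helper, not part of the answer above.
    object DateTimeValidator {
      val DateTimeFormat = "yyyy-MM-dd HH:mm:ss"

      // DateTimeFormatter is immutable and thread-safe, so one shared instance is enough.
      private val formatter: DateTimeFormatter = DateTimeFormatter.ofPattern(DateTimeFormat)

      // Right(parsed value) on success, Left(reason) for null or unparseable input.
      def check(value: String): Either[String, LocalDateTime] =
        if (value == null) Left("value is null")
        else
          try Right(LocalDateTime.parse(value, formatter))
          catch { case e: DateTimeParseException => Left(e.getMessage) }

      def isValid(value: String): Boolean = check(value).isRight
    }
    ```

    With Spark on the classpath, the body of validateDf could then be reduced to something like DateTimeValidator.isValid(row.getString(1)), still assuming the datetime string sits at index 1.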
    
  • 2021-01-20 06:33

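    Use option("dateFormat", "MM/dd/yyyy") to validate a date field while reading the dataframe. The column has to be declared as DateType in the schema for the format to apply. Note that invalid rows are discarded only if the reader also runs with option("mode", "DROPMALFORMED"); in the default PERMISSIVE mode an unparseable value is read as null instead.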

    val df = spark.read.format("csv")
      .option("header", "false")
      .option("dateFormat", "MM/dd/yyyy")
      .schema(schema)
      .load("D:/cca175/data/emp.csv")
    