Better way to convert a string field into timestamp in Spark


I have a CSV in which a field is a datetime in a specific format. I cannot import it directly into my DataFrame because it needs to be a timestamp. So I import it as a string and convert it into a Timestamp afterwards. Is there a better way to do this?

7 Answers
  • 2020-11-27 17:02

    I would use https://github.com/databricks/spark-csv.

    With schema inference enabled, it will infer timestamps for you:

    import com.databricks.spark.csv._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.DataFrame

    val rdd: RDD[String] = sc.textFile("csvfile.csv")

    // Infer column types (timestamps included) and drop rows that fail to parse
    val df: DataFrame = new CsvParser().withDelimiter('|')
          .withInferSchema(true)
          .withParseMode("DROPMALFORMED")
          .csvRdd(sqlContext, rdd)
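
    For later spark-csv releases, the DataFrame reader API is the usual entry point; a minimal sketch with the same placeholder path and options (spark-csv 1.x on Spark 1.4+):

    // Same behavior through the data-source API
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "|")
      .option("inferSchema", "true")   // infer column types, including timestamps
      .option("mode", "DROPMALFORMED") // drop rows that fail to parse
      .load("csvfile.csv")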
    
  • 2020-11-27 17:04

    Spark >= 2.2

    Since 2.2 you can provide the format string directly:

    import org.apache.spark.sql.functions.to_timestamp
    
    val ts = to_timestamp($"dts", "MM/dd/yyyy HH:mm:ss")
    
    df.withColumn("ts", ts).show(2, false)
    
    // +---+-------------------+-------------------+
    // |id |dts                |ts                 |
    // +---+-------------------+-------------------+
    // |1  |05/26/2016 01:01:01|2016-05-26 01:01:01|
    // |2  |#$@#@#             |null               |
    // +---+-------------------+-------------------+
    

    Spark >= 1.6, < 2.2

    You can use the date processing functions introduced in Spark 1.5. Assuming you have the following data:

    val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$@#@#")).toDF("id", "dts")
    

    You can use unix_timestamp to parse strings and cast the result to timestamp:

    import org.apache.spark.sql.functions.unix_timestamp
    
    val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
    
    df.withColumn("ts", ts).show(2, false)
    
    // +---+-------------------+---------------------+
    // |id |dts                |ts                   |
    // +---+-------------------+---------------------+
    // |1  |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
    // |2  |#$@#@#             |null                 |
    // +---+-------------------+---------------------+
    

    As you can see, this covers both parsing and error handling: malformed input simply yields null. The format string should be compatible with Java SimpleDateFormat.

    Spark >= 1.5, < 1.6

    You'll have to use something like this:

    unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")
    

    or

    (unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss") * 1000).cast("timestamp")
    

    due to SPARK-11724.
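
    Either expression is applied the same way as above, e.g.:

    // Same df as before; the intermediate double cast gives correct
    // second-based semantics despite SPARK-11724
    val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")
    df.withColumn("ts", ts).show(2, false)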

    Spark < 1.5

    You should be able to use these with expr and a HiveContext; a sketch follows.
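
    A minimal sketch of that approach, under stated assumptions: sqlContext must be a HiveContext (so that Hive's unix_timestamp(string, pattern) is available), and selectExpr stands in here for expr, which only arrived in 1.5:

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$@#@#")).toDF("id", "dts")

    // Hive's unix_timestamp(string, pattern) does the parsing; the intermediate
    // cast to double avoids the integer-to-timestamp issue from SPARK-11724
    val parsed = df.selectExpr(
      "id", "dts",
      "CAST(CAST(unix_timestamp(dts, 'MM/dd/yyyy HH:mm:ss') AS DOUBLE) AS TIMESTAMP) AS ts")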

  • 2020-11-27 17:05

    I have ISO 8601 timestamps in my dataset and I needed to convert them to "yyyy-MM-dd" format. This is what I did:

    import org.joda.time.{DateTime, DateTimeZone}

    object DateUtils extends Serializable {
      def dtFromUtcSeconds(seconds: Int): DateTime = new DateTime(seconds * 1000L, DateTimeZone.UTC)
      def dtFromIso8601(isoString: String): DateTime = new DateTime(isoString, DateTimeZone.UTC)
    }

    // Register a UDF that reformats an ISO 8601 string as yyyy-MM-dd
    sqlContext.udf.register("formatTimeStamp", (isoTimestamp: String) => DateUtils.dtFromIso8601(isoTimestamp).toString("yyyy-MM-dd"))
    

    You can then use the UDF directly in your Spark SQL queries.
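
    For example, assuming the DataFrame is registered as a temp table named events with a string column event_time (both names are hypothetical):

    // `events` and `event_time` are made-up names for illustration
    df.registerTempTable("events")
    val dates = sqlContext.sql(
      "SELECT formatTimeStamp(event_time) AS event_date FROM events")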

  • 2020-11-27 17:07

    Spark version: 2.4.4. A plain cast is enough when the string is already in the default yyyy-MM-dd HH:mm:ss format:

    scala> import org.apache.spark.sql.types.TimestampType
    import org.apache.spark.sql.types.TimestampType
    
    scala> val df = Seq("2019-04-01 08:28:00").toDF("ts")
    df: org.apache.spark.sql.DataFrame = [ts: string]
    
    scala> val df_mod = df.select($"ts".cast(TimestampType))
    df_mod: org.apache.spark.sql.DataFrame = [ts: timestamp]
    
    scala> df_mod.printSchema()
    root
     |-- ts: timestamp (nullable = true)
    
  • 2020-11-27 17:08

    I would like to move the getTimestamp method you wrote into the RDD's mapPartitions and reuse a GenericMutableRow across the rows in an iterator (a sketch of such a helper follows the code):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    // Catalyst internal API; this import path is for Spark 1.x
    import org.apache.spark.sql.catalyst.expressions.GenericMutableRow

    val strRdd = sc.textFile("hdfs://path/to/csv-file")
    val rowRdd: RDD[Row] = strRdd.map(_.split('\t')).mapPartitions { iter =>
      new Iterator[Row] {
        // One mutable row reused for every record in the partition
        val row = new GenericMutableRow(4)
        var current: Array[String] = _
    
        def hasNext = iter.hasNext
        def next() = {
          current = iter.next()
          row(0) = current(0)
          row(1) = current(1)
          row(2) = current(2)
    
          val ts = getTimestamp(current(3))
          if(ts != null) {
            row.update(3, ts)
          } else {
            row.setNullAt(3)
          }
          row
        }
      }
    }
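
    For reference, a minimal getTimestamp along the lines this answer assumes — a sketch, not the asker's original code — parsing with SimpleDateFormat and returning null on failure so the row can be nulled out:

    import java.sql.Timestamp
    import java.text.SimpleDateFormat

    // Assumed helper: parse "MM/dd/yyyy HH:mm:ss"; null signals a parse failure
    def getTimestamp(s: String): Timestamp = {
      val format = new SimpleDateFormat("MM/dd/yyyy HH:mm:ss")
      try new Timestamp(format.parse(s).getTime)
      catch { case _: java.text.ParseException => null }
    }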
    

    And you should still use a schema to generate the DataFrame:

    val df = sqlContext.createDataFrame(rowRdd, tableSchema)
    

    Examples of using GenericMutableRow inside an iterator implementation can be found in the Aggregate operator, InMemoryColumnarTableScan, ParquetTableOperations, etc.

  • 2020-11-27 17:20

    I had some issues with to_timestamp returning an empty string. After a lot of trial and error, I was able to work around it by casting to a timestamp and then casting back to a string. I hope this helps anyone else with the same issue:

    import org.apache.spark.sql.functions.to_timestamp

    // `cols` holds the names of the columns to convert (see the sketch below)
    df.columns.intersect(cols).foldLeft(df)((newDf, col) => {
      val conversionFunc = to_timestamp(newDf(col).cast("timestamp"), "MM/dd/yyyy HH:mm:ss").cast("string")
      newDf.withColumn(col, conversionFunc)
    })
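
    A hypothetical usage, with cols (left undefined in the answer) as the list of column names to convert:

    import org.apache.spark.sql.functions.to_timestamp

    val cols = Seq("start_time", "end_time") // hypothetical column names

    val converted = df.columns.intersect(cols).foldLeft(df)((newDf, col) =>
      newDf.withColumn(col,
        to_timestamp(newDf(col).cast("timestamp"), "MM/dd/yyyy HH:mm:ss").cast("string")))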
    