Add a new column containing a generated UUID to a DataFrame

醉酒成梦 2020-12-06 03:05

I want to add a new column to a DataFrame that holds a generated UUID for each row.

A UUID value looks something like 21534cf7-cff9-482a-a3a8-9e7244240da7

My Researc

3 answers
  • 2020-12-06 03:39

    You can use the built-in Spark SQL uuid function (available since Spark 2.3):

    .withColumn("uuid", expr("uuid()"))
    

    A full example in Scala:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    
    object CreateDf extends App {
    
      val spark = SparkSession.builder
        .master("local[*]")
        .appName("spark_local")
        .getOrCreate()
      import spark.implicits._
    
      Seq(1, 2, 3).toDF("col1")
        .withColumn("uuid", expr("uuid()"))
        .show(false)
    
    }
    

    Output:

    +----+------------------------------------+
    |col1|uuid                                |
    +----+------------------------------------+
    |1   |24181c68-51b7-42ea-a9fd-f88dcfa10062|
    |2   |7cd21b25-017e-4567-bdd3-f33b001ee497|
    |3   |1df7cfa8-af8a-4421-834f-5359dc3ae417|
    +----+------------------------------------+
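
    As a sanity check outside Spark, the values above can be parsed with java.util.UUID. The sketch below is plain Java with no Spark dependency; UuidCheck and isRandomUuid are hypothetical names, and it assumes the column holds canonical 36-character version 4 UUID strings, as in the output shown.

    ```java
    import java.util.UUID;

    // Hypothetical helper, not part of Spark: checks that a string is a
    // canonical random (version 4) UUID.
    final class UuidCheck {

        static boolean isRandomUuid(String value) {
            // Canonical form is 36 characters: 32 hex digits and 4 hyphens.
            if (value == null || value.length() != 36) {
                return false;
            }
            try {
                UUID parsed = UUID.fromString(value);
                // Version 4 = randomly generated; variant 2 = RFC 4122 layout.
                return parsed.version() == 4 && parsed.variant() == 2;
            } catch (IllegalArgumentException e) {
                return false; // not parseable as a UUID
            }
        }

        public static void main(String[] args) {
            // First value is taken from the output table above.
            System.out.println(isRandomUuid("24181c68-51b7-42ea-a9fd-f88dcfa10062"));
            System.out.println(isRandomUuid(UUID.randomUUID().toString()));
            System.out.println(isRandomUuid("not-a-uuid"));
        }
    }
    ```

    The length check comes first because UUID.fromString can be lenient about group lengths on some Java versions.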
    
  • 2020-12-06 03:54

    This is how we did it in Java: we had a date column and wanted to add another column with the month.

    Dataset<Row> newData = data.withColumn("month", month((unix_timestamp(col("date"), "MM/dd/yyyy")).cast("timestamp")));
    

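    You can use a similar technique to add a UUID column. Note that expr("uuid()") is used here instead of lit(UUID.randomUUID().toString()), which would produce a single UUID for the whole column.

    Dataset<Row> newData1 = newData.withColumn("uuid", expr("uuid()"));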
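
    The difference between a literal and a per-row expression can be illustrated without Spark. In the plain-Java sketch below (UuidPerRow and its methods are hypothetical names, and a "column" is just a list), constantColumn mirrors passing a precomputed string to lit(...), which repeats one value on every row, while generatedColumn mirrors an expression that is evaluated once per row.

    ```java
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.UUID;
    import java.util.function.Supplier;

    // Plain-Java illustration only; no Spark classes are involved.
    final class UuidPerRow {

        // The value was computed once by the caller and is repeated per row.
        static List<String> constantColumn(int rows, String value) {
            List<String> column = new ArrayList<>();
            for (int i = 0; i < rows; i++) {
                column.add(value);
            }
            return column;
        }

        // The generator is invoked once for each row.
        static List<String> generatedColumn(int rows, Supplier<String> generator) {
            List<String> column = new ArrayList<>();
            for (int i = 0; i < rows; i++) {
                column.add(generator.get());
            }
            return column;
        }

        public static void main(String[] args) {
            List<String> constant = constantColumn(3, UUID.randomUUID().toString());
            List<String> generated = generatedColumn(3, () -> UUID.randomUUID().toString());
            System.out.println("distinct values, constant:  " + new HashSet<>(constant).size());
            System.out.println("distinct values, generated: " + new HashSet<>(generated).size());
        }
    }
    ```

    The constant column contains one distinct value regardless of the row count, which is rarely what a UUID column is meant to hold.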
    

    Cheers!

  • 2020-12-06 03:56

    You should try something like this:

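    import java.util.UUID
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.udf

    val sc: SparkContext = ...
    val sqlContext = new SQLContext(sc)
    
    import sqlContext.implicits._
    
    // Mark the UDF as nondeterministic (Spark 2.3+) so the optimizer does not
    // assume that repeated calls return the same value.
    val generateUUID = udf(() => UUID.randomUUID().toString).asNondeterministic()
    val df1 = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
    val df2 = df1.withColumn("UUID", generateUUID())
    
    df1.show()
    df2.show()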
    

    Output will be:

    +---+-----+
    | id|value|
    +---+-----+
    |id1|    1|
    |id2|    4|
    |id3|    5|
    +---+-----+
    
    +---+-----+--------------------+
    | id|value|                UUID|
    +---+-----+--------------------+
    |id1|    1|f0cfd0e2-fbbe-40f...|
    |id2|    4|ec8db8b9-70db-46f...|
    |id3|    5|e0e91292-1d90-45a...|
    +---+-----+--------------------+
    