Java & Spark: add unique incremental id to dataset


Question


With Spark and Java, I am trying to add an Integer identity column to an existing Dataset[Row] with n columns.

I have successfully added an id with zipWithUniqueId(), with zipWithIndex, and even with monotonically_increasing_id(), but none of them is satisfactory.

Example: I have a dataset with 195 rows. When I use any of these three methods, I get ids like 1584156487 or 12036. Moreover, those ids are not contiguous.

What I need/want is rather simple: an Integer id column whose values run from 1 to dataset.count(), one per row, where id = 1 is followed by id = 2, and so on.

How can I do that in Java/Spark?


Answer 1:


You can try the row_number window function:

In Java:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.expressions.Window;

// withColumn returns a new Dataset, so keep the result
Dataset<Row> dfWithId = df.withColumn("id", functions.row_number().over(Window.orderBy("a column")));

Or in Scala:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// withColumn returns a new DataFrame, so keep the result
val dfWithId = df.withColumn("id", row_number().over(Window.orderBy("a column")))
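If there is no meaningful column to sort by, one option (a sketch, not part of the original answer; dfWithId is an illustrative name) is to order the window by monotonically_increasing_id(), so that row_number() yields contiguous ids from 1 to dataset.count() in roughly the existing row order. Be aware that a window with no partitionBy moves all rows into a single partition, which is acceptable for small datasets like the 195-row example:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

import static org.apache.spark.sql.functions.monotonically_increasing_id;
import static org.apache.spark.sql.functions.row_number;

// row_number() numbers rows 1, 2, 3, ... in the window's sort order;
// monotonically_increasing_id() provides a sortable surrogate ordering.
WindowSpec w = Window.orderBy(monotonically_increasing_id());
Dataset<Row> dfWithId = df.withColumn("id", row_number().over(w));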



Answer 2:


If you are working with streaming DataFrames, you can use a UDF with a UUID generator:

import org.apache.spark.sql.functions.udf

val generateUuid = udf(() => java.util.UUID.randomUUID.toString)

// Append a globally unique (but non-sequential) string id to each row
val ddfStreamWithId = ddfStream.withColumn("uniqueId", generateUuid())
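Note that a UUID is a random string rather than a contiguous integer, so this suits deduplication more than the sequential numbering the question asks for. A rough Java equivalent might look like this (a sketch; "generateUuid" is an illustrative name, and the zero-argument UDF overload requires Spark 2.3+):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

// Register a zero-argument UDF that returns a random UUID string
spark.udf().register(
        "generateUuid",
        () -> java.util.UUID.randomUUID().toString(),
        DataTypes.StringType);

Dataset<Row> withUuid = ddfStream.withColumn("uniqueId", functions.callUDF("generateUuid"));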



Answer 3:


In Scala you can do it as below:

import org.apache.spark.sql.Row

val a = dataframe.collect().zipWithIndex
for (b: (Row, Int) <- a) {
  println(b._2)
}

Here b._2 gives you a unique number for each row, starting at 0 and running up to the row count minus 1.
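Keep in mind that collect() materializes every row on the driver, so this only makes sense for small datasets. A Java equivalent (a sketch) using collectAsList():

import java.util.List;
import org.apache.spark.sql.Row;

// Pull all rows to the driver and pair each with its position
List<Row> rows = df.collectAsList();
for (int i = 0; i < rows.size(); i++) {
    System.out.println(i + " -> " + rows.get(i));
}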




Answer 4:


You can also generate a unique, increasing id as below:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val df1 = spark.sqlContext.createDataFrame(
  df.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  StructType(df.schema.fields :+ StructField("id", LongType, false))
)
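Since the question asks for Java, a rough translation of the same zipWithIndex idea might look like this (a sketch; it assumes an existing SparkSession named spark, and adds 1 because zipWithIndex starts at 0):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Pair each row with a stable index, then append the index as a new column
JavaRDD<Row> indexed = df.toJavaRDD()
        .zipWithIndex()
        .map(tuple -> {
            Row row = tuple._1();
            long id = tuple._2() + 1; // start ids at 1 instead of 0
            Object[] values = new Object[row.size() + 1];
            for (int i = 0; i < row.size(); i++) {
                values[i] = row.get(i);
            }
            values[row.size()] = id;
            return RowFactory.create(values);
        });

// Extend the original schema with a non-nullable Long id column
List<StructField> fields = new ArrayList<>(Arrays.asList(df.schema().fields()));
fields.add(DataTypes.createStructField("id", DataTypes.LongType, false));
StructType schema = DataTypes.createStructType(fields);

Dataset<Row> dfWithId = spark.createDataFrame(indexed, schema);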

Hope this helps!



Source: https://stackoverflow.com/questions/45480208/java-spark-add-unique-incremental-id-to-dataset
