I am trying to solve the age-old problem of adding a sequence number to a data set. I am working with DataFrames, and there appears to be no DataFrame equivalent to RDD.zipWithIndex.
Since Spark 1.6 there is a function called monotonically_increasing_id().
It generates a new column with a unique 64-bit monotonically increasing id for each row.
But the ids are not consecutive: each partition starts its own range, so we must compute each partition's offset before using it.
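To make the gap concrete, here is a small illustration (not part of the solution below; it assumes a SparkSession named spark is in scope, as in spark-shell):

import org.apache.spark.sql.functions.monotonically_increasing_id

// With 2 partitions the ids are unique and increasing but not consecutive:
// the upper bits encode the partition id, so partition 1 starts at 1L << 33 = 8589934592.
spark.range(6)
  .repartition(2)
  .withColumn("id", monotonically_increasing_id())
  .show(false)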
Trying to provide an "rdd-free" solution, I ended up with some collect(), but it only collects the offsets, one value per partition, so it will not cause OOM.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType

def zipWithIndex(df: DataFrame, offset: Long = 1, indexName: String = "index") = {
  val dfWithPartitionId = df
    .withColumn("partition_id", spark_partition_id())
    .withColumn("inc_id", monotonically_increasing_id())

  // One row per partition: its row count and its first inc_id. The window sum
  // turns these into the offset each partition needs so that
  // partition_offset + inc_id becomes consecutive across the whole DataFrame.
  val partitionOffsets = dfWithPartitionId
    .groupBy("partition_id")
    .agg(count(lit(1)) as "cnt", first("inc_id") as "inc_id")
    .orderBy("partition_id")
    .select(sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") - col("inc_id") + lit(offset) as "cnt")
    .collect()
    .map(_.getLong(0))
    .toArray

  dfWithPartitionId
    .withColumn("partition_offset", udf((partitionId: Int) => partitionOffsets(partitionId), LongType)(col("partition_id")))
    .withColumn(indexName, col("partition_offset") + col("inc_id"))
    .drop("partition_id", "partition_offset", "inc_id")
}
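A minimal usage sketch (the column names and offsets below are just examples of the parameters declared above, for some already-loaded DataFrame df):

// Hypothetical usage: add a consecutive 1-based "index" column, and a 0-based "row_num".
val indexed   = zipWithIndex(df)                                    // offset = 1, indexName = "index"
val zeroBased = zipWithIndex(df, offset = 0, indexName = "row_num")
indexed.orderBy("index").show(5, truncate = false)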
This solution doesn't repack the original rows and doesn't repartition the original huge DataFrame, so it is quite fast in the real world:
200 GB of CSV data (43 million rows with 150 columns) read, indexed and written to Parquet in 2 minutes on 240 cores.
After testing my solution, I ran Kirk Broadhurst's solution, and it was 20 seconds slower.
You may or may not want to use dfWithPartitionId.cache(), depending on the task.
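If caching is wanted, a sketch of where it could go (my guess at the natural spot, right after dfWithPartitionId is built inside the function above):

// Optional: cache so that computing the per-partition offsets and building the
// final indexed DataFrame reuse the same materialized rows instead of
// re-reading the source twice.
val dfWithPartitionId = df
  .withColumn("partition_id", spark_partition_id())
  .withColumn("inc_id", monotonically_increasing_id())
  .cache()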
Here is my proposal, the advantages of which are:

- It does not involve any serialization/deserialization [1] of your DataFrame's InternalRows.
- Its logic relies only on RDD.zipWithIndex.

Its major downside is that the code has to be placed in a file under package org.apache.spark.sql, because it relies on Spark's internal SQL APIs.

imports:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.LogicalRDD
import org.apache.spark.sql.functions.lit
/**
* Optimized Spark SQL equivalent of RDD.zipWithIndex.
*
* @param df the DataFrame to index
* @param indexColName the name of the index column to add (default "index")
* @return `df` with a column named `indexColName` of consecutive unique ids.
*/
def zipWithIndex(df: DataFrame, indexColName: String = "index"): DataFrame = {
  import df.sparkSession.implicits._

  // Reserve slot 0 of every row for the index (dropping any pre-existing column
  // with the same name) so that it can be overwritten in place below.
  val dfWithIndexCol: DataFrame = df
    .drop(indexColName)
    .select(lit(0L).as(indexColName), $"*")

  // Work directly on the InternalRow RDD and let RDD.zipWithIndex assign
  // consecutive ids, written into the reserved first column.
  val internalRows: RDD[InternalRow] = dfWithIndexCol
    .queryExecution
    .toRdd
    .zipWithIndex()
    .map {
      case (internalRow: InternalRow, index: Long) =>
        internalRow.setLong(0, index)
        internalRow
    }

  Dataset.ofRows(
    df.sparkSession,
    LogicalRDD(dfWithIndexCol.schema.toAttributes, internalRows)(df.sparkSession)
  )
}
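A minimal usage sketch (assuming the function above has been compiled under package org.apache.spark.sql and is in scope; the DataFrame name is hypothetical):

// Hypothetical call site: ids are consecutive and 0-based, as with RDD.zipWithIndex.
val withIndex = zipWithIndex(someDf, "row_id")
withIndex.select("row_id").show(5)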
[1]: (from/to InternalRow's underlying bytes array <--> GenericRow's underlying JVM objects collection Array[Any]).
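For contrast, a sketch of the round trip the footnote refers to: df.rdd decodes every InternalRow into a GenericRow, which the function above avoids by staying on queryExecution.toRdd:

// df.rdd: RDD[Row] — each InternalRow is deserialized into JVM objects (Array[Any]).
val rowRdd: RDD[Row] = df.rdd
// df.queryExecution.toRdd: RDD[InternalRow] — stays in Spark's internal binary row format.
val internalRdd: RDD[InternalRow] = df.queryExecution.toRdd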