I am trying to solve the age-old problem of adding a sequence number to a data set. I am working with DataFrames, and there appears to be no DataFrame equivalent to RDD.zipWithIndex.
Starting in Spark 1.5, Window expressions were added to Spark. Instead of having to convert the DataFrame to an RDD, you can now use org.apache.spark.sql.functions.row_number. Note that I found the performance of the above dfZipWithIndex to be significantly faster than the algorithm below, but I am posting it anyway.
At any rate, here's what works for me:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

df.withColumn("row_num", row_number.over(Window.partitionBy(lit(1)).orderBy(lit(1))))
Note that I use lit(1) for both the partitioning and the ordering -- this makes everything end up in the same partition, and seems to preserve the original ordering of the DataFrame, but I suppose it is also what slows it way down.
I tested it on a 4-column DataFrame with 7,000,000 rows, and the speed difference between this and the above dfZipWithIndex is significant (like I said, the RDD-based function is much, much faster).
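For reference, here is a minimal, self-contained sketch of the row_number approach above, assuming a Spark 2.x SparkSession and a toy DataFrame (the column names "name" and "score" are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// toy DataFrame; "name" and "score" are illustrative column names
val df = Seq(("a", 10), ("b", 20), ("c", 30)).toDF("name", "score")

// partitionBy(lit(1)) pulls everything into a single window partition,
// which keeps things simple but is also what makes this slow at scale
val withRowNum = df.withColumn(
  "row_num",
  row_number().over(Window.partitionBy(lit(1)).orderBy(lit(1)))
)

withRowNum.show()  // row_num is 1, 2, 3; as noted above, the original ordering seems to be preserved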
Spark Java API version:
I have implemented @Evgeny's solution for performing zipWithIndex on DataFrames in Java and wanted to share the code.
It also contains the improvements offered by @fylb in his solution. I can confirm for Spark 2.4 that the execution fails when the entries returned by spark_partition_id() do not start at 0 or do not increase sequentially. As this function is documented to be non-deterministic, it is very likely that one of the above cases will occur. One example is triggered by increasing the partition count.
The Java implementation is given below:
import static org.apache.spark.sql.functions.*;

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.types.DataTypes;

public static Dataset<Row> zipWithIndex(Dataset<Row> df, Long offset, String indexName) {
    Dataset<Row> dfWithPartitionId = df
            .withColumn("partition_id", spark_partition_id())
            .withColumn("inc_id", monotonically_increasing_id());

    Object partitionOffsetsObject = dfWithPartitionId
            .groupBy("partition_id")
            .agg(count(lit(1)).alias("cnt"), first("inc_id").alias("inc_id"))
            .orderBy("partition_id")
            .select(
                    col("partition_id"),
                    sum("cnt").over(Window.orderBy("partition_id"))
                            .minus(col("cnt"))
                            .minus(col("inc_id"))
                            .plus(lit(offset))
                            .alias("cnt"))
            .collect();
    Row[] partitionOffsetsArray = ((Row[]) partitionOffsetsObject);

    Map<Integer, Long> partitionOffsets = new HashMap<>();
    for (int i = 0; i < partitionOffsetsArray.length; i++) {
        partitionOffsets.put(partitionOffsetsArray[i].getInt(0), partitionOffsetsArray[i].getLong(1));
    }

    UserDefinedFunction getPartitionOffset = udf(
            (partitionId) -> partitionOffsets.get((Integer) partitionId), DataTypes.LongType
    );

    return dfWithPartitionId
            .withColumn("partition_offset", getPartitionOffset.apply(col("partition_id")))
            .withColumn(indexName, col("partition_offset").plus(col("inc_id")))
            .drop("partition_id", "partition_offset", "inc_id");
}
The following was posted on behalf of David Griffin (edited out of the question).
The all-singing, all-dancing dfZipWithIndex method. You can set the starting offset (which defaults to 1), the index column name (defaults to "id"), and place the column in the front or the back:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.Row
def dfZipWithIndex(
  df: DataFrame,
  offset: Int = 1,
  colName: String = "id",
  inFront: Boolean = true
) : DataFrame = {
  df.sqlContext.createDataFrame(
    df.rdd.zipWithIndex.map(ln =>
      Row.fromSeq(
        (if (inFront) Seq(ln._2 + offset) else Seq())
          ++ ln._1.toSeq ++
        (if (inFront) Seq() else Seq(ln._2 + offset))
      )
    ),
    StructType(
      (if (inFront) Array(StructField(colName,LongType,false)) else Array[StructField]())
        ++ df.schema.fields ++
      (if (inFront) Array[StructField]() else Array(StructField(colName,LongType,false)))
    )
  )
}
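A quick usage sketch of dfZipWithIndex (someDF and its column names are hypothetical):

// assuming someDF is an existing DataFrame, e.g.
// val someDF = Seq(("a", 1), ("b", 2)).toDF("letter", "number")

// defaults: index in front, starting at 1, column named "id"
val indexedFront = dfZipWithIndex(someDF)

// index appended at the back, starting at 100, column named "rowNum"
val indexedBack = dfZipWithIndex(someDF, offset = 100, colName = "rowNum", inFront = false)

indexedBack.printSchema()  // "rowNum" appears as a non-nullable long after the original columns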
PySpark version:
from pyspark.sql.types import LongType, StructField, StructType
def dfZipWithIndex (df, offset=1, colName="rowId"):
    '''
    Enumerates dataframe rows in native order, like rdd.zipWithIndex(), but on a dataframe
    and preserves a schema

    :param df: source dataframe
    :param offset: adjustment to zipWithIndex()'s index
    :param colName: name of the index column
    '''

    new_schema = StructType(
        [StructField(colName, LongType(), True)]  # new added field in front
        + df.schema.fields                        # previous schema
    )

    zipped_rdd = df.rdd.zipWithIndex()

    new_rdd = zipped_rdd.map(lambda (row, rowId): ([rowId + offset] + list(row)))

    return spark.createDataFrame(new_rdd, new_schema)
Also created a JIRA to add this functionality to Spark natively: https://issues.apache.org/jira/browse/SPARK-23074
@Evgeny, your solution is interesting. Notice that there is a bug when you have empty partitions (the array is missing these partition indexes; at least this is happening to me with Spark 1.6), so I converted the array into a Map(partitionId -> offset).
Additionally, I adapted the source of monotonically_increasing_id so that "inc_id" restarts within each partition (it starts at 1, so that max("inc_id") equals the partition's row count).
Here is an updated version:
import org.apache.spark.sql.catalyst.expressions.LeafExpression
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.catalyst.expressions.Nondeterministic
import org.apache.spark.sql.catalyst.expressions.codegen.GeneratedExpressionCode
import org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext
import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
case class PartitionMonotonicallyIncreasingID() extends LeafExpression with Nondeterministic {
  /**
   * From org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
   *
   * Record ID within each partition. By being transient, count's value is reset every time
   * we serialize, deserialize and initialize it.
   */
  @transient private[this] var count: Long = _

  override protected def initInternal(): Unit = {
    count = 1L // notice this starts at 1, not 0 as in org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
  }

  override def nullable: Boolean = false

  override def dataType: DataType = LongType

  override protected def evalInternal(input: InternalRow): Long = {
    val currentCount = count
    count += 1
    currentCount
  }

  override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
    val countTerm = ctx.freshName("count")
    ctx.addMutableState(ctx.JAVA_LONG, countTerm, s"$countTerm = 1L;")
    ev.isNull = "false"
    s"""
      final ${ctx.javaType(dataType)} ${ev.value} = $countTerm;
      $countTerm++;
    """
  }
}
object DataframeUtils {
  def zipWithIndex(df: DataFrame, offset: Long = 0, indexName: String = "index") = {
    // from https://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex
    val dfWithPartitionId = df
      .withColumn("partition_id", spark_partition_id())
      .withColumn("inc_id", new Column(PartitionMonotonicallyIncreasingID()))

    // collect each partition size, create the offset pages
    val partitionOffsets: Map[Int, Long] = dfWithPartitionId
      .groupBy("partition_id")
      .agg(max("inc_id") as "cnt") // in each partition, count(inc_id) is equal to max(inc_id) (I don't know which one would be faster)
      .select(col("partition_id"), sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") + lit(offset) as "cnt")
      .collect()
      .map(r => (r.getInt(0) -> r.getLong(1)))
      .toMap

    def partition_offset(partitionId: Int): Long = partitionOffsets(partitionId)
    val partition_offset_udf = udf(partition_offset _)

    // and re-number the index
    dfWithPartitionId
      .withColumn("partition_offset", partition_offset_udf(col("partition_id")))
      .withColumn(indexName, col("partition_offset") + col("inc_id"))
      .drop("partition_id")
      .drop("partition_offset")
      .drop("inc_id")
  }
}
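For completeness, a usage sketch of this variant (someDF is hypothetical; note that the code above targets the Spark 1.6-era expression and codegen APIs):

// assuming someDF is an existing DataFrame
// appends an "index" column built from the per-partition offsets plus the per-partition counter
val indexed = DataframeUtils.zipWithIndex(someDF)

// or with a custom starting offset and index column name
val indexedCustom = DataframeUtils.zipWithIndex(someDF, offset = 10L, indexName = "row_id")

indexed.show()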
I have modified @Tagar's version to run on Python 3.7 and wanted to share it:
from pyspark.sql.types import LongType, StructField, StructType

def dfZipWithIndex (df, offset=1, colName="rowId"):
    '''
    Enumerates dataframe rows in native order, like rdd.zipWithIndex(), but on a dataframe
    and preserves a schema

    :param df: source dataframe
    :param offset: adjustment to zipWithIndex()'s index
    :param colName: name of the index column
    '''

    new_schema = StructType(
        [StructField(colName, LongType(), True)]  # new added field in front
        + df.schema.fields                        # previous schema
    )

    zipped_rdd = df.rdd.zipWithIndex()

    # use this for Python 3+: the (row, rowId) tuple gets passed as a single argument,
    # so read its elements with [] indexing instead of tuple unpacking in the lambda
    new_rdd = zipped_rdd.map(lambda args: ([args[1] + offset] + list(args[0])))

    return spark.createDataFrame(new_rdd, new_schema)