Is there any way I can shuffle a column of an RDD or DataFrame so that the entries in that column appear in random order? I'm not sure which APIs I could use to accomplish this.
While one cannot just shuffle a single column directly, it is possible to permute the records in an RDD via RandomRDDs: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/random/RandomRDDs.html
A potential approach to having only a single column permuted might be:
Use mapPartitions to do some setup/teardown on each Worker task.
Pull the records of the partition into memory with iterator.toList. Make sure you have many (small) partitions of data to avoid an OOME.
Return the result as list.toIterator from the mapPartitions.
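The steps above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the answerer's exact code: the helper name shuffleColumnWithinPartitions, the local SparkSession, and the sample data are hypothetical, and the values are permuted only within each partition rather than globally.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import scala.util.Random

// Hypothetical helper: permutes the values at position `colIdx` among the
// rows of each partition; every other field stays in place.
def shuffleColumnWithinPartitions(rows: RDD[Row], colIdx: Int): RDD[Row] =
  rows.mapPartitions { iterator =>
    // Materialize the partition; keep partitions small to avoid an OOME.
    val list = iterator.toList
    // Pull out the target column and shuffle it in memory.
    val shuffled = Random.shuffle(list.map(_.get(colIdx)))
    // Write every row back out, replacing only the target column.
    list.zip(shuffled).map { case (row, value) =>
      Row.fromSeq(row.toSeq.updated(colIdx, value))
    }.iterator
  }

// Assumed local session and sample data, for illustration only.
val spark = SparkSession.builder().master("local[2]").appName("shuffle-sketch").getOrCreate()
val df = spark
  .createDataFrame(Seq(("Max", 2001.21), ("Zhang", 3111.32), ("Bob", 1919.21), ("Paul", 3001.5)))
  .toDF("name", "salary")

val shuffledDf = spark.createDataFrame(shuffleColumnWithinPartitions(df.rdd, 1), df.schema)
shuffledDf.show(false)
```

Because the shuffle is local to each partition, a value can never move to a row in another partition; repartition first if that matters for your use case.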
What about selecting the column to shuffle, applying orderBy(rand) to it, and zipping it by index to the existing DataFrame?
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, rand}
import org.apache.spark.sql.types.{LongType, StructField, StructType}
def addIndex(df: DataFrame) = spark.createDataFrame(
  // Add index
  df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
  // Create schema
  StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
case class Entry(name: String, salary: Double)
val r1 = Entry("Max", 2001.21)
val r2 = Entry("Zhang", 3111.32)
val r3 = Entry("Bob", 1919.21)
val r4 = Entry("Paul", 3001.5)
val df = addIndex(spark.createDataFrame(Seq(r1, r2, r3, r4)))
// Shuffle only the salary column, then give it a fresh index to join on
val df_shuffled = addIndex(df
  .select(col("salary").as("salary_shuffled"))
  .orderBy(rand()))
df.join(df_shuffled, Seq("_index"))
  .drop("_index")
  .show(false)
+-----+-------+---------------+
|name |salary |salary_shuffled|
+-----+-------+---------------+
|Max  |2001.21|3001.5         |
|Zhang|3111.32|3111.32        |
|Paul |3001.5 |2001.21        |
|Bob  |1919.21|1919.21        |
+-----+-------+---------------+
You can add one additional, randomly generated column and then sort the records by that column. That way, you randomly shuffle your target column.
With this approach, you do not need to hold all the data in memory, which can easily cause an OOM. Spark takes care of the sorting and of memory limits by spilling to disk if necessary.
If you don't want the extra column, you can remove it after sorting.
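A minimal sketch of this idea, assuming a local SparkSession and made-up sample data; the column name "_rand" is a hypothetical choice for the temporary sort key:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

// Assumed local session and sample data, for illustration only.
val spark = SparkSession.builder().master("local[2]").appName("rand-sort-sketch").getOrCreate()
val df = spark
  .createDataFrame(Seq(("Max", 2001.21), ("Zhang", 3111.32), ("Bob", 1919.21), ("Paul", 3001.5)))
  .toDF("name", "salary")

val shuffledRows = df
  .withColumn("_rand", rand()) // one uniform random number per row
  .orderBy("_rand")            // Spark performs the sort, spilling to disk if needed
  .drop("_rand")               // remove the helper column after sorting

shuffledRows.show(false)
```

Note that sorting this way reorders whole rows. To permute a single column relative to the others, the same sort would presumably be applied to just the selected column, which is then joined back by index as in the previous answer.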
If you don't need a global shuffle across your data, you can shuffle within partitions using the mapPartitions method.
import scala.util.Random
rdd.mapPartitions(Random.shuffle(_))
For a PairRDD (an RDD of type RDD[(K, V)]), if you are interested in shuffling the key-value mappings (mapping an arbitrary key to an arbitrary value):
pairRDD.mapPartitions(iterator => {
  val (keySequence, valueSequence) = iterator.toSeq.unzip
  val shuffledValueSequence = Random.shuffle(valueSequence)
  keySequence.zip(shuffledValueSequence).iterator
}, true)
The boolean flag at the end denotes that partitioning is preserved by this operation (the keys are not changed), so that downstream operations such as reduceByKey can be optimized (avoiding shuffles).
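The effect of the flag can be observed on the partitioner field of the resulting RDD. The following is a small sketch under assumptions: the local SparkSession, the sample pairs, the HashPartitioner with two partitions, and the helper name shuffleValues are all hypothetical.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.util.Random

// Assumed local session and sample data, for illustration only.
val spark = SparkSession.builder().master("local[2]").appName("preserve-partitioning-sketch").getOrCreate()
val pairRDD: RDD[(String, Int)] = spark.sparkContext
  .parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3, "d" -> 4))
  .partitionBy(new HashPartitioner(2))

// Hypothetical helper: shuffles values among the keys of each partition.
def shuffleValues(preserve: Boolean): RDD[(String, Int)] =
  pairRDD.mapPartitions(iterator => {
    val (keys, values) = iterator.toVector.unzip
    keys.zip(Random.shuffle(values)).iterator
  }, preservesPartitioning = preserve)

// With the flag set, the HashPartitioner is carried over to the result;
// without it, Spark forgets how the data is partitioned.
println(shuffleValues(preserve = true).partitioner.isDefined)
println(shuffleValues(preserve = false).partitioner.isDefined)
```

When the partitioner is kept, a later reduceByKey using the same partitioner can run without moving data between partitions.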