Creating array per Executor in Spark and combine into RDD

佛祖请我去吃肉 2021-01-14 07:03

I am moving from MPI-based systems to Apache Spark. I need to do the following in Spark.

Suppose I have n vertices. I want to create an edge list from

1 answer
  •  执笔经年
    2021-01-14 07:23

    Let's start with some imports and variables that the downstream processing will need:

    import org.apache.spark._
    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD
    import scala.util.Random
    import org.apache.spark.HashPartitioner
    
    val nPartitions: Int = ???  // number of partitions
    val n: Long = ???           // number of vertices
    val p: Double = ???         // probability of keeping each candidate edge
    

    Next we'll need an RDD of seed IDs that can be used to generate edges. A naive way to handle this would be something like this:

    sc.parallelize(0L until n)
    

    Since the number of generated edges depends on the node ID, this approach would give a highly skewed load. We can do a little better with repartitioning:

    sc.parallelize(0L until n)
      .map((_, None))
      .partitionBy(new HashPartitioner(nPartitions))
      .keys
    

    but a much better approach is to start with an empty RDD and generate the IDs in place. We'll need a small helper:

    def genNodeIds(nPartitions: Int, n: Long)(i: Int): Iterator[Long] = {
      // Lazily yield i, i + nPartitions, i + 2 * nPartitions, ... while below n
      Iterator.iterate(i.toLong)(_ + nPartitions).takeWhile(_ < n)
    }
    

    which can be used as follows:

    val empty = sc.parallelize(Seq.empty[Int], nPartitions)
    val ids = empty.mapPartitionsWithIndex((i, _) => genNodeIds(nPartitions, n)(i))
    
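To make the skew argument concrete, here is a minimal sketch in plain Scala (no Spark needed) that counts candidate edges per partition under two ID assignments. It assumes node i examines n - 1 - i candidates, as in the edge-generation step of this answer. The names `candidates`, `roundRobinIds`, `contiguousIds`, `loadN` and `loadPartitions` are hypothetical and exist only for illustration, and the contiguous split only approximates how a parallelized range is sliced.

```scala
// Candidate edges examined for node i: every j with i < j < n
def candidates(n: Long, i: Long): Long = n - 1 - i

// Same modulo rule as genNodeIds: partition k gets k, k + nPartitions, ...
def roundRobinIds(nPartitions: Int, n: Long)(k: Int): Iterator[Long] =
  Iterator.iterate(k.toLong)(_ + nPartitions).takeWhile(_ < n)

// Consecutive blocks of IDs, shown only for comparison (hypothetical helper)
def contiguousIds(nPartitions: Int, n: Long)(k: Int): Iterator[Long] = {
  val size = (n + nPartitions - 1) / nPartitions // ceiling division
  (k * size until math.min((k + 1) * size, n)).iterator
}

val loadN = 1000L       // illustrative vertex count
val loadPartitions = 4  // illustrative partition count

// Total candidate edges handled by each partition under each scheme
val roundRobinLoad = (0 until loadPartitions).map { k =>
  roundRobinIds(loadPartitions, loadN)(k).map(i => candidates(loadN, i)).sum
}
val contiguousLoad = (0 until loadPartitions).map { k =>
  contiguousIds(loadPartitions, loadN)(k).map(i => candidates(loadN, i)).sum
}

println(s"round-robin: $roundRobinLoad")
println(s"contiguous:  $contiguousLoad")
```

With these values the contiguous split gives the first partition about seven times the work of the last one, while the round-robin loads differ by less than one percent.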

    Just a quick sanity check (it is quite expensive, so don't use it in production):

    require(ids.distinct.count == n) 
    

    and we can generate the actual edges using two more helpers:

    // For node i, keep each candidate edge (i, j) with i < j < n with probability p
    def genEdgesForId(p: Double, n: Long, random: Random)(i: Long) = {
      (i + 1 until n).filter(_ => random.nextDouble < p).map(j => Edge(i, j, ()))
    }
    
    def genEdgesForPartition(iter: Iterator[Long]) = {
      // Wrapping SecureRandom may be overkill, but better safe than sorry.
      // Depending on your requirements, it could be worth
      // considering commons-math:
      // https://commons.apache.org/proper/commons-math/userguide/random.html
      val random = new Random(new java.security.SecureRandom())
      iter.flatMap(genEdgesForId(p, n, random))
    }
    
    val edges = ids.mapPartitions(genEdgesForPartition)
    
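The edge rule above can be tried locally before running it on a cluster. The following is a minimal sketch, not the answer's code: it emits plain `(src, dst)` pairs instead of GraphX `Edge` objects so it runs without Spark, and `genPairsForId`, `sampleN`, `sampleP` and `rng` are hypothetical names. The fixed seed is there only to make the sketch repeatable.

```scala
import scala.util.Random

// Same rule as genEdgesForId: for node i, keep each j in (i, n) with probability p
def genPairsForId(p: Double, n: Long, random: Random)(i: Long): Iterator[(Long, Long)] =
  (i + 1 until n).iterator.filter(_ => random.nextDouble() < p).map(j => (i, j))

val sampleN = 200L         // illustrative vertex count
val sampleP = 0.1          // illustrative edge probability
val rng = new Random(42L)  // fixed seed, for repeatability only

// Generate the pairs for every node in a single local "partition"
val pairs = (0L until sampleN).iterator
  .flatMap(i => genPairsForId(sampleP, sampleN, rng)(i))
  .toVector

// Each of the n(n-1)/2 candidate pairs is kept with probability p
val expectedCount = sampleP * sampleN * (sampleN - 1) / 2

println(s"generated ${pairs.size} pairs, expected about $expectedCount")
```

Every generated pair has src < dst, so the result contains no self-loops and no duplicate edges, and the number of pairs should land near p * n * (n - 1) / 2.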

    Finally we can create a graph:

    val graph = Graph.fromEdges(edges, ())
    
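One thing worth checking on the result: `Graph.fromEdges` builds the vertex set from the edges, so a vertex that ends up with no edges does not appear in the graph. Below is a small sketch on a toy edge list, assuming Spark core and GraphX are on the classpath and a local master is acceptable. The names `toyConf`, `toySc`, `toyEdges`, `toyGraph` and `allVertices` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical local setup for illustration; reuses a running context if one exists
val toyConf = new SparkConf().setMaster("local[2]").setAppName("fromEdges-sketch")
val toySc = SparkContext.getOrCreate(toyConf)

// Five intended vertices (0 to 4), but vertex 4 has no edges
val toyEdges = toySc.parallelize(Seq(Edge(0L, 1L, ()), Edge(1L, 2L, ()), Edge(0L, 3L, ())))

// Vertices are derived from the edge endpoints only
val toyGraph = Graph.fromEdges(toyEdges, ())
val verticesFromEdges = toyGraph.numVertices
val edgeCount = toyGraph.numEdges

// Passing an explicit vertex RDD keeps isolated vertices as well
val allVertices = toySc.parallelize(0L until 5L).map(id => (id, ()))
val verticesExplicit = Graph(allVertices, toyEdges).numVertices

println(s"fromEdges: $verticesFromEdges vertices, $edgeCount edges")
println(s"explicit vertex RDD: $verticesExplicit vertices")
```

If isolated vertices matter for the downstream computation, building the graph from an explicit vertex RDD, as in the last step, is one option.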
