What are the differences between slices and partitions of RDDs?

星月不相逢 2021-02-08 00:21

I am using Spark's Python API and running Spark 0.8.

I am storing a large RDD of floating point vectors and I need to perform calculations of one vector against the entire set.

2 Answers
  • 2021-02-08 01:01

    You can control partitioning explicitly by supplying a custom Partitioner, as follows:

    import org.apache.spark.Partitioner

    // A Partitioner with two partitions; keys are assumed to be Ints and are
    // mapped into the range [0, numPartitions).
    val p = new Partitioner() {
      def numPartitions = 2
      def getPartition(key: Any) = key.asInstanceOf[Int] % numPartitions
    }

    // partitionBy() is only defined on RDDs of key-value pairs.
    recordRDD.partitionBy(p)
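
    The PySpark equivalent is slightly different, since partitionBy() there takes a partition count and an optional partitioning function rather than a Partitioner object. A minimal sketch (the RDD contents and the local context are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext('local', 'partition-demo')  # hypothetical local context

    # partitionBy() works on pair RDDs; the keys here are small ints for illustration.
    record_rdd = sc.parallelize([(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0)])
    partitioned = record_rdd.partitionBy(2, lambda key: key % 2)

    print(partitioned.glom().collect())  # one list of (key, value) pairs per partition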
    
  • 2021-02-08 01:16

    I believe slices and partitions are the same thing in Apache Spark.
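
    You can see the older "slices" terminology surviving in the API: sc.parallelize() calls its second argument numSlices, but the result is counted in partitions. A small sketch (assuming sc is an existing SparkContext and Spark >= 1.1.0 for getNumPartitions()):

    data = sc.parallelize(range(1000), 10)  # the argument is named numSlices
    print(data.getNumPartitions())          # prints 10 -- "slices" become partitions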

    However, there is a subtle but potentially significant difference between the two pieces of code you posted.

    This code tells Spark to read demo.txt directly into (at least) 100 partitions, which can then be processed by up to 100 concurrent tasks:

    rdd = sc.textFile('demo.txt', 100)
    

    For uncompressed text this works as expected. But if instead of demo.txt you had demo.gz, you would end up with an RDD with only one partition, because reads against gzipped files cannot be parallelized.
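
    A quick way to see this (a sketch; it assumes demo.txt and demo.gz exist and that getNumPartitions() is available, i.e. Spark >= 1.1.0):

    plain = sc.textFile('demo.txt', 100)
    gzipped = sc.textFile('demo.gz', 100)
    print(plain.getNumPartitions())    # typically 100 (or more) for plain text
    print(gzipped.getNumPartitions())  # 1 -- a gzip file is not splittable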

    On the other hand, the following code will first open demo.txt into an RDD with the default number of partitions, then it will explicitly repartition the data into 100 partitions that are roughly equal in size.

    rdd = sc.textFile('demo.txt')
    rdd = rdd.repartition(100)
    

    So in this case, even with a demo.gz you will end up with an RDD with 100 partitions.

    As a side note, I replaced your partitionBy() with repartition(), since I believe that is what you were looking for: partitionBy() requires the RDD to be an RDD of tuples. However, repartition() is not available in Spark 0.8.0, so there you should be able to use coalesce(100, shuffle=True) instead.
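
    In other words, on Spark 0.8.0 the second snippet would become something like this sketch:

    rdd = sc.textFile('demo.txt')
    # coalesce() with shuffle=True redistributes the data across 100 partitions,
    # much like repartition() does in later releases.
    rdd = rdd.coalesce(100, shuffle=True)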

    Spark can run one concurrent task for every partition of an RDD, up to the number of cores in your cluster. So if you have a cluster with 50 cores, you want your RDDs to have at least 50 partitions (and probably 2-3x that).
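
    One way to pick a partition count relative to the cluster is to scale sc.defaultParallelism, which usually reflects the total number of cores available to the application (a sketch, not a hard rule):

    target_partitions = sc.defaultParallelism * 3  # roughly 2-3x the core count
    rdd = sc.textFile('demo.txt').coalesce(target_partitions, shuffle=True)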

    As of Spark 1.1.0, you can check how many partitions an RDD has as follows:

    rdd.getNumPartitions()  # Python API
    rdd.partitions.size     // Scala API
    

    Before 1.1.0, the way to do this with the Python API was rdd._jrdd.splits().size().
