Why is this simple Spark program not utilizing multiple cores?

情深已故 asked on 2021-02-10 01:39

So, I'm running this simple program on a 16-core multicore system. I run it by issuing the following:

spark-submit --master local[*] pi.py

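The listing of pi.py itself does not survive above. Judging from the answer below, which maps a sample function over roughly 12 million points and sums the hits with reduce, it was presumably a Monte Carlo π estimator. The sketch below shows the shape such a script typically has; the names sample and N, the constants, and the print format are assumptions for orientation, not the original code:

import random
from pyspark import SparkContext

sc = SparkContext(appName="pi")
N = 12000000  # total number of random points; the answer below mentions 12 million

def sample(_):
    # draw one random point in the unit square and report whether it
    # lands inside the quarter circle of radius 1
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1 else 0

count = sc.parallelize(xrange(N)).map(sample).reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / N))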

4 Answers
  •  野性不改
    2021-02-10 02:20

    Probably because the call to sc.parallelize puts all the data into a single partition. You can specify the number of partitions as the second argument to parallelize:

    part = 16
    count = sc.parallelize(xrange(N), part).map(sample).reduce(lambda a, b: a + b)
    

    Note that this would still generate the 12 million points with one CPU in the driver and only then spread them out to 16 partitions to perform the reduce step.
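
    If you want to see where the data actually ends up, the partition count can be inspected directly before and after passing that second argument. A quick check, assuming your PySpark version provides RDD.getNumPartitions() and reusing the N from your script:

    # quick check of how many partitions each RDD actually has
    rdd_default = sc.parallelize(xrange(N))       # no partition count given
    print(rdd_default.getNumPartitions())
    rdd_explicit = sc.parallelize(xrange(N), 16)  # partition count forced to 16
    print(rdd_explicit.getNumPartitions())        # should report 16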

    A better approach is to do most of the work after the partitioning: for example, the following generates only a tiny array on the driver and then lets each remote task generate the actual random numbers and the subsequent π approximation:

    part = 16
    count = ( sc.parallelize([0] * part, part)
               .flatMap(lambda _: [sample(p) for p in xrange(N / part)])
               .reduce(lambda a, b: a + b)
           )
    

    Finally (because the lazier we are, the better), Spark MLlib already ships with nicely parallelized random data generation; have a look here: http://spark.apache.org/docs/1.1.0/mllib-statistics.html#random-data-generation. So maybe the following is close to what you are trying to do (not tested, so it probably won't run as-is, but it should be close):

    from pyspark.mllib.random import RandomRDDs

    count = ( RandomRDDs.uniformRDD(sc, N, part)
            .zip(RandomRDDs.uniformRDD(sc, N, part))
            .filter(lambda (x, y): x*x + y*y < 1)
            .count()
            )
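
    Whichever variant you use, the reduce or count only yields the number of points that landed inside the quarter circle; the actual estimate still has to be derived from it. Assuming N total points were generated:

    pi_estimate = 4.0 * count / N  # fraction of hits times 4 approximates π
    print(pi_estimate)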
    
