Why is this simple Spark program not utilizing multiple cores?

情深已故 asked on 2021-02-10 01:39

So, I'm running this simple program on a 16-core multicore system. I run it by issuing the following:

spark-submit --master local[*] pi.py

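The listing of pi.py itself does not survive above. Judging from the answer below, which maps a sample function over roughly 12 million points and sums the hits with reduce, it was presumably a Monte Carlo π estimator. The sketch below shows the shape such a script typically has; the names sample and N, the constants, and the print format are assumptions for orientation, not the original code:

import random
from pyspark import SparkContext

sc = SparkContext(appName="pi")
N = 12000000  # total number of random points; the answer below mentions 12 million

def sample(_):
    # draw one random point in the unit square and report whether it
    # lands inside the quarter circle of radius 1
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1 else 0

count = sc.parallelize(xrange(N)).map(sample).reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / N))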

4 Answers
  •  野性不改
    2021-02-10 02:20

    Probably because the call to sc.parallelize puts all the data into a single partition. You can specify the number of partitions as the second argument to parallelize:

    part = 16
    count = sc.parallelize(xrange(N), part).map(sample).reduce(lambda a, b: a + b)
    

    Note that this would still generate the 12 million points with one CPU in the driver and only then spread them out to 16 partitions to perform the reduce step.
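
    If you want to see where the data actually ends up, the partition count can be inspected directly before and after passing that second argument. A quick check, assuming your PySpark version provides RDD.getNumPartitions() and reusing the N from your script:

    # quick check of how many partitions each RDD actually has
    rdd_default = sc.parallelize(xrange(N))       # no partition count given
    print(rdd_default.getNumPartitions())
    rdd_explicit = sc.parallelize(xrange(N), 16)  # partition count forced to 16
    print(rdd_explicit.getNumPartitions())        # should report 16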

    A better approach is to do most of the work after the partitioning: for example, the following generates only a tiny array on the driver and then lets each remote task generate the actual random numbers and the subsequent π approximation:

    part = 16
    count = ( sc.parallelize([0] * part, part)
               .flatMap(lambda _: [sample(p) for p in xrange(N / part)])
               .reduce(lambda a, b: a + b)
           )
    

    Finally (because the lazier we are, the better), Spark MLlib already ships with nicely parallelized random data generation; have a look here: http://spark.apache.org/docs/1.1.0/mllib-statistics.html#random-data-generation. So maybe the following is close to what you are trying to do (not tested, so it probably won't run as-is, but it should be close):

    from pyspark.mllib.random import RandomRDDs

    count = ( RandomRDDs.uniformRDD(sc, N, part)
            .zip(RandomRDDs.uniformRDD(sc, N, part))
            .filter(lambda (x, y): x*x + y*y < 1)
            .count()
            )
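
    Whichever variant you use, the reduce or count only yields the number of points that landed inside the quarter circle; the actual estimate still has to be derived from it. Assuming N total points were generated:

    pi_estimate = 4.0 * count / N  # fraction of hits times 4 approximates π
    print(pi_estimate)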
    
