So, I'm running this simple program on a 16-core system. I run it by issuing the following:
spark-submit --master local[*] pi.py
And it only ever uses one core.
To change the CPU core consumption, set the number of cores to be used by the workers with the SPARK_EXECUTOR_CORES attribute in the spark-env.sh file in spark-installation-directory/conf. The value is set to 1 by default.
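For example, a minimal sketch of such a setting (the value 16 here is just an illustration for a 16-core machine, not the default):
# spark-installation-directory/conf/spark-env.sh
export SPARK_EXECUTOR_CORES=16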
Probably because the call to sc.parallelize puts all the data into one single partition. You can specify the number of partitions as the 2nd argument to parallelize:
part = 16
count = sc.parallelize(xrange(N), part).map(sample).reduce(lambda a, b: a + b)
Note that this would still generate the 12 million points with one CPU in the driver and then only spread them out to 16 partitions to perform the reduce step.
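For reference, sample and N are not defined in the snippets here; presumably sample is the usual Monte Carlo hit test from the Spark Pi example, something along these lines (a hypothetical sketch, names assumed from the question):
import random

N = 12000000  # e.g. the 12 million points mentioned above (assumed)

def sample(_):
    # pick a random point in the unit square; return 1 if it lands
    # inside the quarter circle, else 0
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1 else 0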
A better approach would try to do most of the work after the partitioning: for example, the following generates only a tiny array on the driver and then lets each remote task generate the actual random numbers and the subsequent PI approximation:
part = 16
count = ( sc.parallelize([0] * part, part)
          .flatMap(lambda blah: [sample(p) for p in xrange(N / part)])
          .reduce(lambda a, b: a + b)
        )
Finally (because the lazier we are, the better), Spark MLlib actually already comes with random data generation which is nicely parallelized, have a look here: http://spark.apache.org/docs/1.1.0/mllib-statistics.html#random-data-generation. So maybe the following is close to what you are trying to do (not tested => probably not working, but should hopefully be close):
from pyspark.mllib.random import RandomRDDs

count = ( RandomRDDs.uniformRDD(sc, N, part)
          .zip(RandomRDDs.uniformRDD(sc, N, part))
          .filter(lambda (x, y): x*x + y*y < 1)
          .count()
        )
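In either variant, count is the number of points that land inside the quarter circle (assuming sample returns 1 for a hit and 0 otherwise), so the estimate would then be pi ≈ 4.0 * count / N, with N the total number of sampled points.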
I tried the method mentioned by @Svend, but it still does not work.
The following works for me:
Do NOT use the local URL, for example:
sc = SparkContext("local", "Test App")
Use the master URL like this:
sc = SparkContext("spark://your_spark_master_url:port", "Test App")
As none of the above really worked for me (maybe because I didn't really understand them), here are my two cents.
I was starting my job with spark-submit program.py, and inside the file I had sc = SparkContext("local", "Test"). I tried to verify the number of cores Spark sees with sc.defaultParallelism. It turned out to be 1. When I changed the context initialization to sc = SparkContext("local[*]", "Test"), it became 16 (the number of cores of my system) and my program was using all the cores.
I am quite new to Spark, but my understanding is that local by default indicates the use of one core, and since it is set inside the program, it overrides the other settings (in my case it certainly overrode those from the configuration files and environment variables).
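If you want the master passed to spark-submit (e.g. --master local[*]) or the configuration files to stay in control, one option is to not hardcode the master in the code at all; a minimal sketch:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Test")   # no setMaster() here
sc = SparkContext(conf=conf)            # master comes from spark-submit / spark-defaults.conf
print(sc.defaultParallelism)            # should now reflect local[*], i.e. all cores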