Spark Task Memory allocation

后悔当初 2021-02-03 14:11

I am trying to find the best way to configure memory on the nodes of my cluster. However, I believe there are some things I need to understand better first, such

1 answer
    说谎 (original poster)
    2021-02-03 15:02

    Why should I increase number of tasks (partitions)?

    I would like to start with the last question, since it is the one confusing you. Here is a quote from another question:

    Spark does not need to load everything in memory to be able to process it. This is because Spark will partition the data into smaller blocks and operate on these separately.

    In fact, by default Spark tries to split input data automatically into some optimal number of partitions:

    Spark automatically sets the number of “map” tasks to run on each file according to its size

    You can specify the number of partitions for the operation being performed (for example, for cogroup: def cogroup[W](other: RDD[(K, W)], numPartitions: Int)), and you can also call .repartition() after any RDD transformation.

    Moreover, later in the same paragraph of the documentation they say:

    In general, we recommend 2-3 tasks per CPU core in your cluster.

    In summary:

    1. the default number of partitions is a good start;
    2. 2-3 partitions per CPU core are generally recommended.
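
    As a rough illustration of the 2-3 tasks per core guideline, here is a minimal Python sketch. The function name suggested_partitions and the cluster shape (10 executors with 4 cores each) are hypothetical and chosen only for the example; it does nothing but the arithmetic and is not part of any Spark API.

```python
from typing import Tuple


def suggested_partitions(num_executors: int, cores_per_executor: int) -> Tuple[int, int]:
    """Return a (low, high) partition count for 2-3 tasks per CPU core.

    Hypothetical helper: it only applies the rule of thumb quoted above.
    """
    if num_executors <= 0 or cores_per_executor <= 0:
        raise ValueError("executor and core counts must be positive")
    total_cores = num_executors * cores_per_executor
    return 2 * total_cores, 3 * total_cores


if __name__ == "__main__":
    # Assumed cluster shape, for illustration only.
    low, high = suggested_partitions(num_executors=10, cores_per_executor=4)
    print(f"aim for roughly {low} to {high} partitions")
```

    A number in that range could then be passed as numPartitions or to .repartition(); treat it as a starting point to tune, not a fixed rule.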

    How does Spark deal with inputs that do not fit in memory?

    In short, by partitioning the input and intermediate results (RDDs). Usually each small chunk fits into the memory available to the executor and is processed quickly.

    Spark is capable of caching the RDDs it has computed. By default an RDD is not cached, so it is recomputed every time it is reused; calling .cache() or .persist() keeps the already computed result in memory or on disk.

    Internally, each executor has a memory pool that floats between execution and storage (see here for more details). When there is not enough memory for task execution, Spark first tries to evict some of the storage cache, and then spills task data to disk. See these slides for further details. Balancing between execution and storage memory is well described in this blog post.
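
    To make the shared pool more concrete, here is a hedged Python sketch of how its size could be estimated. The numbers are assumptions that do not come from this answer: a 300 MiB reserved region, spark.memory.fraction = 0.6 and spark.memory.storageFraction = 0.5 are the defaults described in the Spark configuration docs for recent versions, and they may differ in yours. The function name unified_memory_pools is hypothetical.

```python
RESERVED_MIB = 300.0  # assumed reserved memory, per recent Spark versions


def unified_memory_pools(heap_mib: float,
                         memory_fraction: float = 0.6,
                         storage_fraction: float = 0.5) -> dict:
    """Estimate executor memory regions (in MiB) from the heap size.

    memory_fraction mirrors spark.memory.fraction and storage_fraction
    mirrors spark.memory.storageFraction; both defaults are assumptions.
    """
    usable = heap_mib - RESERVED_MIB
    if usable <= 0:
        raise ValueError("heap is smaller than the reserved region")
    unified = usable * memory_fraction    # shared execution + storage pool
    storage = unified * storage_fraction  # cached blocks here resist eviction
    return {
        "unified": unified,
        "storage": storage,
        "execution": unified - storage,   # soft boundary, can borrow from storage
        "user": usable - unified,         # user data structures, metadata
    }


if __name__ == "__main__":
    # A 4 GiB executor heap, chosen only for the example.
    for region, size in unified_memory_pools(4096).items():
        print(f"{region:>9}: {size:8.1f} MiB")
```

    The boundary between the storage and execution parts is soft, which is why the answer describes the pool as floating between the two.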

    OutOfMemory errors often happen not directly because of large input data, but because of poor partitioning and the resulting large auxiliary data structures, such as the HashMap on reducers (here the documentation again advises having more partitions than executors). So no, OutOfMemory will not happen just because the input is big, although processing may be very slow (since data has to be written to and read from disk). The documentation also suggests that tasks as short as 200 ms (in running time) are fine for Spark.

    Outline: split your data properly: more than 1 partition per core, and the running time of each task should be >200 ms. Default partitioning is a good starting point; tweak the parameters manually from there.

    (I would suggest using a 1/8 subset of the input data on a 1/8 cluster to find the optimal number of partitions.)

    Do tasks within same executor affect each other?

    Short answer: they do. For more details, check out the slides I mentioned above (starting from slide #32).

    Each of the N tasks gets roughly a 1/N share of the available memory, so they affect each other's "parallelism". If I interpret your idea of true parallelism correctly, it means "full utilization of CPU resources". In that case, yes: a small memory pool will result in data being spilled to disk and the computation becoming IO-bound (instead of CPU-bound).
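
    The per-task share can be sketched in a few lines of Python. The bounds used here, at least 1/(2N) and at most 1/N of the execution pool for N active tasks, are my understanding of how Spark's memory manager arbitrates between tasks and are not stated in this answer; the function name and the 1200 MiB pool size are hypothetical.

```python
from typing import Tuple


def per_task_execution_memory(execution_pool_mib: float,
                              active_tasks: int) -> Tuple[float, float]:
    """Return (guaranteed, maximum) execution memory per task in MiB.

    Assumes each of N active tasks may hold between 1/(2N) and 1/N
    of the execution pool before it has to spill to disk.
    """
    if active_tasks <= 0:
        raise ValueError("active_tasks must be positive")
    guaranteed = execution_pool_mib / (2 * active_tasks)
    maximum = execution_pool_mib / active_tasks
    return guaranteed, maximum


if __name__ == "__main__":
    # Assumed 1200 MiB execution pool, shared by 1, 2, 4 or 8 tasks.
    for n in (1, 2, 4, 8):
        low, high = per_task_execution_memory(1200, n)
        print(f"{n} tasks: {low:.0f} to {high:.0f} MiB each")
```

    Doubling the number of concurrent tasks in an executor halves what each one can hold in memory, which is where the extra spilling comes from.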

    Further reading

    I would highly recommend the entire Tuning Spark chapter and the Spark Programming Guide in general. See also this blog post on Spark Memory Management by Alexey Grishchenko.
