If I understand correctly, when a reduce task gathers its input shuffle blocks (from the outputs of different map tasks), it first keeps them in memory (Q1).
Memory management differs before and after Spark 1.6. In both cases there are notions of execution memory and storage memory. The difference is that before 1.6 the split is static: configuration parameters fix how much memory goes to execution and how much to storage, and data spills to disk when either pool runs out. Since 1.6 the boundary is unified, so execution and storage share one region and can borrow from each other.
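As a sketch, assuming the standard configuration property names (the values shown are the usual defaults, but treat them as illustrative), the two models are controlled by different settings:

```
# Pre-1.6 (static): fixed fractions of the executor heap.
spark.storage.memoryFraction   0.6   # cached RDD blocks
spark.shuffle.memoryFraction   0.2   # shuffle/execution buffers

# 1.6+ (unified): one shared region; storage can be evicted for execution.
spark.memory.fraction          0.6   # fraction of heap for execution + storage
spark.memory.storageFraction   0.5   # share of that region protected for storage
```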
One of the issues that Apache Spark has to work around is the concurrent execution of tasks (which consume execution memory) and caching (which consumes storage memory) within the same executor heap.
I'd say that your understanding is correct.
What's in memory is uncompressed, or else it cannot be processed. Execution memory is spilled to disk in blocks and, as you mentioned, those blocks can be compressed.
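For reference, and assuming the standard property names, compression of shuffled and spilled blocks is governed by settings like these (a sketch, not an exhaustive list):

```
spark.shuffle.compress         true   # compress map outputs (shuffle blocks)
spark.shuffle.spill.compress   true   # compress blocks spilled during shuffles
spark.io.compression.codec     lz4    # codec used by the two settings above
```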
Well, since 1.3.1 you can configure it, so you know the size. As for what's left at any moment in time, you can see that by looking at the executor process with something like `jstat -gcutil <pid> <period>`, which gives a clue of how much memory is free there. Knowing how much memory is configured for storage and execution, and keeping `default.parallelism` as low as possible, might also give you a clue.
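To make that concrete, here is a minimal sketch (the sample line below is made up) of pulling the old-generation occupancy out of one `jstat -gcutil` sample with `awk`; in practice you would pipe live `jstat` output instead of a captured line:

```shell
# One captured sample line from `jstat -gcutil <pid>` (JDK 8 column order:
# S0 S1 E O M CCS YGC YGCT FGC FGCT GCT); the numbers are hypothetical.
sample='  0.00  97.02  70.31  66.80  95.52  92.76     64    0.953     2    0.107    1.060'

# Column 4 (O) is the old-generation occupancy as a percent of its capacity.
printf '%s\n' "$sample" | awk '{ printf "old gen used: %.2f%%\n", $4 }'
# prints: old gen used: 66.80%
```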
That's true, but it's hard to reason about: there may be skew in the data (some keys have many more values than others), many executions run in parallel, and so on.
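As a tiny illustration of why skew makes this hard to bound (the keys here are made up), counting values per key shows one reduce partition doing most of the work:

```shell
# Hypothetical shuffled keys: key "a" dominates, so the reducer that owns "a"
# must buffer far more values than the reducers that own "b" or "c".
printf 'a\na\na\na\na\na\nb\nc\n' | sort | uniq -c
```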