What will happen if Hive number of reducers is different to number of keys?

孤街浪徒 submitted on 2021-01-28 02:14:24

Question


In Hive I often run queries like:

select columnA, sum(columnB) from ... group by ...

I read some MapReduce examples, and there one reducer can only produce one key. So it seems the number of reducers depends entirely on the number of distinct keys in columnA.

So why does Hive let you set the number of reducers manually?

If there are 10 different values in columnA and I set the number of reducers to 2, what will happen? Will each reducer be reused 5 times?

If there are 10 different values in columnA and I set the number of reducers to 20, what will happen? Will Hive only generate 10 reducers?


Answer 1:


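Normally you should not set the exact number of reducers manually. Set hive.exec.reducers.bytes.per.reducer instead:

-- The number of reduce tasks is determined at compile time from the estimated input size.
-- The default is 256 MB in Hive 0.14.0 and later (1 GB in earlier versions);
-- at 1 GB per reducer, an estimated input of 10 GB gets 10 reducers.
-- This example lowers the value to 64 MB:
set hive.exec.reducers.bytes.per.reducer=67108864;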

If you want to limit how much of the cluster a job's reducers can use, set this property: hive.exec.reducers.max
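As a rough illustration of how these two settings interact, here is a minimal Python sketch of a size-based estimate. It is a simplification and not Hive's actual code: the function name estimate_reducers is hypothetical, and the cap of 1009 is only an illustrative value for hive.exec.reducers.max.

```python
import math


def estimate_reducers(input_bytes: int, bytes_per_reducer: int, max_reducers: int) -> int:
    """Simplified estimate: one reducer per bytes_per_reducer of input,
    at least 1 and at most max_reducers. (Hypothetical helper, not Hive's API.)"""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


GB = 10**9

# 10 GB of estimated input at 1 GB per reducer
print(estimate_reducers(10 * GB, 1 * GB, 1009))
# The same input at 64 MiB per reducer (the value set above)
print(estimate_reducers(10 * GB, 67108864, 1009))
# The same input again, but capped by a lower reducer limit
print(estimate_reducers(10 * GB, 67108864, 100))
```

Note that the number of distinct keys does not appear anywhere in this estimate: only the input size and the two settings do.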

If you are running on Tez, Hive can adjust the number of reducers dynamically at execution time if this property is set:

set hive.tez.auto.reducer.parallelism = true;

In this case the number of reducers initially started may be larger, because it was estimated from the input size; at runtime, extra reducers can be removed.

One reducer can process many keys; how many depends on the data size and on the bytes.per.reducer and reducer-limit settings. Keys are distributed across reducers by hash, so with 10 distinct keys and 2 reducers each reducer simply handles several keys. In a query like yours, all rows with the same key go to the same reducer: each reducer container runs in isolation, so every row for a given key must reach a single reducer for it to compute the sum for that key.

Extra reducers can be forced (mapreduce.job.reducers=N) or started automatically because of a wrong estimate (for example, from stale statistics). If they are not removed at runtime, they do nothing and finish quickly because there is nothing to process. Such reducers are still scheduled and have containers allocated, though, so it is better not to force extra reducers and to keep statistics fresh for better estimates.
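To make the key-to-reducer mapping concrete, here is a small Python simulation of the shuffle for a GROUP BY ... SUM query with 10 distinct keys, run once with 2 reducers and once with 20. It is only a sketch: zlib.crc32 stands in for Hive's real hash function, and the names partition and run_group_by_sum are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key: stable hash modulo the reducer count.
    crc32 is an assumption standing in for the engine's own hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_group_by_sum(rows, num_reducers):
    """Simulate shuffle + reduce: route each row to a reducer by key,
    then let that reducer keep a running sum per key."""
    # One {key: sum} dict per reducer, including reducers that stay empty.
    reducers = [defaultdict(int) for _ in range(num_reducers)]
    for key, value in rows:
        reducers[partition(key, num_reducers)][key] += value
    return reducers


# 10 distinct keys (a0..a9), 3 rows each
rows = [(f"a{i}", i) for i in range(10)] * 3

for n in (2, 20):
    reducers = run_group_by_sum(rows, n)
    busy = sum(1 for r in reducers if r)
    print(f"{n} reducers: {busy} received data, {n - busy} received none")
```

With 2 reducers, each one handles several keys in a single run; nothing is "reused" per key. With 20 reducers, at most 10 can receive data, and the rest have nothing to process.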



Source: https://stackoverflow.com/questions/62369975/what-will-happen-if-hive-number-of-reducers-is-different-to-number-of-keys
