Why is the final reduce step extremely slow in this MapReduce? (HiveQL, HDFS MapReduce)

说谎 2020-11-29 11:02

Some background information:

I'm working with Dataiku DSS, HDFS, and partitioned datasets. I have a particular job running (Hive query) which has t

1 Answer
  • 2020-11-29 11:13

    If the final reducer is a join, this looks like skew in the join key. First, check two things.

    Check that the join key b.f1 has no duplicates, since duplicate keys in B multiply the matching rows of A in the join output:

    select b.f1, count(*) cnt from B b 
     group by b.f1 
    having count(*)>1 order by cnt desc;
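
    If this query does return duplicates, one possible way to handle them is to reduce B to a single row per key before joining. The following is only a sketch: whether keeping one row per key is acceptable depends on the data, and updated_at is a hypothetical column used here to choose which row to keep.

    ```sql
    -- Keep one row of B per join key (assumption: the newest row wins).
    -- updated_at is a hypothetical column; substitute whatever ordering fits your data.
    select *
      from (select b.*,
                   row_number() over (partition by b.f1 order by b.updated_at desc) as rn
              from B b) dedup
     where dedup.rn = 1;
    ```

    The result carries an extra rn column, which can be dropped by listing the needed columns explicitly instead of using select *.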
    

    Check the distribution of a.f1:

    select a.f1, count(*) cnt from A a
     group by a.f1  
    order by cnt desc
    limit 10;
    

    This query shows the skewed keys: the values of f1 with the largest row counts.
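
    To judge how severe the skew is, it can help to see each key's share of all rows rather than only the raw counts. A minimal sketch, assuming a Hive version with windowing functions (0.11 or later):

    ```sql
    -- Top 10 keys of A by row count, with each key's fraction of all rows.
    -- A NULL key near the top is also skew: all NULL-key rows land on one reducer.
    select t.f1,
           t.cnt,
           t.cnt / sum(t.cnt) over () as share
      from (select a.f1, count(*) as cnt
              from A a
             group by a.f1) t
     order by cnt desc
     limit 10;
    ```

    A share close to 1 for a single key means that one reducer receives nearly all of the rows for the join.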

    If there is skew (too many rows with the same value), join the skewed keys separately and combine the results with UNION ALL:

    SELECT a.f1, f2, ..., fn
      FROM ( select * from A where f1 = skewed_value) as a --skewed
      LEFT JOIN B as b
      ON a.f1 = b.f1
    WHERE {PARTITION_FILTER}
    UNION ALL
    SELECT a.f1, f2, ..., fn
      FROM ( select * from A where f1 != skewed_value or f1 is null) as a --all other, NULL keys included
      LEFT JOIN B as b
      ON a.f1 = b.f1
    WHERE {PARTITION_FILTER}
    

    Finally, if there are no issues with skew or duplicates, try increasing reducer parallelism. First get the current bytes-per-reducer setting:

    set hive.exec.reducers.bytes.per.reducer;

    Typically this returns a value of about 1 GB. Try dividing it by two, set the new value before your query, and check how many reducers start and how the query performs. The success criteria are that more reducers start and performance improves. For example, to set 64 MiB (67108864 bytes) per reducer:

    set hive.exec.reducers.bytes.per.reducer=67108864;
    

    The fewer bytes per reducer, the more reducers are started, which increases parallelism.
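
    As a rough illustration of the effect, Hive estimates the reducer count from the input size divided by the bytes per reducer, capped by hive.exec.reducers.max. The 10 GiB input size below is a made-up figure used only for the arithmetic:

    ```sql
    -- Hypothetical job input size: 10 GiB = 10737418240 bytes.
    --   at 1 GiB per reducer  (1073741824): about 10 reducers
    --   at 64 MiB per reducer (67108864):   about 160 reducers
    set hive.exec.reducers.bytes.per.reducer=67108864;

    -- Show the cap on the estimated reducer count.
    set hive.exec.reducers.max;

    -- Alternatively, force a fixed number of reducers (160 is an example value).
    set mapreduce.job.reduces=160;
    ```

    More reducers do not help with a skewed key, because all rows with the same key still go to the same reducer, which is why the skew checks come first.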

    UPDATE: Try enabling map join. Your second table is small enough to fit in memory, so a map join will run without reducers at all, and skew on the reducers will not be a problem.

    How to enable mapjoin: https://stackoverflow.com/a/49154414/2700344
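
    For reference, a sketch of the settings usually involved in automatic map-join conversion. The property names are standard Hive configuration properties, but which threshold applies depends on the Hive version and the other settings, and the sizes are example values that need to be at least the size of your small table:

    ```sql
    -- Let Hive convert a common join to a map join when one side is small enough.
    set hive.auto.convert.join=true;

    -- Size threshold in bytes below which a table counts as small (example: 100 MB).
    set hive.mapjoin.smalltable.filesize=100000000;

    -- When hive.auto.convert.join.noconditionaltask is true, this combined size limit
    -- of the small tables is used instead (example: 100 MB).
    set hive.auto.convert.join.noconditionaltask.size=100000000;
    ```

    The small table is loaded into memory on every mapper, so the task memory has to be large enough to hold it.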
