Hive - Is there a way to further optimize a HiveQL query?

梦毁少年i 2021-01-15 11:56

I have written a query to find the 10 busiest airports in the USA from March to April. It produces the desired output; however, I want to try to optimize it further.

Ar

4 answers
  •  终归单人心
    2021-01-15 12:32

    Filter by airport (using an inner join) and aggregate before the UNION ALL, so that less data is passed to the final aggregation reducer. The UNION ALL subqueries, each with its own join, should run in parallel, and that should be faster than joining a bigger dataset after the UNION ALL.

    SELECT s.airport, SUM(s.cnt) AS Total_Flights
    FROM (
          -- departures: flights per US origin airport
          SELECT a.airport, COUNT(*) AS cnt
          FROM flights_stats f
               INNER JOIN airports a ON f.Origin = a.iata AND a.country = 'USA'
          WHERE Cancelled = 0 AND Month IN (3,4)
          GROUP BY a.airport
          UNION ALL
          -- arrivals: flights per US destination airport
          SELECT a.airport, COUNT(*) AS cnt
          FROM flights_stats f
               INNER JOIN airports a ON f.Dest = a.iata AND a.country = 'USA'
          WHERE Cancelled = 0 AND Month IN (3,4)
          GROUP BY a.airport
         ) s
    -- final aggregation over the two pre-aggregated result sets
    GROUP BY s.airport
    ORDER BY Total_Flights DESC
    LIMIT 10
    ;
    

    Tune mapjoins and enable parallel execution:

    set hive.exec.parallel=true;
    set hive.auto.convert.join=true; -- lets Hive convert a common join to a map join automatically
    set hive.mapjoin.smalltable.filesize=25000000; -- size threshold in bytes below which the small table is loaded into memory
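
    To check whether the settings above take effect, one option is to run EXPLAIN on one of the subqueries and read the plan. This is only a sketch that reuses the table and column names from the query above. If the join was converted, the plan would normally show a Map Join Operator instead of a regular Join Operator, though the exact plan text depends on the Hive version and execution engine.

    EXPLAIN
    SELECT a.airport, COUNT(*) AS cnt
    FROM flights_stats f
         INNER JOIN airports a ON f.Origin = a.iata AND a.country = 'USA'
    WHERE Cancelled = 0 AND Month IN (3,4)
    GROUP BY a.airport
    ;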
    

    Use Tez and vectorization, and tune mapper and reducer parallelism: https://stackoverflow.com/a/48487306/2700344
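
    As a rough sketch of what such settings can look like: the property names below are standard Hive and Tez properties, but the numeric values are illustrative assumptions and need tuning for the actual cluster and data volume. Depending on the Hive version, vectorization may only apply to tables stored in certain formats such as ORC.

    set hive.execution.engine=tez; -- run the query on Tez instead of MapReduce
    set hive.vectorized.execution.enabled=true; -- process rows in batches on the map side
    set hive.vectorized.execution.reduce.enabled=true; -- same on the reduce side
    set tez.grouping.min-size=32000000; -- assumed value: lower bound in bytes for a grouped split
    set tez.grouping.max-size=67108864; -- assumed value: smaller splits mean more mappers
    set hive.exec.reducers.bytes.per.reducer=67108864; -- assumed value: fewer bytes per reducer means more reducers

    Lowering the split and bytes-per-reducer sizes increases parallelism, at the cost of more task startup overhead, so it is worth comparing runtimes before and after.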
