Is there a way to identify or detect data skew in a Hive table?

迷失自我 2021-01-17 02:29

We have many Hive queries that take a long time to run. We are using Tez and following other good practices, such as enabling the CBO and storing data in ORC files.

Is there a way to check for or analyze data skew?

1 Answer
  • 2021-01-17 02:41

    The explain plan will not help with this; you need to check the data itself. For a join, select the top 100 join key values by row count from each table involved in the join. For an analytic function, do the same for the PARTITION BY key. If a few key values account for far more rows than the rest, the data is skewed.

    Example:

    select key, count(*) cnt
      from my_table  -- placeholder name; "table" itself is a reserved word
     group by key
    having count(*) > 1000  -- also check > 1 for tables that should have no duplicate keys (like dimensions)
     order by cnt desc limit 100;
    

    Here key can be a composite join key (all the columns you use in the join ON condition).
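
    As a rough sketch of the composite-key case, the same check can group by every column that appears in the ON condition. The table name orders and the columns customer_id and order_date are hypothetical, chosen only for illustration; substitute the real join columns.

    ```sql
    -- Hypothetical table and columns: replace with the columns from your join ON clause.
    select customer_id, order_date, count(*) cnt
      from orders
     group by customer_id, order_date
     order by cnt desc
     limit 100;
    ```

    Run it against each table in the join. Key combinations that rank near the top on both sides are the ones most likely to overload a single reducer.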

    Also have a look at this answer: https://stackoverflow.com/a/51061613/2700344
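
    If a single number is easier to compare across tables than a top-100 list, the per-key counts can be summarized as below. This is only a sketch: my_table and key are placeholders, and skew_ratio is an ad hoc measure introduced here, not a Hive feature. A ratio close to 1 suggests an even distribution, while a large ratio means at least one key is much heavier than average.

    ```sql
    -- Placeholder names: my_table and key stand in for the real table and join key.
    select max(cnt)            as max_rows_per_key,
           avg(cnt)            as avg_rows_per_key,
           max(cnt) / avg(cnt) as skew_ratio  -- ad hoc measure, not built into Hive
      from (select key, count(*) cnt
              from my_table
             group by key) t;
    ```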
