How can we decide the total number of buckets for a Hive table?


Question


I am a bit new to Hadoop. As I understand it, buckets are a fixed number of partitions within a Hive table, and Hive uses the same number of reducers as the total number of buckets defined when creating the table. So can anyone tell me how to calculate the total number of buckets in a Hive table? Is there any formula for calculating the total number of buckets?


Answer 1:


Let's take a scenario where the table size is 2300 MB and the HDFS block size is 128 MB.

Now, divide 2300 / 128 = 17.96.

Next, remember that the number of buckets should always be a power of 2.

So we need to find n such that 2^n > 17.96, which gives n = 5.

So I am going to use 2^5 = 32 buckets.
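For illustration, here is the same arithmetic as a minimal Python sketch (the sizes are the assumed ones from the scenario above, not values read from HDFS):

import math

table_size_mb = 2300    # assumed table size from the scenario above
block_size_mb = 128     # assumed HDFS block size

ratio = table_size_mb / block_size_mb    # 2300 / 128 = 17.96...
n = math.ceil(math.log2(ratio))          # smallest n with 2**n >= ratio, here 5
num_buckets = 2 ** n
print(num_buckets)                       # 32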

Hope it will help some of you.




Answer 2:


From the documentation link:

In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a 0x7FFFFFFF in there too, but that's not that important.) The hash_function depends on the type of the bucketing column. For an int, it's easy: hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you an even distribution in the buckets.
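As a rough illustration of that expression for an int column, here is a simplified Python sketch (bucket indexes below are 0-based, while the quote above counts buckets from 1; Hive's actual hash functions for other types live in its Java code and differ from Python's hash):

def bucket_for_int(value, num_buckets):
    # For an int column, hash_int(i) == i, so the bucket is
    # (i & 0x7FFFFFFF) % num_buckets, matching the expression above.
    return (value & 0x7FFFFFFF) % num_buckets

for user_id in (10, 21, 32, 40):
    print(user_id, "-> bucket", bucket_for_int(user_id, 10))
# 10 -> bucket 0, 21 -> bucket 1, 32 -> bucket 2, 40 -> bucket 0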




Answer 3:


If you want to know how many buckets you should choose in your CLUSTERED BY clause, I believe it is good to choose a number that results in buckets that are at, or just below, your HDFS block size.

This should help you avoid having HDFS allocate blocks to files that are mostly empty.

Also, choose a number that is a power of two.

You can check your HDFS block size with:

hdfs getconf -confKey dfs.blocksize
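Combining this with the arithmetic from Answer 1, here is a hedged Python sketch that reads the block size via that command and suggests a power-of-two bucket count (suggest_buckets is my own helper name, and it assumes the hdfs CLI is on the PATH):

import math
import subprocess

def hdfs_block_size_bytes():
    # Shells out to the command shown above.
    out = subprocess.check_output(["hdfs", "getconf", "-confKey", "dfs.blocksize"])
    return int(out.strip())

def suggest_buckets(table_size_bytes):
    # Smallest power of two such that the average bucket is at or below one block.
    ratio = table_size_bytes / hdfs_block_size_bytes()
    return 2 ** max(0, math.ceil(math.log2(ratio)))

print(suggest_buckets(2300 * 1024 * 1024))   # 32 with a 128 MB block size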



Answer 4:


The optimal bucket number is (B * HashTableSize of Table) / Total Memory of Node, where B = 1.01.
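For a concrete feel, plugging assumed numbers into that formula as a Python sketch (both sizes below are invented purely for illustration):

import math

B = 1.01                      # overhead factor from the formula above
hash_table_size_gb = 100      # assumed in-memory hash table size of the table
node_memory_gb = 8            # assumed total memory of one node

optimal_buckets = (B * hash_table_size_gb) / node_memory_gb
print(math.ceil(optimal_buckets))   # 13, so each bucket's share fits on a node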



Source: https://stackoverflow.com/questions/30730567/how-can-we-decide-the-total-no-of-buckets-for-a-hive-table
