Hive clustered by on more than one column

后端未结

关注

 2  1225

I understand that when the hive table has clustered by on one column, then it performs a hash function of that bucketed column and then puts that row of data into one of the

相关标签:

2条回答

感情败类

2021-01-02 12:31
1. Yes the number of files will still be 32.
2. Hash function will operate by considering "continent,country" as a single string and then will use this as input.
Hope it helps!!
0 讨论(0)
发布评论:

提交评论
- 加载中...
孤街浪徒

2021-01-02 12:48

In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.

ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

0 讨论(0)
发布评论:

提交评论
- 加载中...