Hive clustered by on more than one column

后端 未结 2 1223
执笔经年
执笔经年 2021-01-02 11:56

I understand that when the hive table has clustered by on one column, then it performs a hash function of that bucketed column and then puts that row of data into one of the

2条回答
  •  孤街浪徒
    2021-01-02 12:48

    In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.

    ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

提交回复
热议问题