Hive clustered by on more than one column

执笔经年 2021-01-02 11:56

I understand that when a Hive table is CLUSTERED BY one column, Hive applies a hash function to that bucketing column and then puts that row of data into one of the

2 Answers
  • 2021-01-02 12:31
    1. Yes, the number of files will still be 32.
    2. The hash function will treat "continent,country" as a single string and use that as its input.

    Hope it helps!!

  • 2021-01-02 12:48

    In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There is a 0x7FFFFFFF mask in there too, but it is not important here.) The hash_function depends on the type of the bucketing column.

    For an INT it is easy: hash_int(i) == i. For example, if user_id were an INT and there were 10 buckets, we would expect all user_ids that end in 0 to be in bucket 1, all user_ids that end in 1 to be in bucket 2, and so on.

    For other datatypes it is a little trickier. In particular, the hash of a BIGINT is not the same as the BIGINT itself. The hash of a STRING or a complex datatype is some number derived from the value, but nothing humanly recognizable. For example, if user_id were a STRING, the user_ids in bucket 1 would probably not end in 0.

    In general, distributing rows based on the hash gives you an even distribution across the buckets.
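The INT case described above can be sketched in a few lines of Python. This is only an illustration of the formula as stated in this answer, not Hive's actual code: `bucket_index` is a hypothetical helper name, and the sketch covers only the INT hash, since the hashes for BIGINT, STRING and complex types are different. The index it returns is zero-based, so index 0 corresponds to what the answer calls "bucket 1".

```python
INT_MASK = 0x7FFFFFFF  # clears the sign bit so the result is never negative


def hash_int(value: int) -> int:
    """Hash of an INT column: the value itself, as stated in the answer."""
    return value


def bucket_index(hash_code: int, num_buckets: int) -> int:
    """Zero-based bucket index: (hash & 0x7FFFFFFF) mod num_buckets.

    Hypothetical helper for illustration; not a Hive API.
    """
    return (hash_code & INT_MASK) % num_buckets


# With 10 buckets, the last digit of a non-negative INT picks the bucket.
user_ids = [10, 21, 32, 140, 991]
assignments = {uid: bucket_index(hash_int(uid), 10) for uid in user_ids}
print(assignments)
```

With these inputs, user_ids 10 and 140 share index 0, and 21 and 991 share index 1, which matches the "ends in 0" and "ends in 1" grouping described in the answer.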

    ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
