Given a query, select * from ... (that might be part of a CTAS statement), the goal is to add an additional column, ID, where ID is a unique value per row (it does not have to be sequential).
Check this solution from Manoj Kumar: https://github.com/manojkumarvohra/hive-hilo
Usage:
FunctionName( sequenceName, lowvalue[optional], seedvalue[optional])
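A minimal sketch of how it could be called once the UDF is registered — the function name next_value and the sequence name my_seq are placeholders, not the actual names; see the repo README for the real class name and registration step:

select  next_value('my_seq') as id
       ,t.*
from    t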
Check out this globally unique id service: https://github.com/spinaki/distributed-unique-id. It also has a Docker image that you can test quickly.
hive
set mapred.reduce.tasks=1000;
set hivevar:buckets=10000;
hivevar:buckets should be high enough relative to the number of reducers (mapred.reduce.tasks) so that the rows are evenly distributed across the reducers.
select  1 + x + (row_number() over (partition by x) - 1) * ${hivevar:buckets} as id
       ,t.*
from   (select  t.*
               ,abs(hash(rand())) % ${hivevar:buckets} as x
        from    t
       ) t
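Since the query may be part of a CTAS statement, the same select can be wrapped directly; t_with_id below is just a placeholder table name:

create table t_with_id as
select  1 + x + (row_number() over (partition by x) - 1) * ${hivevar:buckets} as id
       ,t.*
from   (select  t.*
               ,abs(hash(rand())) % ${hivevar:buckets} as x
        from    t
       ) t
;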
spark-sql
select  1 + x + (row_number() over (partition by x) - 1) * 10000 as id
       ,t.*
from   (select  t.*
               ,abs(hash(rand())) % 10000 as x
        from    t
       ) t
For both hive and spark-sql, rand() is used to generate a good distribution.
If your query already has a column or combination of columns with a good distribution (it might be unique, but that is not a must), you can use it instead, e.g.:
select  1 + (abs(hash(col1, col2)) % 10000)
          + (row_number() over (partition by abs(hash(col1, col2)) % 10000) - 1) * 10000 as id
       ,t.*
from    t
If you are using Spark SQL, your best bet would be to use the built-in function monotonically_increasing_id, which generates a unique (but not consecutive) id in a separate column. And as you said you don't need it to be sequential, this should satisfy your requirement.
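For example, in Spark SQL syntax matching the queries above:

select  monotonically_increasing_id() as id
       ,t.*
from    t

The generated ids are 64-bit integers that are guaranteed to be unique but not consecutive; the partition id is encoded in the upper bits, so there can be large gaps between values.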