Given a query, select * from ... (that might be part of a CTAS statement), the goal is to add an additional column, ID, where ID is a unique value per row (it does not have to be sequential).
Check this solution from Manoj Kumar: https://github.com/manojkumarvohra/hive-hilo
Usage:
FunctionName( sequenceName, lowvalue[optional], seedvalue[optional])
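A minimal sketch of how it could be called once the UDF is registered — the function name next_value and the sequence name my_seq are placeholders, not the actual names; see the repo README for the real class name and registration step:

select  next_value('my_seq') as id
       ,t.*
from    t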
Check out this globally unique id service: https://github.com/spinaki/distributed-unique-id. It also has a Docker image that you can test quickly.
hive
set mapred.reduce.tasks=1000;
set hivevar:buckets=10000;
hivevar:buckets should be high enough relative to the number of reducers (mapred.reduce.tasks) so that the rows are evenly distributed across the reducers.
select  1 + x + (row_number() over (partition by x) - 1) * ${hivevar:buckets} as id
       ,t.*
from   (select  t.*
               ,abs(hash(rand())) % ${hivevar:buckets} as x
        from    t
       ) t
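Since the query may be part of a CTAS statement, the same select can be wrapped directly; t_with_id below is just a placeholder table name:

create table t_with_id as
select  1 + x + (row_number() over (partition by x) - 1) * ${hivevar:buckets} as id
       ,t.*
from   (select  t.*
               ,abs(hash(rand())) % ${hivevar:buckets} as x
        from    t
       ) t
;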
spark-sql
select  1 + x + (row_number() over (partition by x) - 1) * 10000 as id
       ,t.*
from   (select  t.*
               ,abs(hash(rand())) % 10000 as x
        from    t
       ) t
For both hive and spark-sql, rand() is used to generate a good distribution.
If your query already has a column or combination of columns with a good distribution (it might be unique, but that is not a must), you can use it instead, e.g.:
select  1 + (abs(hash(col1, col2)) % 10000)
          + (row_number() over (partition by abs(hash(col1, col2)) % 10000) - 1) * 10000 as id
       ,t.*
from    t
If you are using Spark SQL, your best bet would be to use the built-in function monotonically_increasing_id, which generates a unique (but not consecutive) id in a separate column. And as you said you don't need it to be sequential, this should satisfy your requirement.
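For example, in Spark SQL syntax matching the queries above:

select  monotonically_increasing_id() as id
       ,t.*
from    t

The generated ids are 64-bit integers that are guaranteed to be unique but not consecutive; the partition id is encoded in the upper bits, so there can be large gaps between values.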