How to add a unique integer id to query results - __efficiently__?

面向向阳花 2021-01-25 17:08

Given a query, `select * from ...` (which might be part of a CTAS statement), the goal is to add an additional column, ID, where ID is a unique integer. The ids do not need to be sequential.

4 Answers
  • 2021-01-25 17:45

    Check this solution from Manoj Kumar: https://github.com/manojkumarvohra/hive-hilo

    • A stateful UDF is created which maintains HI/LO counters to increment the sequences.
    • The HI value is maintained as a distributed atomic long in ZooKeeper.
    • The HI value is incremented and fetched once every n LO iterations (default 200).
    • The UDF takes a single String argument: the sequence name, used to maintain the zNodes in ZooKeeper.

    Usage:

    FunctionName( sequenceName, lowvalue[optional], seedvalue[optional])
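
    For illustration, here is a minimal sketch of registering and calling such a UDF from Hive. The jar path, class name, and function name below are placeholders, not the repo's actual identifiers; check the project's README for the real ones.

    -- hypothetical registration; the jar path and class name are placeholders
    ADD JAR /path/to/hive-hilo.jar;
    CREATE TEMPORARY FUNCTION next_id AS 'com.example.hilo.HiLoUDF';

    -- each call returns the next value of the named sequence
    select  next_id('my_sequence') as id
           ,t.*
    from    t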
    
  • 2021-01-25 17:54

    Check out this globally unique id service: https://github.com/spinaki/distributed-unique-id. It also has a Docker image that you can test quickly.

  • 2021-01-25 17:59

    hive

    set mapred.reduce.tasks=1000;  -- number of reducers
    set hivevar:buckets=10000;     -- number of buckets used to spread the row_number() work
    

    hivevar:buckets should be high enough relative to the number of reducers (mapred.reduce.tasks) so that the rows are evenly distributed among the reducers.


    select  1 + x + (row_number() over (partition by x) - 1) * ${hivevar:buckets}  as id
           ,t.*
    
    from   (select  t.*
                   ,abs(hash(rand())) % ${hivevar:buckets} as x      
    
            from    t
            ) t
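
    To see why these ids never collide: every id generated in bucket x is congruent to (1 + x) modulo ${hivevar:buckets}, and row_number() is unique within each bucket, so different buckets occupy disjoint residue classes. A small worked example with buckets = 10000:

    -- bucket x = 7 yields ids 1+7+0*10000 = 8, then 10008, 20008, ...
    -- bucket x = 8 yields ids 9, 10009, 20009, ...
    -- every id from bucket x satisfies id % 10000 = (1 + x) % 10000, so buckets cannot collide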
    

    spark-sql

    select  1 + x + (row_number() over (partition by x) - 1) * 10000  as id
           ,t.*
    
    from   (select  t.*
                   ,abs(hash(rand())) % 10000 as x      
    
            from    t
            ) t
    

    For both hive and spark-sql

    The rand() is used to generate a good distribution.
    If your query already has a column, or combination of columns, with a good distribution (it might be unique, but that is not a must), you can use it instead, e.g.:

    select    1 + (abs(hash(col1, col2)) % 10000) 
            + (row_number() over (partition by abs(hash(col1, col2)) % 10000) - 1) * 10000  as id
           ,t.*
    
    from    t
    
  • 2021-01-25 18:03

    If you are using Spark SQL, your best bet would be to use the built-in function

    monotonically_increasing_id

    which generates a unique 64-bit id in a separate column; the ids are increasing but not consecutive. Since you said you don't need the ids to be sequential, this should satisfy your requirement.
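
    A minimal sketch in Spark SQL; the same function is also available through the DataFrame API:

    -- one unique 64-bit id per row; ids are increasing but not consecutive
    select  monotonically_increasing_id() as id
           ,t.*
    from    t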
