Unique Key generation in Hive/Hadoop

刺人心 2021-01-22 02:37

While selecting a set of records from a big-data Hive table, a unique key needs to be created for each record. In a sequential mode of operation it is easy to generate unique keys, but doing so across parallel tasks is not.

3 Answers
  •  终归单人心
    2021-01-22 03:33

    If for some reason you do not want to deal with UUIDs, then this solution (based on numeric values) does not require your parallel units to "talk" to each other or synchronize at all. It is therefore very efficient, but it does not guarantee that your integer keys will be continuous.

    If you have, say, N parallel units of execution, you know N, and each unit is assigned an ID from 0 to N-1, then each unit can generate integers that are unique across all units by striding by N:

    Unit #0:   0, N, 2N, 3N, ...
    Unit #1:   1, N+1, 2N+1, 3N+1, ...
    ...
    Unit #N-1: N-1, N+(N-1), 2N+(N-1), 3N+(N-1), ...
    
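    The striding scheme above can be sketched in plain Java (outside Hadoop; the number of units and their IDs are simulated here). No two units can collide, because every key from unit u is congruent to u modulo N:

    ```java
    import java.util.HashSet;
    import java.util.Set;

    public class StridedKeys {
        // First `count` keys for one unit: unitId, unitId + N, unitId + 2N, ...
        static long[] keysFor(int unitId, int numUnits, int count) {
            long[] keys = new long[count];
            for (int i = 0; i < count; i++) {
                keys[i] = unitId + (long) i * numUnits;
            }
            return keys;
        }

        public static void main(String[] args) {
            int n = 4; // simulated number of parallel units
            Set<Long> all = new HashSet<>();
            for (int unit = 0; unit < n; unit++) {
                for (long k : keysFor(unit, n, 1000)) {
                    if (!all.add(k)) {
                        throw new IllegalStateException("duplicate key " + k);
                    }
                }
            }
            System.out.println("generated " + all.size() + " unique keys");
        }
    }
    ```

    Note that the keys are unique but not continuous: any unit that processes fewer rows than the others leaves gaps in its residue class.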

    Depending on where you need to generate the keys (mapper or reducer), you can get your N from the Hadoop configuration:

    Mapper:  mapred.map.tasks
    Reducer: mapred.reduce.tasks

    (On newer Hadoop versions these properties are named mapreduce.job.maps and mapreduce.job.reduces.)
    

    ... and the ID of your unit; in Java it is:

     context.getTaskAttemptID().getTaskID().getId()
    

    Not sure about Hive, but it should be possible as well.
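    Putting the two pieces together, a per-task generator is just a counter; a minimal sketch (the class name is hypothetical, and taskId / numTasks would come from the Hadoop context and configuration as shown above):

    ```java
    public class UniqueKeyGenerator {
        private final int taskId;   // e.g. context.getTaskAttemptID().getTaskID().getId()
        private final int numTasks; // e.g. mapred.map.tasks or mapred.reduce.tasks
        private long row = 0;       // rows emitted so far by this task

        public UniqueKeyGenerator(int taskId, int numTasks) {
            this.taskId = taskId;
            this.numTasks = numTasks;
        }

        // Called once per record; yields taskId, taskId + N, taskId + 2N, ...
        public long nextKey() {
            return taskId + (row++) * (long) numTasks;
        }
    }
    ```

    Each task keeps its own counter, so no coordination between tasks is needed; the resulting keys are globally unique but not gap-free, exactly as described above.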
