Question
For implementing surrogate keys in our Hive data warehouse, I have narrowed it down to two options:
1) reflect('java.util.UUID','randomUUID')
2) INPUT__FILE__NAME + BLOCK__OFFSET__INSIDE__FILE
Which of the two is the better option to go with?
Or would you suggest an even better one?
Thank you.
Answer 1:
For ORC and sequence files, BLOCK__OFFSET__INSIDE__FILE is not unique within a file: the official documentation says that it is the file offset of the current block's first byte, so every row in the same block gets the same value.
Some resources on the Internet say that BLOCK__OFFSET__INSIDE__FILE is unique inside text files. Even if this is true, why limit yourself to text files only?
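If you want to check this against your own data rather than rely on the documentation, a query along the following lines should list any file/offset pair that is shared by more than one row. This is only a sketch: my_table is a hypothetical table name, and the virtual columns are aliased in a subquery so that the outer GROUP BY operates on ordinary columns.

```sql
-- Sketch: my_table is a placeholder for one of your own tables.
-- Any row returned here means option 2 would produce duplicate keys.
SELECT file_name, block_offset, COUNT(*) AS rows_sharing_key
FROM (
  SELECT INPUT__FILE__NAME           AS file_name,
         BLOCK__OFFSET__INSIDE__FILE AS block_offset
  FROM my_table
) t
GROUP BY file_name, block_offset
HAVING COUNT(*) > 1;
```

An empty result only shows that the current files happen to have no collisions; it does not guarantee uniqueness for other file formats or future loads.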
Also, a UUID does not depend on input files. It can be calculated after some transformation, or in a workflow that reads a Kafka topic with no files involved at all, and so on.
UUIDs generated in some other system are also unique in your system, because UUIDs are globally unique.
UUIDs are also always the same length and do not depend on the depth of the directory structure, whereas INPUT__FILE__NAME contains the full file path, which varies in length and makes the file name unique only within the same filesystem.
This is why UUID is the preferable solution.
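As a minimal sketch of the UUID option from the question: source_events and events_with_key are hypothetical table names, and the key is written to a new table so that it is stored once instead of being regenerated on every read, since reflect('java.util.UUID','randomUUID') returns a new value each time it is evaluated.

```sql
-- Sketch: source_events and events_with_key are placeholder names.
-- reflect() calls java.util.UUID.randomUUID() once per row and returns
-- the result as a 36-character string.
CREATE TABLE events_with_key AS
SELECT
  reflect('java.util.UUID', 'randomUUID') AS event_key,
  e.*
FROM source_events e;
```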
Answer 2:
I would use the built-in SURROGATE_KEY UDF. It has advantages over UUID: the function automatically generates numerical IDs for your rows as you insert data into the table, and it can perform faster than UUID.
Example:
1) Create a students table in the default ORC format that has ACID properties.
CREATE TABLE students (row_id INT, name VARCHAR(64), dorm INT);
2) Insert data into the table. For example:
INSERT INTO TABLE students VALUES (1, 'fred flintstone', 100), (2, 'barney rubble', 200);
3) Create a version of the students table using the SURROGATE_KEY UDF.
CREATE TABLE students_v2
(`ID` BIGINT DEFAULT SURROGATE_KEY(),
row_id INT,
name VARCHAR(64),
dorm INT,
PRIMARY KEY (ID) DISABLE NOVALIDATE);
4) Insert data, which automatically generates surrogate keys for the primary keys.
INSERT INTO students_v2 (row_id, name, dorm) SELECT * FROM students;
5) View the surrogate keys.
SELECT * FROM students_v2;
6) Add the surrogate keys as a foreign key to another table, such as a student_grades table, to speed up subsequent joins of the tables.
ALTER TABLE student_grades ADD COLUMNS (gen_id BIGINT);
MERGE INTO student_grades g USING students_v2 s ON g.row_id = s.row_id
WHEN MATCHED THEN UPDATE SET gen_id = s.id;
7) Perform fast joins on the surrogate keys.
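The documentation example stops at this step without showing a query, so here is a sketch of what such a join might look like. It uses the id and gen_id columns from the steps above; the grade column is an assumption, since the layout of student_grades is not given.

```sql
-- Sketch: grade is an assumed column of student_grades.
-- The join uses the BIGINT surrogate key instead of the original row_id.
SELECT s.name, s.dorm, g.grade
FROM students_v2 s
JOIN student_grades g
  ON g.gen_id = s.id;
```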
(Note: this example is copied from the Hortonworks documentation. I am adding it here so that we still have an example to refer to even if the link is removed.)
There are other ways to add a surrogate key to your table as well. Here is a good thread discussing them:
https://community.hortonworks.com/idea/8619/how-do-we-create-surrogate-keys-in-hive.html
Source: https://stackoverflow.com/questions/55104138/is-it-okay-to-use-uuid-as-surrogate-key-for-a-datawarehouse-in-hive