问题
I need to insert data to a given external table which should be partitioned by the inserted date. My question is how is Hive handling the timestamp generation? When I select a timestamp for all inserted records like this:
WITH delta_insert AS (
SELECT trg.*, from_unixtime(unix_timestamp()) AS generic_timestamp
FROM target_table trg
)
SELECT *
FROM delta_insert;
Will the timestamp always be identical for all records, even if the query takes a lot of time to un?
Or should I alternatively only select an actual timestamp and join it with everything that is selected afterwards?
WITH insert_timestamp AS (
SELECT from_unixtime(unix_timestamp()) AS insert_timestamp
),
delta_insert AS (
SELECT trg.*, insert_timestamp.insert_timestamp
FROM target_table trg, insert_timestamp
)
SELECT *
FROM delta_insert;
Since it's not recommended to do cross joins in Hive, I wonder what would be the best approach, since I don't have enough test data to simulate long running queries to be sure they get inserted within the same partition of a delta load.
回答1:
No, the timestamp generated using unix_timestamp() WILL NOT always be identical for all records. This function is non-deterministic and prevents proper optimization of queries - this has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP constant (calculated only one time for the query). See this docs: Date functions
This not applies to unix_timestamp(string date)
with argument. unix_timestamp(string date)
function is deterministic.
current_timestamp
returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value.
If you need date, use current_date
function.
来源:https://stackoverflow.com/questions/58078690/dynamic-partitioning-in-hive-through-the-exact-inserted-timestamp