Question
I have an HDFS file with 50 million records and a raw file size of 50 GB.
I am trying to load this into a Hive table and create a unique id for every row while loading, using the expression below. I am using Hive 1.1.0-cdh5.16.1.
row_number() over(order by event_id, user_id, timestamp) as id
While executing, I see that 40 reducers are assigned in the reduce step. The average time for 39 of the reducers is about 2 minutes, whereas the last reducer takes about 25 minutes, which strongly suggests that most of the data is being processed in one reducer.
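For context, the load statement looks roughly like the sketch below (the source/target table names and the remaining column list are assumptions, not the actual schema):

```sql
-- Illustrative sketch only; table and column names are assumed
INSERT OVERWRITE TABLE events_with_id
SELECT
  row_number() OVER (ORDER BY event_id, user_id, `timestamp`) AS id,
  event_id,
  user_id,
  `timestamp`
FROM events_raw;
```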
I suspected the ORDER BY clause to be the reason for this behavior and tried the following:
row_number() over() as id
Yet, I see the same behavior.
Thinking about the MapReduce paradigm, it seems that if we do not specify a PARTITION BY clause, the data has to be processed in a single reducer (un-distributed) so that it can see all rows and attach the correct row number. This would be true for any window function with no PARTITION BY clause, or with PARTITION BY on a skewed column.
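To illustrate the trade-off: adding a PARTITION BY does distribute the work across reducers, but it changes the semantics, because the numbering restarts within each partition (column names here are assumed):

```sql
-- Distributed across reducers, but ids restart at 1 for every user_id,
-- so this is not a global unique id
row_number() OVER (PARTITION BY user_id ORDER BY event_id, `timestamp`) AS id
```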
Now, my question is: how do we circumvent this problem and optimize window functions when we have to avoid the PARTITION BY clause?
Answer 1:
You can use UUID:
select java_method('java.util.UUID','randomUUID')
A UUID generated in your system/workflow will also be unique in any other system, because UUIDs are globally unique. UUID generation works fully distributed and is fast.
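Applied to the load, this might look like the sketch below (table and column names are assumed). Each mapper generates its ids independently, so no reduce step is needed:

```sql
-- Sketch: globally unique id per row, generated map-side
INSERT OVERWRITE TABLE events_with_id
SELECT
  java_method('java.util.UUID', 'randomUUID') AS id,
  event_id,
  user_id,
  `timestamp`
FROM events_raw;
```

In Hive, java_method() is a synonym for the reflect() UDF, so either name can be used.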
Also, Hive 3.x has a SURROGATE_KEY function which you can use in the DDL.
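A minimal Hive 3.x sketch, assuming a transactional (ACID) managed table; this is not available in Hive 1.1, and all names are illustrative:

```sql
-- Hive 3.x only: SURROGATE_KEY() as a column default
CREATE TABLE events_with_id (
  id BIGINT DEFAULT SURROGATE_KEY(),
  event_id STRING,
  user_id STRING,
  event_ts TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Omit id from the insert column list so the default is applied
INSERT INTO events_with_id (event_id, user_id, event_ts)
SELECT event_id, user_id, event_ts FROM events_raw;
```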
Answer 2:
@leftjoin's suggestion worked like a charm (BIG THANK YOU!) for this use case. It did not involve a reduce step, and the job completed in less than 3 minutes. I tested it, and it is indeed producing unique IDs. I will check the underlying code, since it is very intriguing that it can produce unique IDs even with 500+ mappers.
Since I am using Hive 1.1, I could not try SURROGATE_KEY.
Unfortunately, @Strick's suggestion did not work, but thanks for sharing. Using CLUSTER BY did not produce unique IDs: all rows were tagged with 1, since my CLUSTER BY clause contained the natural key. SORT BY behaved like ORDER BY in both results and performance (32 minutes to complete). Perhaps the data is funneled through one reducer, which would mean SORT BY in this case is equivalent to ORDER BY (I am not sure, though).
I am still looking for a solution for window functions that have no PARTITION BY clause but should still be distributed.
Answer 3:
Instead of ORDER BY, try SORT BY or CLUSTER BY.
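Hive allows DISTRIBUTE BY/SORT BY (or CLUSTER BY) inside the window specification; the sketch below shows the shape of this suggestion (column names are assumed). Note that, per the follow-up in Answer 2, this did not produce globally unique ids in this case:

```sql
-- Shape of the suggestion; numbering is only well-defined
-- within each reducer's share of the data
row_number() OVER (DISTRIBUTE BY user_id SORT BY event_id, `timestamp`) AS id
-- or, when the same columns are used for both distribution and ordering:
row_number() OVER (CLUSTER BY user_id) AS id
```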
Source: https://stackoverflow.com/questions/58624633/hive-window-function-row-number-without-partition-by-clause-on-a-large-50-gb-d