Question
I found that Hive does not support non-equi joins. Is it just because it is difficult to convert a non-equi join to MapReduce?
Answer 1:
Yes, the problem is in the current MapReduce implementation.
How is a common equi-join implemented in MapReduce?
Input records are copied in chunks to the mappers. The mappers produce output as key-value pairs, which are collected and distributed among the reducers by a partitioning function in such a way that each reducer processes a key in its entirety. In other words, each mapper creates a list of key-value pairs for each reducer, grouped by key. The reducers copy the mappers' output and sort it to get <key, list of values>. The same is done for both datasets. Then the reducer applies a cross product to the two lists with equal keys. This is how the equi-join is implemented.
The main idea is that tuples with the same join key are sent to the same reducer instance and processed there. This is easy to implement because the key itself determines which reducer will process it (the computation is based on key equality), and each reducer instance receives its dedicated key list from both datasets; no other reducer works with the same keys.
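The flow described above can be sketched in a few lines of plain Python. This is only an in-memory simulation, not how Hive or Hadoop actually implement it: the function names `partition` and `equi_join` and the `num_reducers` parameter are hypothetical, chosen for illustration.

```python
from collections import defaultdict
from itertools import product


def partition(key, num_reducers):
    # Hypothetical partitioner: the key alone decides which reducer gets the record.
    return hash(key) % num_reducers


def equi_join(a_rows, b_rows, num_reducers=3):
    # "Map + shuffle": tag each record with its source dataset and route it by key.
    reducers = [defaultdict(lambda: {"A": [], "B": []}) for _ in range(num_reducers)]
    for tag, rows in (("A", a_rows), ("B", b_rows)):
        for key, value in rows:
            reducers[partition(key, num_reducers)][key][tag].append(value)

    # "Reduce": for every key, take the cross product of its A list and B list.
    joined = []
    for groups in reducers:
        for key, lists in groups.items():
            for a_val, b_val in product(lists["A"], lists["B"]):
                joined.append((key, a_val, b_val))
    return sorted(joined)
```

Every record is shuffled exactly once, and each reducer can finish its keys without looking at any other reducer's data. That is the property a non-equi condition breaks.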
Consider a non-equi-join: for example, we need to join datasets A and B on the condition A.key <= B.key. In this case the reducer should receive not only the tuples with equal keys from both datasets, but also, for each B.key, all A tuples with a key less than B.key. This is difficult to implement using the same key-equality paradigm.
If the reducer receives, for each B.key, all A tuples with A.key <= B.key, this causes huge duplication of data on the reducers. For example, if we have A keys (1, 2, 3) and B keys (1, 2, 3), then for B.3 we need [A.1, A.2, A.3] and for B.2 we need [A.1, A.2]. In other words, the mappers need to emit a copy of each tuple for every key it can match, so the lists produced for different keys overlap. The more distinct keys we have, the bigger the duplication.
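To get a feel for how fast this duplication grows, here is a small sketch for the A.key <= B.key condition. It assumes the naive scheme in which every distinct B key acts as a reducer key and every matching A tuple is copied to it. The helper names are hypothetical, and the sketch ignores the further problem that a real mapper would not know the set of B keys in advance.

```python
from collections import defaultdict


def replicate_for_leq_join(a_keys, b_keys):
    # For each distinct B key, collect every A key that satisfies A.key <= B.key.
    groups = defaultdict(list)
    for b in sorted(set(b_keys)):
        for a in a_keys:
            if a <= b:
                groups[b].append(a)
    return dict(groups)


def shuffled_a_records(a_keys, b_keys):
    # Total number of A records that would have to be sent to reducers.
    return sum(len(v) for v in replicate_for_leq_join(a_keys, b_keys).values())


print(replicate_for_leq_join([1, 2, 3], [1, 2, 3]))
# {1: [1], 2: [1, 2], 3: [1, 2, 3]}
print(shuffled_a_records(range(1, 101), range(1, 101)))
# 5050
```

With three keys on each side, the 3 original A records become 6 shuffled records. With 100 distinct keys on each side, 100 records become 5050, so under this scheme the shuffled volume grows roughly quadratically with the number of distinct keys.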
Read this paper for a deep dive into the problems and possible solutions: "Processing Theta-Joins using MapReduce".
Source: https://stackoverflow.com/questions/64236121/why-hive-can-not-support-non-equi-join