I am reading a big file (more than a billion records) and joining it with three other files. I was wondering if there is any way the process can be made more efficient to avo
Pig 0.10 introduced integration with Bloom filters: http://search-hadoop.com/c/Pig:/src/org/apache/pig/builtin/Bloom.java%7C%7C+%2522done+%2522exec+Tuple%2522
You can train a Bloom filter on the 3 smaller files and use it to filter the big file; hopefully that will result in a much smaller file. After that, perform standard joins to get 100% precision.
UPDATE 1: You would actually need to train 2 Bloom filters, one for each of the small tables, since you join on different keys.
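For reference, the usage pattern for the built-in BuildBloom/Bloom UDFs looks roughly like the sketch below (one filter per join key). The file names, schema and BuildBloom constructor arguments are placeholders, so check them against the javadoc for your Pig version:

    -- Step 1: train a filter on one small table's join key (repeat per key).
    -- BuildBloom arguments are assumed to be: hash type, expected number of
    -- keys, desired false-positive rate -- verify against the javadoc.
    DEFINE bb BuildBloom('jenkins', '1000000', '0.01');

    small1  = LOAD 'small1' AS (f1, v1);
    grouped = GROUP small1 ALL;                       -- collect all keys in one bag
    filter1 = FOREACH grouped GENERATE bb(small1.f1); -- serialized Bloom filter
    STORE filter1 INTO 'bloom_f1';

    -- Step 2: run after the filter file exists (e.g. after an exec), then
    -- use it to shrink the big file before the exact join.
    DEFINE bloomf1 Bloom('bloom_f1');
    big    = LOAD 'big_file' AS (f1, f2, f3, f4, payload);
    big_f1 = FILTER big BY bloomf1(f1);               -- drops most non-matching rows
    joined = JOIN big_f1 BY f1, small1 BY f1;         -- exact join removes false positives
    STORE joined INTO 'joined_f1';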
UPDATE 2: It was mentioned in the comments that the outer join is used to augment the data. In this case Bloom filters might not be the best fit: they are good for filtering, not for adding data in outer joins, since you want to keep the non-matched records. A better approach would be to partition all the small tables on their respective fields (f1, f2, f3, f4) and store each partition in a separate file small enough to load into memory. Then GROUP the massive table BY (f1, f2, f3, f4) and, in a FOREACH, pass the group key (f1, f2, f3, f4) with the associated bag to a custom function written in Java that loads the respective partitions of the small files into RAM and performs the augmentation.
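A rough Pig-side sketch of that idea is below. The UDF name com.example.AugmentGroup, its constructor arguments (paths to the pre-partitioned small tables), the jar name and the schemas are all made up for illustration; the actual augmentation logic would live in the Java class, which looks up the matching partitions and emits augmented tuples:

    REGISTER augment-udf.jar;  -- jar containing the hypothetical Java UDF

    -- Constructor arguments point at the pre-partitioned small tables,
    -- each partition small enough to be loaded into memory inside the UDF.
    DEFINE AUGMENT com.example.AugmentGroup('/data/small1_parts',
                                            '/data/small2_parts',
                                            '/data/small3_parts');

    big = LOAD 'big_file' AS (f1, f2, f3, f4, payload);

    -- One group per combination of join keys; the UDF receives the key tuple
    -- plus the bag of big-table rows for that key, loads only the matching
    -- partitions of the small tables and returns the augmented rows.
    grp = GROUP big BY (f1, f2, f3, f4);
    out = FOREACH grp GENERATE FLATTEN(AUGMENT(group, big));

    STORE out INTO 'augmented';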