Pig Script: Join with multiple files

别那么骄傲 asked 2021-01-15 23:47

I am reading a big file (more than a billion records) and joining it with three other files. I was wondering whether there is any way the process can be made more efficient to avoid …

1 Answer
  • 2021-01-16 00:11

    Pig 0.10 introduced integration with Bloom Filters http://search-hadoop.com/c/Pig:/src/org/apache/pig/builtin/Bloom.java%7C%7C+%2522done+%2522exec+Tuple%2522

    You can train a Bloom filter on the three smaller files and use it to filter the big file; that should leave a much smaller file. After that, perform the standard joins to get 100% precision.
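
    A minimal sketch of that pattern with a single small table, based on the BuildBloom/Bloom builtins from the link above; the file names, the key name f1, and the filter parameters are assumptions:

        -- Build a Bloom filter over the join key of the small table.
        -- Constructor arguments (hash type, expected elements, false-positive rate)
        -- are placeholders; check the BuildBloom javadoc for your Pig version.
        DEFINE bb BuildBloom('jenkins', '1000000', '0.01');

        small = LOAD 'small1' AS (f1, v);
        grpd  = GROUP small ALL;
        fltr  = FOREACH grpd GENERATE bb(small.f1);
        STORE fltr INTO 'mybloom';
        exec;   -- force execution so 'mybloom' exists before it is read below

        -- Use the stored filter to shrink the big relation before the real join.
        DEFINE bloom Bloom('mybloom');
        big    = LOAD 'big' AS (f1, f2, f3, f4, payload);
        big_sm = FILTER big BY bloom(f1);

        -- A standard join on the reduced relation is still exact: the Bloom filter
        -- can only let extra rows through, it never drops a matching one.
        joined = JOIN big_sm BY f1, small BY f1;
        STORE joined INTO 'joined_out';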

    UPDATE 1: You would actually need to train two Bloom filters, one for each of the small tables, since you join on different keys.
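
    Under the same assumptions, with two small tables joined on different keys (say f1 and f2), you would build and store one filter per table and apply both before the joins:

        DEFINE bb1 BuildBloom('jenkins', '1000000', '0.01');
        DEFINE bb2 BuildBloom('jenkins', '1000000', '0.01');

        s1 = LOAD 'small1' AS (f1, v1);
        s2 = LOAD 'small2' AS (f2, v2);
        g1 = GROUP s1 ALL;
        b1 = FOREACH g1 GENERATE bb1(s1.f1);
        STORE b1 INTO 'bloom_f1';
        g2 = GROUP s2 ALL;
        b2 = FOREACH g2 GENERATE bb2(s2.f2);
        STORE b2 INTO 'bloom_f2';
        exec;

        DEFINE bloom1 Bloom('bloom_f1');
        DEFINE bloom2 Bloom('bloom_f2');
        big    = LOAD 'big' AS (f1, f2, f3, f4, payload);
        -- For inner joins against both tables a row must pass both filters;
        -- switch AND to OR if a row only needs to match one of them.
        big_sm = FILTER big BY bloom1(f1) AND bloom2(f2);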

    UPDATE 2: It was mentioned in the comments that the outer join is used for augmenting data. In that case Bloom filters might not be the best fit: they are good for filtering, not for adding data in an outer join, where you want to keep the non-matching rows. A better approach would be to partition each small table on its respective field (f1, f2, f3, f4) and store each partition in a separate file small enough to load into memory. Then GROUP the massive table BY f1, f2, f3, f4 and, in a FOREACH, pass the group key (f1, f2, f3, f4) with its associated bag to a custom Java function that loads the respective partitions of the small files into RAM and performs the augmentation.
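
    A rough Pig-side sketch of that plan; the Augment UDF, its jar, and the 'small_parts/' partition layout are hypothetical stand-ins for the custom Java code described above:

        -- Hypothetical UDF: loads the partitions of the small tables that match
        -- the group key from 'small_parts/' into RAM and emits augmented rows.
        REGISTER 'augment-udf.jar';
        DEFINE Augment com.example.AugmentWithSmallTables('small_parts/');

        big  = LOAD 'big' AS (f1, f2, f3, f4, payload);
        grpd = GROUP big BY (f1, f2, f3, f4);

        -- Hand each group key plus its bag of big-table rows to the UDF,
        -- which returns a bag of augmented rows to be flattened back out.
        aug  = FOREACH grpd GENERATE FLATTEN(Augment(group, big));
        STORE aug INTO 'augmented';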
