Pig skewed join with a big table causes "Split metadata size exceeded 10000000"

Submitted by 给你一囗甜甜゛ on 2020-01-03 13:08:22

Question


We have a Pig join between a small (16M rows) distinct table and a big (6B rows) skewed table. A regular join finishes in 2 hours (after some tweaking). We tried a skewed join and were able to bring the runtime down to 20 minutes.

HOWEVER, when we try a bigger skewed table (19B rows), we get this message from the SAMPLER job:

Split metadata size exceeded 10000000. Aborting job job_201305151351_21573 [ScriptRunner]
at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:817) [ScriptRunner]

This is reproducible every time we use the skewed join, and does not happen when we use the regular join.

We tried setting mapreduce.jobtracker.split.metainfo.maxsize=-1, and we can see it is there in the job.xml file, but it doesn't change anything!

What's happening here? Is this a bug with the distribution sample created by the skewed join? Why doesn't changing the parameter to -1 help?


Answer 1:


A small table of 1MB is small enough to fit into memory, so try a replicated join. A replicated join is map-only: unlike other join types it has no reduce stage, so it is immune to skew in the join keys. It should be quick.

big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';

The big table is always the first one in the statement.

UPDATE 1: If the small table in its original form does not fit into memory, then as a workaround you would need to split it into partitions that are small enough to fit into memory, and then apply the same partitioning to the big table. Ideally, you could add the same partitioning algorithm to the system that creates the big table, so that you do not waste time repartitioning it. After partitioning, you can use a replicated join, but it will require running the Pig script for each partition separately.
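The per-partition approach could look roughly like the sketch below. It is only an illustration under stated assumptions: the join key is assumed to be a non-negative integer (so a simple modulo can act as the partitioning function), PART and NPARTS are hypothetical Pig parameters, and the script name join_part.pig and the output path are made up for the example.

```pig
-- Hypothetical script join_part.pig, run once per partition, e.g.:
--   pig -param PART=0 -param NPARTS=4 join_part.pig
-- Assumption: the join keys b1 and t1 are non-negative integers.
big  = LOAD 'big_data'  AS (b1:long, b2, b3);
tiny = LOAD 'tiny_data' AS (t1:long, t2, t3);

-- Apply the same partitioning rule to both tables so matching keys land in the same partition.
big_p  = FILTER big  BY (b1 % $NPARTS) == $PART;
tiny_p = FILTER tiny BY (t1 % $NPARTS) == $PART;

-- The slice of the small table must now fit into memory; the big table still comes first.
C = JOIN big_p BY b1, tiny_p BY t1 USING 'replicated';
STORE C INTO 'joined/part_$PART';
```

Note that filtering this way rereads the whole big table on every run, which is why partitioning it once at creation time, as suggested above, is preferable.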




Answer 2:


In newer versions of Hadoop (2.4.0 and later, possibly earlier ones too) you should be able to set the maximum split metadata size at the job level with the following configuration property:

mapreduce.job.split.metainfo.maxsize=-1
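From a Pig script, a job-level property like this can typically be passed with a SET statement or with -D on the command line. A minimal sketch, assuming a Hadoop version that honors the property at the job level; the script name myjoin.pig is hypothetical:

```pig
-- Assumption: the cluster's Hadoop version reads this property from the job configuration.
-- A value of -1 removes the limit on split metadata size.
SET mapreduce.job.split.metainfo.maxsize '-1';

-- Alternative, from the command line:
--   pig -Dmapreduce.job.split.metainfo.maxsize=-1 myjoin.pig
```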



Source: https://stackoverflow.com/questions/17163112/pig-skewed-join-with-a-big-table-causes-split-metadata-size-exceeded-10000000
