Question
I am using Spark SQL to process the data. Here is the query:
select
    /*+ BROADCAST(C) */
    A.party_id,
    IF(B.master_id is NOT NULL, B.master_id, 'MISSING_LINK') as master_id,
    B.is_matched,
    D.partner_name,
    A.partner_id,
    A.event_time_utc,
    A.funnel_stage_type,
    A.product_id_set,
    A.ip_address,
    A.session_id,
    A.tdm_retailer_id,
    C.product_name,
    CASE WHEN C.product_category_lvl_01 is NULL THEN 'OUTOFSALE' ELSE C.product_category_lvl_01 END as product_category_lvl_01,
    CASE WHEN C.product_category_lvl_02 is NULL THEN 'OUTOFSALE' ELSE C.product_category_lvl_02 END as product_category_lvl_02,
    CASE WHEN C.product_category_lvl_03 is NULL THEN 'OUTOFSALE' ELSE C.product_category_lvl_03 END as product_category_lvl_03,
    CASE WHEN C.product_category_lvl_04 is NULL THEN 'OUTOFSALE' ELSE C.product_category_lvl_04 END as product_category_lvl_04,
    C.brand_name
from browser_data A
INNER JOIN (select partner_name, partner_alias_tdm_id as npa_retailer_id from npa_retailer) D
    ON (A.tdm_retailer_id = D.npa_retailer_id)
LEFT JOIN (identity as B1 INNER JOIN (select random_val from random_distribution) B2) as B
    ON (A.party_id = B.party_id and A.random_val = B.random_val)
LEFT JOIN product_taxonomy as C
    ON (A.product_id = C.product_id and D.npa_retailer_id = C.retailer_id)
where:
browser_data A - around 110 GB of data with 519 million records.
D - a small dataset which maps to only one value in A. As it is small, Spark SQL broadcasts it automatically (confirmed in the execution plan from explain).
B - 5 GB with 45 million records, containing only 3 columns. This dataset is replicated 30 times (via a cartesian product with a dataset containing the values 0 to 29) so that the skewed-key issue (a lot of data in A against one key) is solved; a sketch of this salting idea follows this list. This results in 150 GB of data.
C - 900 MB with 9 million records. This is joined with A via a broadcast join (so no shuffle).
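(For context, a minimal sketch of the salting technique described for B above, assuming hypothetical DataFrames bigDf for the skewed side and smallDf for the small side, joined on party_id; the salt count of 30 and the column name random_val come from the question, everything else is illustrative.)

import org.apache.spark.sql.functions._

val numSalts = 30

// Skewed side: tag each row with a random salt in [0, 30).
val saltedBig = bigDf.withColumn("random_val", (rand() * numSalts).cast("int"))

// Small side: replicate every row once per salt value (cartesian product with 0..29).
val saltedSmall = smallDf.crossJoin(spark.range(numSalts).toDF("random_val"))

// Join on the original key plus the salt, so a single hot key is spread over 30 partitions.
val joined = saltedBig.join(saltedSmall, Seq("party_id", "random_val"), "left")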
The query above works well. But in the Spark UI I can see that it triggers a shuffle read of 6.8 TB. Since datasets D and C are joined via broadcast, they should not cause any shuffle, so only the join of A and B should cause one. Even if we assume all of that data is shuffled, the read should be limited to 110 GB (A) + 150 GB (B) = 260 GB. Why is it triggering 6.8 TB of shuffle read and 40 GB of shuffle write? Any help is appreciated. Thank you in advance.
Thank you
Manish
Answer 1:
The first thing I would do is use DataFrame.explain on it. That will show you the execution plan, so you can see exactly what is actually happening. I would check the output to confirm that the broadcast join is really happening; Spark has a setting that controls how big your data can be before it gives up on doing a broadcast join.
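(As a sketch of both checks: spark.sql.autoBroadcastJoinThreshold is the setting in question, its default is 10 MB, and query here stands in for the SQL string above.)

// Inspect the current broadcast threshold (in bytes; defaults to 10 MB).
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

// Raise it if a table you expect to be broadcast is just over the limit
// (100 MB is an arbitrary illustrative value).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)

// Re-check the physical plan: look for BroadcastHashJoin rather than SortMergeJoin.
spark.sql(query).explain()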
I would also note that your INNER JOIN against the random_distribution looks suspect. I may have recreated your schema wrong, but when I did explain I got this:
scala> spark.sql(sql).explain
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
LocalRelation [party_id#99]
and
LocalRelation [random_val#117]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
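(If the cartesian product between identity and random_distribution is intentional, as it is here since it is how B gets replicated 30 times, the fix is to make it explicit. A sketch, reusing the table names from the question; the exact column list for B is an assumption based on what the outer query references.)

// Option 1: spell the cartesian product out with CROSS JOIN syntax.
val bSubquery = """
  SELECT B1.party_id, B1.master_id, B1.is_matched, B2.random_val
  FROM identity B1
  CROSS JOIN (SELECT random_val FROM random_distribution) B2
"""

// Option 2: allow implicit cartesian products globally (Spark 2.x setting).
spark.conf.set("spark.sql.crossJoin.enabled", true)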
Finally, is your input data compressed? You may be seeing the size differences because of a combination of your data no longer being compressed and the way it is being serialized.
Source: https://stackoverflow.com/questions/50967218/spark-sql-query-causing-huge-data-shuffle-read-write