Question
I am performing a left join between two tables of 1.3 billion records each. However, the join key is null in table1 for roughly 600 million records, so all of the null-key records get allocated to a single task. The resulting data skew makes that one task run for hours.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("report").enableHiveSupport().getOrCreate()

tbl1 = spark.sql("""select a.col1, b.col2, a.col3
                    from table1 a
                    left join table2 b on a.col1 = b.col2""")
tbl1.write.mode("overwrite").saveAsTable("db.tbl3")
There are no other join conditions and this is the only join key available. Is there any way to make Spark distribute these NULL records across different tasks instead of one, or is there another approach?
Answer 1:
There is an excellent answer by @Mikhail Dubkov that solves exactly that.
I modified it slightly to resolve the following exception:
org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.;
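The ambiguity presumably comes from both sides of the join ending up with a column of the same name, which an unqualified reference can no longer resolve. A tiny PySpark illustration of the same error, with made-up frames; the postfixed temporary column in the modified version below sidesteps exactly this clash:

left = spark.createDataFrame([(1, "a")], ["id", "x"])
right = spark.createDataFrame([(1, "b")], ["id", "y"])
joined = left.join(right, left["id"] == right["id"], "left")
# joined.select("id")        # AnalysisException: Reference 'id' is ambiguous, could be: id, id.
joined.select(left["id"]).show()   # qualifying the reference (or renaming the column) resolves it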
Here is an example
Create tables:
import spark.implicits._   // needed for .toDF() on a List of case classes outside the spark-shell

case class Country(country_id: String, country_name: String)
case class Location(location_id: Int, street_address: String, city: String, country_id: String)
val countries: DataFrame = List(
Country("CN", "China"),
Country("UK", "United Kingdom"),
Country("US", "United States of America"),
Country(null, "Unknown 1"),
Country(null, "Unknown 2"),
Country(null, "Unknown 3"),
Country(null, "Unknown 4"),
Country(null, "Unknown 5"),
Country(null, "Unknown 6")
).toDF()
val locations = List(
Location(1400, "2014 Jabberwocky Rd", "Southlake", "US"),
Location(1500, "2011 Interiors Blvd", "San Francisco", "US"),
Location(1700, "2004 Charade Rd", "Seattle", "US"),
Location(2400, "8204 Arthur St", "London", "UK"),
Location(2500, "Magdalen Centre, The Oxford Science Park", "Oxford", "UK"),
Location(0, "Null Street", "Null City", null)
).toDF()
Join:
import SkewedDataFrameExt
val skewedSafeJoin = countries
  .nullSkewLeftJoin(locations, "country_id")

skewedSafeJoin.show(false)
+----------+------------------------+------------------------+-----------+----------------------------------------+-------------+----------+
|country_id|country_name |country_id_skewed_column|location_id|street_address |city |country_id|
+----------+------------------------+------------------------+-----------+----------------------------------------+-------------+----------+
|CN |China |CN |null |null |null |null |
|UK |United Kingdom |UK |2500 |Magdalen Centre, The Oxford Science Park|Oxford |UK |
|UK |United Kingdom |UK |2400 |8204 Arthur St |London |UK |
|US |United States of America|US |1700 |2004 Charade Rd |Seattle |US |
|US |United States of America|US |1500 |2011 Interiors Blvd |San Francisco|US |
|US |United States of America|US |1400 |2014 Jabberwocky Rd |Southlake |US |
|null |Unknown 1 |-9702 |null |null |null |null |
|null |Unknown 2 |-9689 |null |null |null |null |
|null |Unknown 3 |-815 |null |null |null |null |
|null |Unknown 4 |-7726 |null |null |null |null |
|null |Unknown 5 |-7826 |null |null |null |null |
|null |Unknown 6 |-8878 |null |null |null |null |
+----------+------------------------+------------------------+-----------+----------------------------------------+-------------+----------+
The other way I can see to implement this is to apply a custom hint and add a custom rule, though I don't know whether it is worth the effort. Tell me if this helps.
Modified nullSkewLeftJoin
// These methods live inside @Mikhail Dubkov's implicit SkewedDataFrameExt extension,
// which supplies `underlying` (the left-hand DataFrame) and `negativeRandomWithin`
// (a random negative salt used to spread the NULL keys across buckets).
def nullSkewLeftJoin(right: DataFrame,
                     usingColumn: String,
                     skewedColumnPostFix: String = "skewed_column",
                     nullNumBuckets: Int = 10000): DataFrame = {
  val left = underlying
  val leftColumn = left.col(usingColumn)
  val rightColumn = right.col(usingColumn)

  nullSkewLeftJoin(right, leftColumn, rightColumn, skewedColumnPostFix, nullNumBuckets)
}

def nullSkewLeftJoin(right: DataFrame,
                     joinLeftCol: Column,
                     joinRightCol: Column,
                     skewedColumnPostFix: String,
                     nullNumBuckets: Int): DataFrame = {
  // Name the salted key after the left join column plus a postfix so it never clashes
  // with the original column names (this is what avoids the ambiguous-reference error).
  val skewedTempColumn = s"${joinLeftCol.toString()}_$skewedColumnPostFix"

  if (underlying.columns.exists(_ equalsIgnoreCase skewedTempColumn)) {
    underlying.join(right.where(joinRightCol.isNotNull), col(skewedTempColumn) === joinRightCol, "left")
  } else {
    underlying
      .withColumn(skewedTempColumn,
        when(joinLeftCol.isNotNull, joinLeftCol).otherwise(negativeRandomWithin(nullNumBuckets)))
      .join(right.where(joinRightCol.isNotNull), col(skewedTempColumn) === joinRightCol, "left")
  }
}
} // closes the SkewedDataFrameExt extension
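Since the question itself uses PySpark, the same salting idea can also be written directly against the question's tables. A minimal sketch, assuming col1 is a non-negative numeric key so a negative salt can never collide with a real value; the bucket count of 10000 and the col1_salted name are only illustrative:

from pyspark.sql import functions as F

a = spark.table("table1")
b = spark.table("table2").where(F.col("col2").isNotNull())

# NULL keys can never match anything, so replacing them with random negative salts
# only spreads those rows across partitions; the left-join result is unchanged.
salted = a.withColumn(
    "col1_salted",
    F.when(F.col("col1").isNotNull(), F.col("col1"))
     .otherwise(-(F.rand() * 10000).cast("long") - 1))

result = (salted
          .join(b, salted["col1_salted"] == b["col2"], "left")
          .select(salted["col1"], b["col2"], salted["col3"]))

result.write.mode("overwrite").saveAsTable("db.tbl3")

The isNotNull filter on the right side mirrors what nullSkewLeftJoin does above.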
And again, all thanks to @Mikhail Dubkov.
Source: https://stackoverflow.com/questions/57797559/spark-sql-1-task-running-for-long-time-due-to-null-values-is-join-key