Spark Dataset/DataFrame join NULL skew key
Question

While working with Spark Dataset/DataFrame joins, I ran into long-running jobs and jobs that failed with OOM errors.

Here is the input:

- ~10 datasets of different sizes, mostly huge (>1 TB)
- all left-joined to one base dataset
- some of the join keys are null

After some analysis, I found that the cause of the failed and slow jobs is a null skew key: the left side has millions of records whose join key is null. I took a brute-force approach to solve this issue, and I want to share it here. If you have better or any built-in solutions (for