For example, I have two RDDs in PySpark:
((0,0), 1)
((0,1), 2)
((1,0), 3)
((1,1), 4)
and the second is just:
((0,1), 3)
((1,1), 0)
Perhaps we shouldn't think of this process as a join. You're not really looking to join two datasets; you're looking to subtract one dataset from the other, right?
I'm going to state what I'm assuming from your question: you want the rows of your first RDD whose keys also appear in the second RDD, keeping the values from the first RDD.
Idea 1: Cogroup (I think this is probably the fastest way). It basically computes the intersection of the two datasets by key.
rdd1 = sc.parallelize([((0,0), 1), ((0,1), 2), ((1,0), 3), ((1,1), 4)])
rdd2 = sc.parallelize([((0,1), 3), ((1,1), 0)])
# Keep only the keys that appear in both RDDs (i.e. both grouped iterables are non-empty).
intersection = rdd1.cogroup(rdd2).filter(lambda x: x[1][0] and x[1][1])
# For each surviving key, take the value that came from rdd1.
final_rdd = intersection.map(lambda x: (x[0], list(x[1][0])[0]))
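As a quick sanity check against your sample data (the ordering of collect() isn't guaranteed, so it may come back rearranged):
final_rdd.collect()
# [((0, 1), 2), ((1, 1), 4)]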
Idea 2: Subtract By Key
rdd1 = sc.parallelize([((0,0), 1), ((0,1), 2), ((1,0), 3), ((1,1), 4)])
rdd2 = sc.parallelize([((0,1), 3), ((1,1), 0)])
# Rows of rdd1 whose key does NOT appear in rdd2...
unwanted_rows = rdd1.subtractByKey(rdd2)
# ...subtracted from rdd1 again, leaving the rows whose key DOES appear in rdd2.
wanted_rows = rdd1.subtractByKey(unwanted_rows)
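Checked against the same sample data, this should hold the same rows as Idea 1, though not necessarily in the same order:
wanted_rows.collect()
# e.g. [((1, 1), 4), ((0, 1), 2)]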
I'm not 100% sure whether this is faster than your method. It does require two subtractByKey operations, which can be slow. Also, this method does not preserve order (e.g. ((0, 1), 2), despite coming first among the wanted rows in your first dataset, ends up second in the final dataset), but I can't imagine that matters.
As to which is faster, I think it depends on how long your cartesian join takes. Mapping and filtering tend to be faster than the shuffle operations needed for subtractByKey, but of course cartesian is a time-consuming process.
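For reference, I'm picturing your cartesian approach as something roughly like the following (my own reconstruction, not your actual code):
# Pair every row of rdd1 with every row of rdd2, keep the pairs whose keys match,
# then keep only the row that came from rdd1.
cartesian_rows = (rdd1.cartesian(rdd2)
                      .filter(lambda pair: pair[0][0] == pair[1][0])
                      .map(lambda pair: pair[0]))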
Anyway, I figure you can try these ideas out and see if one works for you!
A side note on performance, depending on how large your RDDs are: if rdd1 is small enough to be held in main memory, the subtraction process can be sped up immensely by broadcasting it and then streaming rdd2 against it. However, I acknowledge that this is rarely the case.
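In case it helps, here's a rough sketch of what I mean by that broadcast-and-stream approach, assuming rdd1 really does fit on the driver (the variable names are mine):
# Collect rdd1 into a plain dict on the driver and broadcast it to every executor.
rdd1_map = sc.broadcast(dict(rdd1.collect()))
# Stream rdd2 against the broadcast dict: keep the keys rdd1 also has,
# emitting rdd1's value for each. No shuffle is needed.
wanted_rows = (rdd2.filter(lambda kv: kv[0] in rdd1_map.value)
                   .map(lambda kv: (kv[0], rdd1_map.value[kv[0]])))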