I have an RDD of (key,value) elements. The keys are NumPy arrays. NumPy arrays are not hashable, and this causes a problem when I try to do a reduceByKey
operation.
The simplest solution I can see is to convert the arrays to a hashable type such as a tuple. For example:
from operator import add
reduced = sc.parallelize(data).map(
    lambda x: (tuple(x), x.sum())
).reduceByKey(add)
and convert it back later if needed.
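Converting back is just another map. A minimal sketch, assuming the original keys were 1-D arrays so that np.asarray restores them from the tuples:

import numpy as np

# Hypothetical follow-up step: rebuild NumPy array keys from the tuple keys
restored = reduced.map(lambda kv: (np.asarray(kv[0]), kv[1]))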
Is there a way to supply Spark with my own hash function instead?
Not a straightforward one. The whole mechanism depends on objects implementing a __hash__
method, and C extensions cannot be monkey-patched. You could try to use dispatching to override pyspark.rdd.portable_hash,
but I doubt it is worth it, even when you account for the cost of the conversions.
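For illustration only, a minimal sketch of what such an override could look like; the wrapper name ndarray_portable_hash is made up, and because other parts of the machinery still expect keys with a working __hash__, this alone is unlikely to be enough:

import numpy as np
import pyspark.rdd

_default_portable_hash = pyspark.rdd.portable_hash

def ndarray_portable_hash(obj):
    # Hash NumPy arrays via a tuple of their contents,
    # delegate everything else to the original implementation.
    if isinstance(obj, np.ndarray):
        return _default_portable_hash(tuple(obj.ravel()))
    return _default_portable_hash(obj)

pyspark.rdd.portable_hash = ndarray_portable_hash

Keep in mind that portable_hash is bound as the default partitionFunc of methods like reduceByKey when the module is imported, so a patch applied afterwards would likely have to be passed explicitly as partitionFunc, and it would also have to be in effect on the executors, not just the driver.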