Spark: How to “reduceByKey” when the keys are numpy arrays which are not hashable?

旧巷少年郎 2021-01-22 06:28

I have an RDD of (key,value) elements. The keys are NumPy arrays. NumPy arrays are not hashable, and this causes a problem when I try to do a reduceByKey operation.
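
A minimal sketch that reproduces the problem (the sample arrays and the local SparkContext here are just placeholders):

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Keys are NumPy arrays, values are plain numbers.
    data = [np.array([1, 2]), np.array([1, 2]), np.array([3, 4])]
    pairs = sc.parallelize(data).map(lambda x: (x, x.sum()))

    # The aggregation stores keys in a dict, so this raises
    # TypeError: unhashable type: 'numpy.ndarray' once the action runs:
    # pairs.reduceByKey(lambda a, b: a + b).collect()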

1 Answer
  • 2021-01-22 06:55

    The simplest solution is to convert the key to an object that is hashable, for example a tuple:

    from operator import add

    # Use a hashable tuple as the key and the array's sum as the value.
    reduced = sc.parallelize(data).map(
        lambda x: (tuple(x), x.sum())
    ).reduceByKey(add)
    

    and convert it back later if needed.
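
    If the original array keys are needed downstream, a minimal sketch of the reverse conversion (assuming the reduced RDD from the snippet above; the restored name is just for illustration):

    import numpy as np

    # Rebuild NumPy array keys from the tuple keys after the reduce.
    restored = reduced.map(lambda kv: (np.array(kv[0]), kv[1]))
    restored.collect()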

    "Is there a way to supply the Spark context with my manual hash function?"

    Not a straightforward one. The whole mechanism depends on the fact that an object implements a __hash__ method, and C extension types cannot be monkey patched. You could try to use dispatching to override pyspark.rdd.portable_hash, but I doubt it is worth it even when you consider the cost of the conversions.
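
    For illustration, a small local sketch (no Spark needed) of why patching NumPy itself is not an option; the exact error messages can vary between versions:

    import numpy as np

    a = np.array([1, 2, 3])

    try:
        hash(a)  # ndarray sets __hash__ to None, so this fails
    except TypeError as e:
        print(e)  # unhashable type: 'numpy.ndarray'

    try:
        # ndarray is a C extension type, so its slots cannot be monkey patched
        np.ndarray.__hash__ = lambda self: hash(self.tobytes())
    except TypeError as e:
        print(e)  # e.g. cannot set '__hash__' attribute of immutable type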
