Spark: How to “reduceByKey” when the keys are numpy arrays which are not hashable?

旧巷少年郎 2021-01-22 06:28

I have an RDD of (key,value) elements. The keys are NumPy arrays. NumPy arrays are not hashable, and this causes a problem when I try to do a reduceByKey operation.
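
A minimal sketch that reproduces the problem (the sample arrays and the local SparkContext here are just placeholders):

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Keys are NumPy arrays, values are plain numbers.
    data = [np.array([1, 2]), np.array([1, 2]), np.array([3, 4])]
    pairs = sc.parallelize(data).map(lambda x: (x, x.sum()))

    # The aggregation stores keys in a dict, so this raises
    # TypeError: unhashable type: 'numpy.ndarray' once the action runs:
    # pairs.reduceByKey(lambda a, b: a + b).collect()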

1 Answer
  • 2021-01-22 06:55

    The simplest solution is to convert the key to an object that is hashable, for example a tuple:

    from operator import add

    # Use a hashable tuple as the key and the array's sum as the value.
    reduced = sc.parallelize(data).map(
        lambda x: (tuple(x), x.sum())
    ).reduceByKey(add)
    

    and convert it back later if needed.
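
    If the original array keys are needed downstream, a minimal sketch of the reverse conversion (assuming the reduced RDD from the snippet above; the restored name is just for illustration):

    import numpy as np

    # Rebuild NumPy array keys from the tuple keys after the reduce.
    restored = reduced.map(lambda kv: (np.array(kv[0]), kv[1]))
    restored.collect()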

    "Is there a way to supply the Spark context with my manual hash function?"

    Not a straightforward one. The whole mechanism depends on the fact that an object implements a __hash__ method, and C extension types cannot be monkey patched. You could try to use dispatching to override pyspark.rdd.portable_hash, but I doubt it is worth it even when you consider the cost of the conversions.
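
    For illustration, a small local sketch (no Spark needed) of why patching NumPy itself is not an option; the exact error messages can vary between versions:

    import numpy as np

    a = np.array([1, 2, 3])

    try:
        hash(a)  # ndarray sets __hash__ to None, so this fails
    except TypeError as e:
        print(e)  # unhashable type: 'numpy.ndarray'

    try:
        # ndarray is a C extension type, so its slots cannot be monkey patched
        np.ndarray.__hash__ = lambda self: hash(self.tobytes())
    except TypeError as e:
        print(e)  # e.g. cannot set '__hash__' attribute of immutable type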
