A list as a key for PySpark's reduceByKey

Asked by 名媛妹妹 on 2020-11-29 11:31

I am attempting to call the reduceByKey function of pyspark on data of the format ([a, b, c], 1), ([a, b, c], 1), ([a, d, b, e], 1), ...

It seems pyspark will not accept a list as the key.
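
A minimal sketch of the failing call (assuming a running SparkContext named sc):

    rdd = sc.parallelize([(["a", "b", "c"], 1),
                          (["a", "b", "c"], 1),
                          (["a", "d", "b", "e"], 1)])
    # Fails because list keys cannot be hashed for the shuffle:
    rdd.reduceByKey(lambda x, y: x + y).collect()
    # TypeError: unhashable type: 'list'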

1 Answer
  • 2020-11-29 12:02

    Try this:

    # Python 3 lambdas cannot unpack tuples, so index the pair explicitly.
    rdd.map(lambda kv: (tuple(kv[0]), kv[1])).groupByKey()
    
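
    The same key conversion also unblocks the reduceByKey call from the question; a sketch, assuming rdd holds the ([a, b, c], 1)-style pairs shown above:

    counts = (rdd
              .map(lambda kv: (tuple(kv[0]), kv[1]))  # lists -> hashable tuples
              .reduceByKey(lambda x, y: x + y))       # sum the counts per key
    counts.collect()
    # [(('a', 'b', 'c'), 2), (('a', 'd', 'b', 'e'), 1)]  (order may vary)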

    Since Python lists are mutable, they cannot be hashed (they don't provide a __hash__ method):

    >>> a_list = [1, 2, 3]
    >>> a_list.__hash__ is None
    True
    >>> hash(a_list)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unhashable type: 'list'
    

    Tuples, on the other hand, are immutable and provide a __hash__ method implementation:

    >>> a_tuple = (1, 2, 3)
    >>> a_tuple.__hash__ is None
    False
    >>> hash(a_tuple)
    2528502973977326415
    

    and hence can be used as a key. Similarly, if you want to use unique values as a key, you should use a frozenset:

    rdd.map(lambda kv: (frozenset(kv[0]), kv[1])).groupByKey().collect()
    
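
    Note that a frozenset ignores element order and duplicates, so keys built from [a, b, c] and [c, b, a] collapse into one, whereas tuple keys stay distinct. A quick illustration in plain Python:

    >>> frozenset([1, 2, 3]) == frozenset([3, 2, 1, 1])
    True
    >>> tuple([1, 2, 3]) == tuple([3, 2, 1])
    False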

    instead of a set:

    # This will fail with TypeError: unhashable type: 'set'
    rdd.map(lambda kv: (set(kv[0]), kv[1])).groupByKey().collect()
    