I am attempting to call the reduceByKey function of pyspark on data of the format (([a,b,c], 1), ([a,b,c], 1), ([a,d,b,e], 1), ...
It seems pyspark will not accept a list as the key and fails with TypeError: unhashable type: 'list'.
Try this:
rdd.map(lambda kv: (tuple(kv[0]), kv[1])).groupByKey()
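Applied to your data, a minimal sketch (assuming a running SparkContext named sc, string elements in the keys, and that you want to sum the counts):

data = [(["a", "b", "c"], 1), (["a", "b", "c"], 1), (["a", "d", "b", "e"], 1)]
rdd = sc.parallelize(data)
# Turn each list key into a tuple so it becomes hashable, then aggregate:
rdd.map(lambda kv: (tuple(kv[0]), kv[1])).reduceByKey(lambda x, y: x + y).collect()
# e.g. [(('a', 'b', 'c'), 2), (('a', 'd', 'b', 'e'), 1)] (order may differ)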
Since Python lists are mutable, they cannot be hashed (they don't provide a __hash__ method):
>>> a_list = [1, 2, 3]
>>> a_list.__hash__ is None
True
>>> hash(a_list)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
Tuples, on the other hand, are immutable and provide a __hash__ method implementation:
>>> a_tuple = (1, 2, 3)
>>> a_tuple.__hash__ is None
False
>>> hash(a_tuple)
2528502973977326415
hence they can be used as keys. Similarly, if you want to use unique values as a key, you should use frozenset:
rdd.map(lambda kv: (frozenset(kv[0]), kv[1])).groupByKey().collect()
instead of set:
# This will fail with TypeError: unhashable type: 'set'
rdd.map(lambda kv: (set(kv[0]), kv[1])).groupByKey().collect()
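For illustration, the same __hash__ check shows the difference, and frozensets built from the same unique elements compare equal regardless of element order or duplicates, which is what makes them suitable for the "unique values as a key" case:

>>> set([1, 2, 3]).__hash__ is None
True
>>> frozenset([1, 2, 3]).__hash__ is None
False
>>> frozenset([1, 2, 2, 3]) == frozenset([3, 1, 2])
True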