I'm sure this is something very simple but I didn't find anything related to this.
My code is simple:
...
stream = stream.map(mapper)
stream = stream.reduceByKey(reducer)
Alternatively, stream.groupByKey().mapValues(lambda x: list(x)).collect() gives:
key1 [value1]
key2 [value2, value3]
key3 [value4, value5, value6]
The problem here is your reduce function. For each key, reduceByKey calls your reduce function with pairs of values and expects it to produce combined values of the same type.

For example, say that I wanted to perform a word count operation. First, I can map each word to a (word, 1) pair, then I can reduceByKey(lambda x, y: x + y) to sum up the counts for each word. At the end, I'm left with an RDD of (word, count) pairs.
Here's an example from the PySpark API Documentation:
>>> from operator import add
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.reduceByKey(add).collect())
[('a', 2), ('b', 1)]
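If it helps to see that same pattern end to end starting from raw text, here's a minimal sketch (the sample lines are made up; skip the SparkContext setup if you're in the pyspark shell, where sc already exists):

from pyspark import SparkContext

sc = SparkContext("local", "wordcount-example")  # not needed inside the pyspark shell

# a couple of made-up input lines, just for illustration
lines = sc.parallelize(["the cat sat", "the cat ran"])

counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # map each word to a (word, 1) pair
               .reduceByKey(lambda x, y: x + y))     # sum the counts for each word

print(sorted(counts.collect()))
# [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]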
To understand why your example didn't work, you can imagine the reduce function being applied something like this:
reduce(reduce(reduce(firstValue, secondValue), thirdValue), fourthValue) ...
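Your reducer isn't shown, but if it was building lists with something like lambda x, y: [x, y] (just a guess for illustration), that nested application produces nested lists rather than one flat list per key. You can see the same effect with plain Python's reduce:

from functools import reduce

reducer = lambda x, y: [x, y]  # hypothetical reducer, assumed only for illustration
values = ["value4", "value5", "value6"]

print(reduce(reducer, values))
# [['value4', 'value5'], 'value6']  -- nested, not the flat list you wanted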
Based on your reduce function, it sounds like you might be trying to implement the built-in groupByKey operation, which groups each key with a list of its values.
Also, take a look at combineByKey, a generalization of reduceByKey() that allows the reduce function's input and output types to differ (reduceByKey is implemented in terms of combineByKey).
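For instance, collecting the values for each key into a list with combineByKey might look like the sketch below (for this particular case groupByKey is simpler, but it shows how the input value type, a string, can differ from the combined type, a list):

pairs = sc.parallelize([("key1", "value1"), ("key2", "value2"), ("key2", "value3")])  # assumes an existing SparkContext sc, as above

grouped = pairs.combineByKey(
    lambda v: [v],             # createCombiner: start a new list for a key
    lambda acc, v: acc + [v],  # mergeValue: append a value within a partition
    lambda a, b: a + b)        # mergeCombiners: concatenate lists across partitions

print(sorted(grouped.collect()))
# [('key1', ['value1']), ('key2', ['value2', 'value3'])]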